cmoussa1 opened 2 months ago
I think the most complete solution might be to require both a cores and nodes limit for jobs, and if either is exceeded the job is rejected. This is what we ended up doing with the flux-core policy limits. This is mentioned in a note in flux-config-policy(5):
> NOTE: Limit checks take place before the scheduler sees the request, so it is possible to bypass a node limit by requesting only cores, or the core limit by requesting only nodes (exclusively) since this part of the system does not have detailed resource information. Generally node and core limits should be configured in tandem to be effective on resource sets with uniform cores per node. Flux does not yet have a solution for node/core limits on heterogeneous resources.
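For reference, if I'm reading the flux-config-policy(5) schema correctly, configuring node and core limits "in tandem" as the note suggests would look something like this (the specific values here are made up for illustration):

```toml
# Hypothetical example: reject jobs requesting more than 8 nodes
# or more than 128 cores. On a uniform 16-cores-per-node cluster,
# setting both limits together closes the "cores-only" / "nodes-only"
# bypass described in the note above.
[policy.limits.job-size.max]
nnodes = 8
ncores = 128
```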
OK, that might be an OK start. Are you thinking the limit would be represented like `resources = nnodes + ncores`, or something different? And if a job might exceed a `max_resources`, hold the job?
I was thinking you'd check both values and if either exceeded the configured limit then the job is rejected. If you can't tell how much of a resource is in the jobspec, then just skip that test. That way you are always checking at least one limit.
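That "check what you can, skip what you can't" logic could be sketched roughly as follows. This is a hypothetical illustration, not flux-core's actual implementation; `None` here stands in for "this resource type can't be derived from the jobspec":

```python
def check_job_size(nnodes, ncores, max_nnodes, max_ncores):
    """Reject a job if any limit we *can* evaluate is exceeded.

    nnodes/ncores come from the jobspec and may be None when the
    jobspec doesn't express that resource type; in that case the
    corresponding test is skipped, so at least one limit is always
    checked.
    """
    if nnodes is not None and nnodes > max_nnodes:
        return "rejected: node limit exceeded"
    if ncores is not None and ncores > max_ncores:
        return "rejected: core limit exceeded"
    return "accepted"

# A cores-only request bypasses the node check but still hits the core limit:
print(check_job_size(None, 256, max_nnodes=8, max_ncores=128))
# A nodes-only (exclusive) request is still caught by the node limit:
print(check_job_size(16, None, max_nnodes=8, max_ncores=128))
# A request within both limits is accepted:
print(check_job_size(4, 64, max_nnodes=8, max_ncores=128))
```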
I think the goal (at least for accounting) here is to be able to enforce a resource limit across all of a user's running jobs. If we go with the above, and a job would exceed either limit (`ncores` or `nnodes`), I believe the job should be held until the user goes back under their limit. @ryanday36 should correct me if I am wrong, however.
But maybe we could just add a `max_ncores` limit to all user rows and check both like you mentioned?
The goal is to add up the resource usage of all of a user's running jobs and prevent them from starting a new job if their resource usage would exceed some limit. The approach that Mark is suggesting sounds like it would work in most cases. In principle a user could exceed the limit by submitting some jobs that only specify nodes and others that only specify cores, but that would probably be rare in practice. Another place where these limits wouldn't work is jobs that specify a number of cores that isn't an even multiple of the cores per node on node-exclusive clusters, as it sounds like those jobs will effectively reserve more cores than the `ncores` in the jobspec. Once again though, I don't know how common that scenario actually is in practice. The most common case for this is probably `-n 1`, and we could pick that up easily by always assuming a running job is using at least one node, but I could also see someone submitting a bunch of `-n 40` jobs on a cluster with 36 cores per node and getting many more jobs than the limit because those aren't being counted as using 72 cores.
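The `-n 40` scenario can be made concrete with a little arithmetic: on a node-exclusive cluster, a job's effective core reservation is the request rounded up to a whole-node boundary. A hypothetical sketch (the 36-cores-per-node figure is just Ryan's example, and this assumes a uniform cores-per-node count):

```python
import math

def reserved_cores(ncores_requested, cores_per_node):
    """On a node-exclusive cluster a job reserves whole nodes, so its
    effective core usage is the request rounded up to a node boundary."""
    nnodes = math.ceil(ncores_requested / cores_per_node)
    return nnodes * cores_per_node

# Ryan's example: -n 40 on a 36-core-per-node cluster spans 2 nodes,
# effectively reserving 72 cores even though the jobspec says 40.
print(reserved_cores(40, 36))  # 72
# The common -n 1 case still reserves a full node's worth of cores:
print(reserved_cores(1, 36))   # 36
```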
Very good points @ryanday36. @cmoussa1 my apologies, I had lost sight of the overall goal for the accounting limits. I think you're headed in the right direction.
No problem @grondo - I probably should've given more background as to why the limit needed to be there in the first place. So it sounds like we should keep separate counts of both `ncores` and `nnodes` across a user's set of running jobs?
This is mainly why I asked if there was a function to gather total node/core counts on a system with `flux resource info`. 🙂 With this, I could at least estimate a cores-per-node count for that system, and when a user only specifies cores, it could be converted to a rough `nnodes` count. I understand this might not be entirely accurate, especially for systems where there is not a uniform cores-per-node count across all nodes, but perhaps it's an okay start? Sorry if I am still misunderstanding.
(actually, now that I think about it, if the above sounds okay, then I'm not sure keeping track of `ncores` across a user's set of running jobs is entirely necessary, since we would be converting to `nnodes`)
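The cores-to-nodes conversion being discussed could be sketched like this (hypothetical helper, not an existing API; the cores-per-node value would come from the `flux resource info`-style estimate, so on non-uniform systems the result is only a rough guess):

```python
import math

def estimate_nnodes(ncores, cores_per_node):
    """Rough nnodes estimate for a cores-only jobspec, assuming a
    uniform cores-per-node count across the whole system."""
    return math.ceil(ncores / cores_per_node)

print(estimate_nnodes(40, 36))  # 2 (spills onto a second node)
print(estimate_nnodes(36, 36))  # 1 (exactly one node)
```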
just to continue to be a pain here, if we're going to convert things, it probably makes more sense to convert to `ncores`, because that will work for both node-exclusive and non-exclusive clusters (assuming we can tell if a cluster has a nodex match-policy and we can properly convert to the actual number of cores reserved for a given `-n` on nodex clusters).
To answer your original question, you can get access to the resources in an instance by fetching `resource.R` from the KVS. You'll have to parse the result yourself though; we don't currently export an API to do that (though we've talked about it). The format for R is described in RFC 20.
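As a rough illustration of what parsing `resource.R` involves, here is a sketch that walks the RFC 20 `R_lite` layout on a canned example. In a live instance you'd fetch the JSON from the KVS (e.g. via the Python bindings' KVS interface) rather than hard-coding it, and a real idset parser would need to handle the full RFC 22 grammar; this one only covers the simple `a` and `a-b` comma-separated forms:

```python
import json

def idset_count(s):
    """Count members of a simple RFC 22-style idset like "0-3,7"
    (plain ids and ranges only; no brackets or other extensions)."""
    count = 0
    for part in s.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            count += int(hi) - int(lo) + 1
        else:
            count += 1
    return count

# Canned R object in the RFC 20 version 1 format: ranks 0-3, each with
# cores 0-35 (i.e. 4 nodes at 36 cores per node).
R = json.loads("""
{
  "version": 1,
  "execution": {
    "R_lite": [
      {"rank": "0-3", "children": {"core": "0-35"}}
    ]
  }
}
""")

total_nodes = 0
total_cores = 0
for entry in R["execution"]["R_lite"]:
    nnodes = idset_count(entry["rank"])
    cores_per_node = idset_count(entry["children"]["core"])
    total_nodes += nnodes
    total_cores += nnodes * cores_per_node

print(total_nodes, total_cores)  # 4 144
```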
Thanks for the advice here. After some time playing around with this I think I was able to get somewhere. I've opened a PR over in flux-framework/flux-accounting#469 that proposes adding some work during plugin initialization where it tries to at least estimate the cores-per-node on the system it's loaded on by fetching `resource.R`. This could be a start to actually keeping track of/estimating a job's resources later on. Let me know what you think.
I'm trying to make some progress on flux-framework/flux-accounting#349 and, as a start, am just trying to see if I can reliably count or estimate the number of nodes used by a job.
If a user does not specify the number of nodes for their job, jobspec will report `0` for `nnodes`. If we know how many cores-per-node there are on a system (particularly a node-exclusive one), however, we might be able to just count the number of cores reported by jobspec and convert this to `nnodes` to use for accounting. This probably won't work for systems that do not have the same number of cores-per-node across all of their nodes.

Another potential option comes from an offline conversation with @ryanday36:
Although I don't believe this is the case in jobspec, perhaps there is somewhere else where we could query and store this information to be used.