flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

is there a way to count/estimate the number of cores-per-node on a system instance? #6091

Open cmoussa1 opened 2 months ago

cmoussa1 commented 2 months ago

I'm trying to make some progress on flux-framework/flux-accounting#349 and, as a start, am just trying to see if I can reliably count or estimate the number of nodes used by a job.

If a user does not specify the number of nodes for their job, jobspec will report 0 for nnodes. If we know how many cores-per-node there are on a system (particularly a node-exclusive one), however, we might be able to just count the number of cores reported by jobspec and convert that to an nnodes value for accounting. This probably won't work for systems that do not have a uniform cores-per-node count across all of their nodes.
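
A rough sketch of that conversion, assuming we know a uniform cores-per-node count for the system (names here are hypothetical):

    import math

    def estimate_nnodes(ncores, cores_per_node):
        # hypothetical helper: round up, since using any core on a node
        # occupies the whole node on a node-exclusive system
        return math.ceil(ncores / cores_per_node)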

Another potential option comes from an offline conversation with @ryanday36:

If ncores is getting set to the actual cores-per-node, we could just do a total-cores limit across all of a user's jobs.

Although I don't believe this is the case in jobspec, perhaps there is somewhere else we could query this information and store it to be used.

grondo commented 2 months ago

I think the most complete solution might be to require both a cores limit and a nodes limit for jobs, and to reject the job if either is exceeded. This is what we ended up doing with the flux-core policy limits. It is mentioned in a note in flux-config-policy(5):

   NOTE:
     Limit checks take place before the scheduler sees the request, so it
     is possible to bypass a node limit by requesting only cores, or the
     core limit by requesting only nodes (exclusively), since this part of
     the system does not have detailed resource information. Generally,
     node and core limits should be configured in tandem to be effective
     on resource sets with uniform cores per node. Flux does not yet
     have a solution for node/core limits on heterogeneous resources.
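
For reference, those tandem limits are configured in flux-core TOML along these lines (values are illustrative; see flux-config-policy(5) for details):

    [policy.limits.job-size.max]
    nnodes = 8
    ncores = 512
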
cmoussa1 commented 2 months ago

OK, that might be a reasonable start. Are you thinking the limit would be represented like:

resources = nnodes + ncores

or something different? And if a job might exceed a max_resources limit, hold the job?

grondo commented 2 months ago

I was thinking you'd check both values, and if either exceeds its configured limit, the job is rejected. If you can't tell how much of a resource is requested in the jobspec, just skip that test. That way you are always checking at least one limit.
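
In other words, a minimal sketch of that check (all names hypothetical; a value of 0 means the jobspec didn't specify that resource):

    def job_size_ok(nnodes, ncores, max_nnodes, max_ncores):
        # skip a test when the jobspec doesn't specify that resource
        if nnodes and nnodes > max_nnodes:
            return False
        if ncores and ncores > max_ncores:
            return False
        return True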

cmoussa1 commented 2 months ago

I think the goal (at least for accounting) here is to be able to enforce a resource limit across all of a user's running jobs. If we go with the above and a job would exceed either limit (ncores or nnodes), I believe the job should be held until the user goes back under their limit. @ryanday36 should correct me if I am wrong, however.

But maybe we could just add a max_ncores limit to all user rows and check both, like you mentioned?

ryanday36 commented 2 months ago

The goal is to add up the resource usage of all of a user's running jobs and prevent them from starting a new job if their resource usage would exceed some limit. The approach that Mark is suggesting sounds like it would work in most cases. In principle a user could exceed the limit by submitting some jobs that only specify nodes and others that only specify cores, but that would probably be rare in practice.

Another place where these limits wouldn't work is jobs that specify a number of cores that isn't an even multiple of the cores-per-node on node-exclusive clusters, as it sounds like those jobs will effectively reserve more cores than the ncores in the jobspec. Once again, though, I don't know how common that scenario actually is in practice. The most common case is probably -n 1, which we could pick up easily by always assuming a running job uses at least one node. But I could also see someone submitting a bunch of -n 40 jobs on a cluster with 36 cores per node and getting many more jobs than the limit allows, because those jobs aren't being counted as using 72 cores.
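
A minimal sketch of that aggregate check, assuming we can collect (nnodes, ncores) pairs for a user's running jobs (all names hypothetical):

    def would_exceed(running, new_nnodes, new_ncores, max_nnodes, max_ncores):
        # running: list of (nnodes, ncores) tuples for the user's running
        # jobs; hold the new job if starting it would exceed either limit
        total_nnodes = sum(n for n, _ in running) + new_nnodes
        total_ncores = sum(c for _, c in running) + new_ncores
        return total_nnodes > max_nnodes or total_ncores > max_ncores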

grondo commented 2 months ago

Very good points, @ryanday36. @cmoussa1, my apologies, I had lost sight of the overall goal for the accounting limits. I think you're headed in the right direction.

cmoussa1 commented 2 months ago

No problem @grondo - I probably should've given more background as to why the limit needed to be there in the first place. So it sounds like we should keep separate counts of both ncores and nnodes across a user's set of running jobs?

This is mainly why I asked if there was a function to gather total node/core counts on a system with flux resource info. 🙂 With this, I could at least estimate a cores-per-node count for that system, and when a user only specifies cores, it could be converted to a rough nnodes count. I understand this might not be entirely accurate, especially for systems where there is not a uniform cores-per-node count across all nodes, but perhaps it's an okay start? Sorry if I am still misunderstanding.

(actually, now that I think about it, if the above sounds okay, then I'm not sure keeping track of ncores across a user's set of running jobs is entirely necessary since we would be converting to nnodes)

ryanday36 commented 2 months ago

Just to continue to be a pain here: if we're going to convert things, it probably makes more sense to convert to ncores, because that will work for both node-exclusive and non-exclusive clusters (assuming we can tell whether a cluster has a nodex match-policy, and that we can properly convert to the actual number of cores reserved for a given -n on nodex clusters).
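
A sketch of that normalization, assuming a uniform cores-per-node count and that we can detect the match policy (all names hypothetical):

    import math

    def effective_ncores(nnodes, ncores, cores_per_node, node_exclusive):
        if ncores == 0:
            # nodes-only request: charge all cores on those nodes
            return nnodes * cores_per_node
        if node_exclusive:
            # round up to whole nodes, e.g. -n 40 on 36-core nodes -> 72
            return math.ceil(ncores / cores_per_node) * cores_per_node
        return ncores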

grondo commented 2 months ago

To answer your original question, you can get access to the resources in an instance by fetching resource.R from the KVS. You'll have to parse the result yourself, though; we don't currently export an API for that (though we've talked about it). The format for R is described in RFC 20.
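
With the Python bindings, a sketch of that fetch-and-parse might look like this (idset_count is a hand-rolled helper for RFC 22 idset strings, since no parsing API is exported):

    import flux
    import flux.kvs

    def idset_count(s):
        # count members of an idset string, e.g. "0-3,7" -> 5
        n = 0
        for part in s.split(","):
            if "-" in part:
                lo, hi = part.split("-")
                n += int(hi) - int(lo) + 1
            else:
                n += 1
        return n

    h = flux.Flux()
    R = flux.kvs.get(h, "resource.R")  # JSON-decoded RFC 20 object
    for entry in R["execution"]["R_lite"]:
        nnodes = idset_count(entry["rank"])
        cores_per_node = idset_count(entry["children"]["core"])
        print(f"{nnodes} node(s) with {cores_per_node} core(s) each")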

cmoussa1 commented 2 months ago

Thanks for the advice here. After some time playing around with this, I think I was able to get somewhere. I've opened a PR over in flux-framework/flux-accounting#469 that proposes adding some work during plugin initialization: it tries to at least estimate the cores-per-node on the system it's loaded on by fetching resource.R. This could be a start to actually tracking and estimating jobs' resources later on. Let me know what you think.