flux-framework / flux-accounting

bank/accounting interface for the Flux resource manager
https://flux-framework.readthedocs.io/projects/flux-accounting/en/latest/index.html
GNU Lesser General Public License v3.0

[WIP] plugin: add max nodes limit support #444

Closed cmoussa1 closed 3 months ago

cmoussa1 commented 7 months ago

Background

The priority plugin does not currently enforce a max nodes limit across an association's set of running jobs.


This PR is built on top of #442 and adds basic support for enforcing a max nodes limit on an association's set of running jobs. The implementation follows the same template as the max running jobs limit: when a job is in job.state.depend, the number of nodes the submitted job requests is extracted from jobspec and added to the association's current node count. If a job does not specify any nodes, the plugin assumes a node count of 1 for the job. If the job would put the association over its limit, a dependency is added to the job and it is held until a currently running job finishes and cleans up.
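To make the check above concrete, here is a minimal Python sketch of the decision made at job.state.depend. It is not the plugin's code (the plugin is not written in Python): the `Association` class, its field names, `node_count()`, `on_depend()`, and the simplified jobspec shape are all hypothetical, and for brevity the sketch updates the counters at depend time, which may not match the state in which the plugin actually does its bookkeeping.

```python
from dataclasses import dataclass, field


@dataclass
class Association:
    """Hypothetical per-association limits and counters (illustration only)."""
    max_run_jobs: int
    max_nodes: int
    cur_run_jobs: int = 0
    cur_nodes: int = 0
    held_jobs: list = field(default_factory=list)  # (job_id, nnodes) pairs


def node_count(jobspec: dict) -> int:
    """Return the node count requested in a simplified jobspec dict.

    Assumption: resources is a flat list of {"type": ..., "count": ...}
    entries. If no node entry exists, fall back to 1, as the PR describes.
    """
    for res in jobspec.get("resources", []):
        if res.get("type") == "node":
            return int(res.get("count", 1))
    return 1


def on_depend(assoc: Association, job_id: int, jobspec: dict) -> bool:
    """Return True if the job must be held, False if it may proceed."""
    nnodes = node_count(jobspec)
    over_jobs = assoc.cur_run_jobs + 1 > assoc.max_run_jobs
    over_nodes = assoc.cur_nodes + nnodes > assoc.max_nodes
    if over_jobs or over_nodes:
        # Either limit results in the job being held.
        assoc.held_jobs.append((job_id, nnodes))
        return True
    # Job fits under both limits: count it against the association.
    assoc.cur_run_jobs += 1
    assoc.cur_nodes += nnodes
    return False
```

With `max_nodes=2`, a first job requesting 2 nodes proceeds, and a second job that requests no nodes is counted as 1 node and held.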

To do this, I've proposed combining the two limits (max running jobs & max nodes) into one named dependency. So, if an association hits either their max running jobs or max node limit(s), the same dependency is added.

In the callback for job.state.inactive, the logic for releasing a job held due to a flux-accounting limit is slightly reworked. If an association is under its max running jobs limit, the first held job is grabbed and its node count is inspected, similar to how it is checked in job.state.depend. If releasing this held job would keep the association at or under its max nodes limit, the dependency is removed and the job can move on to being run. If the limit cannot be satisfied, the dependency is not removed and no held jobs are released until another one of the association's currently running jobs finishes and cleans up.
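The release logic described above can be sketched as follows. Again, this is an illustrative Python model rather than the plugin's implementation: `Association`, its fields, and `on_inactive()` are hypothetical names, and the held jobs are modeled as a simple FIFO queue of `(job_id, nnodes)` pairs.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Association:
    """Hypothetical per-association limits and counters (illustration only)."""
    max_run_jobs: int
    max_nodes: int
    cur_run_jobs: int = 0
    cur_nodes: int = 0
    held_jobs: deque = field(default_factory=deque)  # (job_id, nnodes) pairs


def on_inactive(assoc: Association, finished_nnodes: int) -> Optional[int]:
    """Account for a finished job, then try to release the first held job.

    Returns the released job's ID, or None if nothing was released.
    """
    # The finished job no longer counts against the association.
    assoc.cur_run_jobs -= 1
    assoc.cur_nodes -= finished_nnodes

    # Nothing to release, or still at the max running jobs limit.
    if not assoc.held_jobs or assoc.cur_run_jobs >= assoc.max_run_jobs:
        return None

    # Only the first held job is inspected.
    job_id, nnodes = assoc.held_jobs[0]
    if assoc.cur_nodes + nnodes > assoc.max_nodes:
        # Releasing it would exceed max nodes: keep every job held.
        return None

    # Fits under both limits: remove the dependency and count the job.
    assoc.held_jobs.popleft()
    assoc.cur_run_jobs += 1
    assoc.cur_nodes += nnodes
    return job_id
```

One consequence of inspecting only the first held job is that a smaller job further back in the queue stays held even if it would fit, until the job at the front can be released.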

A couple of basic tests are added to 1034-mf-priority-max-nodes.t to simulate submitting jobs that take up all of the association's node limit and having a job held due to their max nodes limit. Once the currently running job finishes, a test checks that the held job transitions to run.

TODO