Open adamdbertsch opened 1 month ago
Job submissions are serialized across a large system, and each job submission takes on the order of 10s to complete
There are multiple issues going on here which may make job submissions appear to be serialized:
hinodex and lonodex have known performance issues. We should ensure the system is using firstnodex.
If the firstnodex policy doesn't resolve the slow submission performance, we may want to disable the feasibility plugin until feasibility performance issues can be addressed by Fluxion developers.
Related #1001
It appears that the same is true for flux resource commands, and that they share the same "big lock" as flux job commands
There is no "big lock". However, the current version of flux-resource does need to query the scheduler for the scheduler state of resources. Since the scheduler is single-threaded, it can only do one thing at a time, so requests that reach it are handled one after another.
This was solved by flux-framework/flux-core#5796.
There are also some performance issues in flux-resource itself which were addressed in flux-framework/flux-core#5823 and flux-framework/flux-core#5824.
The flux-resource performance fixes will be available in flux-core v0.61.0, which is scheduled to be released 2024-04-02.
There are also some minor improvements in flux-sched v0.33.0. I think the system in question is still at v0.32.0.
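For anyone checking whether a given system already has these fixes, here is a minimal sketch of the version comparison in Python. The helper names `parse_version` and `at_least` are made up for illustration and are not part of Flux. The sketch assumes plain release strings such as `v0.61.0`, so a development build string with extra suffixes would need additional handling.

```python
def parse_version(text):
    """Turn 'v0.61.0' or '0.61.0' into a tuple of ints, e.g. (0, 61, 0)."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def at_least(installed, required):
    """Return True if the installed release is the required release or newer."""
    return parse_version(installed) >= parse_version(required)


# Release numbers taken from this thread.
FLUX_CORE_FIX = "v0.61.0"        # flux-resource performance fixes
FLUX_SCHED_IMPROVED = "v0.33.0"  # minor flux-sched improvements

if __name__ == "__main__":
    # flux-sched v0.32.0 predates the v0.33.0 improvements.
    print(at_least("v0.32.0", FLUX_SCHED_IMPROVED))
    # flux-core v0.61.0 itself includes the fixes.
    print(at_least("v0.61.0", FLUX_CORE_FIX))
```

Comparing tuples of integers rather than raw strings avoids the usual pitfall where "0.9.0" sorts after "0.61.0" as text.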
Job submissions are serialized across a large system, and each job submission takes on the order of 10s to complete. This means that submitting more than a handful of jobs can take minutes, and all other users are blocked from submitting their own jobs during this time.
It appears that the same is true for flux resource commands, and that they share the same "big lock" as flux job commands. A single flux job submission behind a flux resource list command on a large system took 32s to complete.
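To make the arithmetic behind these numbers concrete, here is a rough Python sketch of requests being handled strictly one at a time. It is only a toy model of the observed behavior, not a description of Flux internals. The function name `completion_times` is hypothetical, and the 22s figure for the resource listing is an assumption chosen so that a ~10s submission behind it finishes at the reported 32s. It was not measured separately.

```python
from itertools import accumulate


def completion_times(service_times_s):
    """Finish time of each request when all arrive at t=0 and a single
    worker handles them one at a time, in order."""
    return list(accumulate(service_times_s))


if __name__ == "__main__":
    # Six submissions at roughly 10s each: the last one finishes after a minute.
    print(completion_times([10] * 6))  # [10, 20, 30, 40, 50, 60]

    # Assumed split: a 22s resource listing followed by a 10s submission.
    print(completion_times([22, 10]))  # [22, 32]
```

Under this model the wait grows linearly with the number of queued requests, which matches the report that more than a handful of submissions takes minutes.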