flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0

job submissions are serialized and not interactively performant #1159

Open adamdbertsch opened 1 month ago

adamdbertsch commented 1 month ago

Job submissions are serialized across a large system, and each job submission takes order 10s to complete. This means that submitting more than a handful of jobs can take minutes, and all other users are blocked from submitting their own jobs during this time.

It appears that the same is true for flux resource commands, and that they share the same "big lock" as flux job commands. A single flux job submission behind a flux resource list command on a large system took 32s to complete.

grondo commented 1 month ago

> Job submissions are serialized across a large system, and each job submission takes order 10s to complete

There are multiple issues going on here which may make job submissions appear to be serialized:

If the firstnodex policy doesn't resolve the slow submission performance, we may want to disable the feasibility plugin until feasibility performance issues can be addressed by Fluxion developers.

Related #1001

> It appears that the same is true for flux resource commands, and that they share the same "big lock" as flux job commands

There is no "big lock"; however, the current version of flux-resource does need to query the scheduler for the scheduler state of resources. Since the scheduler is single-threaded, it can only do one thing at a time, so any request that arrives while it is busy has to wait its turn.
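To illustrate why a single-threaded service can look like a global lock from the outside, here is a minimal sketch of a FIFO single-server queue. It is a toy model, not Fluxion code: the `Request` class, the `fifo_completion_times` helper, and the service times (20s and 10s) are all hypothetical values chosen for illustration, not measurements from the system in question.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival: float  # seconds since t=0 when the request reaches the server
    service: float  # seconds of server time the request needs (hypothetical)


def fifo_completion_times(requests):
    """Return {name: (wait, finish)} for a single-threaded FIFO server.

    The server handles one request at a time, in arrival order, so a
    request's wait is the time left on everything queued ahead of it.
    """
    clock = 0.0
    result = {}
    # sorted() is stable, so requests with equal arrival keep list order.
    for req in sorted(requests, key=lambda r: r.arrival):
        start = max(clock, req.arrival)
        clock = start + req.service
        result[req.name] = (start - req.arrival, clock)
    return result


# Illustrative numbers only: a slow resource query followed by a job submission.
requests = [
    Request("resource-list", arrival=0.0, service=20.0),
    Request("job-submit", arrival=0.0, service=10.0),
]
times = fifo_completion_times(requests)
for name, (wait, finish) in times.items():
    print(f"{name}: waited {wait:.0f}s, finished at t={finish:.0f}s")
```

In this model the job submission spends most of its total latency waiting behind the unrelated resource query, even though no lock is involved. Shortening either request's service time reduces the delay seen by everything queued behind it.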

This was solved by flux-framework/flux-core#5796.

There are also some performance issues in flux-resource itself which were addressed in flux-framework/flux-core#5823 and flux-framework/flux-core#5824.

The flux-resource performance fixes will be available in flux-core v0.61.0, which is scheduled to be released 2024-04-02.

grondo commented 1 month ago

There are also some minor improvements in flux-sched v0.33.0. I think the system in question is still at v0.32.0.
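As a rough aid for checking whether a system already has the releases mentioned above, the sketch below compares version strings numerically. The helper names `parse_version` and `needs_upgrade` are made up for this example, the installed flux-core version is a placeholder (it is not stated in this thread), and the parser assumes plain release tags such as `v0.61.0` with no development suffix.

```python
def parse_version(text):
    """Turn 'v0.61.0' or '0.61.0' into a tuple of ints such as (0, 61, 0)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def needs_upgrade(installed, required):
    """True if the installed release is older than the required one."""
    return parse_version(installed) < parse_version(required)


# Releases named in this thread as carrying the fixes.
required = {"flux-core": "v0.61.0", "flux-sched": "v0.33.0"}

# flux-sched v0.32.0 is the guess made above; the flux-core value is a
# hypothetical placeholder and should be replaced with the real version.
installed = {"flux-core": "v0.60.0", "flux-sched": "v0.32.0"}

for package, minimum in required.items():
    status = "upgrade needed" if needs_upgrade(installed[package], minimum) else "ok"
    print(f"{package}: installed {installed[package]}, want {minimum}: {status}")
```

Comparing integer tuples rather than raw strings matters here because a string comparison would sort `0.9.0` after `0.10.0`.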