galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

Support using Slurm Accounting for concurrency limits #12678

Open natefoo opened 2 years ago

natefoo commented 2 years ago

Galaxy's internal concurrency limits are necessary because in most environments Galaxy runs all jobs as a single system user, so you can't use your DRM's internal fairness algorithms to prioritize users' jobs against each other. Without these limits, one user can stuff the DRM queue, and all other users have to wait for that user's jobs to finish before their own jobs will run.

Galaxy's concurrency limits also give you some additional features, like the ability to limit at a meta-level (e.g. across multiple clusters). However, in the common case where there's only one cluster (or one cluster runs most jobs), it would be more powerful, and would give users better queue-time feedback, if we could just dump all the jobs into Slurm and have it sort them out.

Running jobs as real system users requires some hacks and isn't possible for public servers, but Slurm Accounting can achieve essentially the same effect. Each Galaxy user could be added as an "account" in the Slurm accounting database; if we then submit jobs to Slurm using --account, Slurm's resource limits, which are far more powerful, can be used instead of Galaxy's. In addition, fair share prioritization can be used to give higher priority to users who have used fewer resources.
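To make the idea concrete, here is a minimal sketch of the command lines such an integration might generate. Nothing here is existing Galaxy code: the helper names, the `gx_` account prefix, the `galaxy` service user, and the limit of 4 running jobs are all assumptions for illustration. The `sacctmgr` and `sbatch` invocations use options I believe exist (`add account`, `add user ... account=`, `GrpJobs`, `--account`), but the exact syntax should be checked against your Slurm version, and limits are only enforced if accounting enforcement is enabled on the cluster.

```python
import re


def slurm_account_name(galaxy_user: str) -> str:
    """Map a Galaxy username/email to a Slurm-safe account name.

    The "gx_" prefix and the character whitelist are assumptions,
    not an existing Galaxy or Slurm convention.
    """
    safe = re.sub(r"[^a-z0-9_]", "_", galaxy_user.lower())
    return f"gx_{safe}"


def sacctmgr_commands(account: str, service_user: str = "galaxy",
                      max_running_jobs: int = 4) -> list[list[str]]:
    """Build (but do not run) the accounting setup commands for one account."""
    return [
        # Create the per-user account in the accounting database.
        ["sacctmgr", "-i", "add", "account", account],
        # Let the single system user that Galaxy runs as submit under it.
        ["sacctmgr", "-i", "add", "user", service_user, f"account={account}"],
        # Cap concurrently running jobs for this account (hypothetical limit).
        ["sacctmgr", "-i", "modify", "account", "where", f"name={account}",
         "set", f"GrpJobs={max_running_jobs}"],
    ]


def sbatch_command(account: str, script_path: str) -> list[str]:
    """Build the submission command charging the job to the user's account."""
    return ["sbatch", f"--account={account}", script_path]


if __name__ == "__main__":
    acct = slurm_account_name("Alice@Example.org")
    for cmd in sacctmgr_commands(acct):
        print(" ".join(cmd))
    print(" ".join(sbatch_command(acct, "galaxy_job.sh")))
```

Returning argument lists rather than executing them keeps the sketch side-effect free; a real integration would also need to decide when accounts get created and how anonymous users are handled.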

Things that would be necessary in order to make this happen:

This is not a super high priority for me yet, simply because I don't have one big cluster; I have 5 small ones. In the future, though, with Singularity-based Pulsar coexecution environments (which will allow for hybrid local/cloud clusters on a single Slurm controller), this would be a killer feature.

hexylena commented 2 years ago

HTCondor has a similar feature, and we use it on EU. By setting accounting_group_user we get the same effect within HTCondor: it gives priority to people who haven't started as many jobs. https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/files/galaxy/dynamic_rules/usegalaxy/sorting_hat.py#L401

$ condor_userprio  | head
Last Priority Update: 10/11 09:31
                          Effective   Priority   Res   Total Usage  Time Since
User Name                  Priority    Factor   In Use (wghted-hrs) Last Usage
------------------------ ------------ --------- ------ ------------ ----------
a@bi.uni-freiburg.de       500.08   1000.00      1         0.04    0+00:04
b@bi.uni-freiburg.de       500.16   1000.00      1         0.04    0+00:01
c@bi.uni-freiburg.de       500.24   1000.00      1         0.05    0+00:03
d@bi.uni-freiburg.de       500.24   1000.00      1       165.48    0+00:03
e@bi.uni-freiburg.de       500.24   1000.00      1       199.91    0+00:02
f@bi.uni-freiburg.de       500.24   1000.00      1     14313.75    0+00:04