Open natefoo opened 2 years ago
HTCondor has a similar feature; we use it on EU. By setting `accounting_group_user` we get the same effect within HTCondor: it gives priority to people who haven't started as many jobs. https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/files/galaxy/dynamic_rules/usegalaxy/sorting_hat.py#L401
```
$ condor_userprio | head
Last Priority Update: 10/11 09:31
                            Effective     Priority    Res     Total Usage   Time Since
User Name                    Priority      Factor   In Use   (wghted-hrs)   Last Usage
------------------------  ------------  ---------  ------   ------------   ----------
a@bi.uni-freiburg.de            500.08    1000.00       1           0.04      0+00:04
b@bi.uni-freiburg.de            500.16    1000.00       1           0.04      0+00:01
c@bi.uni-freiburg.de            500.24    1000.00       1           0.05      0+00:03
d@bi.uni-freiburg.de            500.24    1000.00       1         165.48      0+00:03
e@bi.uni-freiburg.de            500.24    1000.00       1         199.91      0+00:02
f@bi.uni-freiburg.de            500.24    1000.00       1       14313.75      0+00:04
```
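On the Galaxy side, this kind of mapping is done in a dynamic job rule. A minimal sketch of the idea, assuming a hypothetical rule function and destination id (this is not the actual `sorting_hat.py` code):

```python
# Sketch of a Galaxy dynamic destination rule that sets HTCondor's
# accounting_group_user, so the scheduler can apply per-user fair share
# even though all jobs run as the single Galaxy system user.
# The function name and the "condor" destination id are illustrative
# assumptions, not the real sorting_hat.py implementation.
import re


def condor_fairshare(user_email, destination_id="condor"):
    """Return (destination_id, submit_params) for a Galaxy job."""
    # Sanitize the email into a name HTCondor will accept as an
    # accounting group user.
    safe_user = re.sub(r"[^a-zA-Z0-9@._-]", "_", user_email)
    submit_params = {"accounting_group_user": safe_user}
    return destination_id, submit_params
```

The key point is that HTCondor charges usage to whatever identity `accounting_group_user` names, so distinct Galaxy users accrue distinct priorities.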
Galaxy's internal concurrency limits are necessary because, in most environments, Galaxy runs jobs as a single user, so you can't use your DRM's fair-share algorithms to prioritize users' jobs against each other. Without them, one user can stuff the DRM queue, and all other users have to wait for that user's jobs to finish before their own will run.
Galaxy's concurrency limits also provide some additional features, like the ability to limit at a meta-level (e.g. across multiple clusters). However, in the common case where there is only one cluster (or one cluster runs most jobs), it would be more powerful, and would provide better queue-time feedback to users, if we could just dump all the jobs into Slurm and let it sort them out.
Running jobs as real system users requires some hacks and isn't possible on public servers, but Slurm accounting can be used to achieve essentially the same effect: each Galaxy user could be added as an "account" in the Slurm accounting database. If we then submit jobs to Slurm using `--account`, Slurm's resource limits, which are far more powerful, can be used instead of Galaxy's. In addition, fair-share prioritization can be used to give higher priority to users who consume fewer resources.

Things that would be necessary in order to make this happen:

- `--account` param on job submission

This is not a super high priority for me yet, simply because I don't have one big cluster, I have 5 small ones. But in the future, with Singularity-based Pulsar coexecution environments (which will allow hybrid local/cloud clusters on a single Slurm controller), this would be a killer feature.