Snakemake-Profiles / slurm

Cookiecutter for snakemake slurm profile
MIT License

cluster-status should use squeue --iterate #81

Closed holtgrewe closed 2 years ago

holtgrewe commented 2 years ago

I'm a long-term Snakemake user, starting out with SGE and DRMAA (a standard actually driven forward by the SGE vendor). We switched to Slurm quite some time ago and it works fine with DRMAA, but there is one problem: our controller is drowning in an "RPC storm". What is an RPC storm? Many users performing queries such as squeue repetitively, e.g. as watch squeue. Something similar is true for sacct, but that load hits the slurmdbd rather than the controller.

Why does this matter, you might wonder. Here is why: the cluster status script uses sacct internally, as you already know. You probably use it because Slurm jobs are no longer visible in squeue once more than MinJobAge seconds have passed.

So, to summarize up to here: repeated squeue queries from many users add up to an RPC storm on the controller; sacct queries instead put the load on the slurmdbd; and the cluster status script issues one sacct call per job status check.

Now, what do the wonderful SchedMD recommend? Using the -i/--iterate option of squeue. This makes one RPC call and keeps it open; the controller then happily prints the queue to you periodically with almost no extra load. Choosing an -i value small enough also ensures that snakemake sees all status updates.

Example output below.

Fri Feb 11 16:32:19 2022
JOBID,STATE
122577,PENDING
98589,PENDING
98588,PENDING
98587,PENDING

Fri Feb 11 16:32:24 2022
JOBID,STATE
122577,PENDING
98589,PENDING
98588,PENDING
98587,PENDING
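A status helper can turn each such block into a job-id to state mapping. A minimal parsing sketch (the function name is illustrative, not part of the profile):

```python
def parse_squeue_block(text):
    """Parse one timestamped JOBID,STATE block of squeue output
    into a dict mapping job id -> Slurm state."""
    states = {}
    for line in text.strip().splitlines():
        line = line.strip()
        # Skip blanks, the timestamp line (no comma), and the header line.
        if not line or "," not in line or line.startswith("JOBID"):
            continue
        jobid, _, state = line.partition(",")
        if jobid.isdigit():
            states[jobid] = state
    return states

block = """Fri Feb 11 16:32:19 2022
JOBID,STATE
122577,PENDING
98589,PENDING
"""
print(parse_squeue_block(block))  # {'122577': 'PENDING', '98589': 'PENDING'}
```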

What are the actionables that I propose, you might ask.

Good question. What we would need is to cache the output of one periodic squeue call and have cluster-status answer its per-job status queries from that cache, instead of running sacct once per job.

One way to implement this is to have the slurm profile actually serve the cache values through a micro REST API.
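A rough sketch of such a sidecar, assuming a `squeue --noheader -o "%i,%T"` output format; the port, endpoint shape, and all names here are illustrative, not part of the profile:

```python
import subprocess
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

CACHE = {}  # job id -> state, shared between the poller and the HTTP handler
LOCK = threading.Lock()

def update_cache(squeue_output):
    """Replace the cache contents from one squeue dump ("jobid,state" lines)."""
    with LOCK:
        CACHE.clear()
        for line in squeue_output.splitlines():
            jobid, _, state = line.partition(",")
            if jobid:
                CACHE[jobid] = state

def refresh_loop(interval=30):
    """Poll squeue once per interval: N status queries cost one RPC, not N."""
    while True:
        out = subprocess.run(
            ["squeue", "--noheader", "-o", "%i,%T"],
            capture_output=True, text=True,
        ).stdout
        update_cache(out)
        time.sleep(interval)

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # GET /<jobid> answers with the cached Slurm state as plain text.
        with LOCK:
            state = CACHE.get(self.path.lstrip("/"), "UNKNOWN")
        self.send_response(200)
        self.end_headers()
        self.wfile.write(state.encode())

def serve(port=8123):
    # Not called here; the sidecar would run this alongside Snakemake.
    threading.Thread(target=refresh_loop, daemon=True).start()
    HTTPServer(("127.0.0.1", port), StatusHandler).serve_forever()
```

cluster-status would then be a one-liner fetching http://127.0.0.1:8123/&lt;jobid&gt; and mapping the returned Slurm state to success/running/failed.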

percyfal commented 2 years ago

Hi @holtgrewe , thanks for the illuminating comment - I wasn't aware of the --iterate option! I have not encountered the problems you describe at our HPC, but I have had the nagging feeling that something like this is bound to happen if enough users execute these commands.

Could you elaborate more specifically what changes would be needed here and how they relate to the snakemake pull request you reference? The sidecar process would launch squeue (or any command of choice) and cluster-status would then communicate with the process? Is there anything that could already be implemented in the profile, or should I just hang tight until the PR is complete?

holtgrewe commented 2 years ago

Hi. I stand corrected. According to Slurm support, doing watch -n 10 squeue is the same as squeue -i 10. However, caching the squeue output should still be better than many sacct calls. Let me see my snakemake PRs through and I will continue here.
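A sketch of that caching idea inside the status script itself: keep one bulk squeue snapshot for a short TTL and answer per-job queries from it. The names and the TTL are assumptions; the runner is injectable only to keep the sketch self-contained (in the real script it would be a subprocess call to squeue):

```python
import time

_cache = {"ts": None, "states": {}}

def job_state(jobid, runner, ttl=30):
    """Return the Slurm state for jobid, refreshing the shared
    squeue snapshot at most once per ttl seconds."""
    now = time.monotonic()
    if _cache["ts"] is None or now - _cache["ts"] > ttl:
        states = {}
        # runner() returns "jobid,state" lines, one per job.
        for line in runner().splitlines():
            jid, _, state = line.partition(",")
            if jid:
                states[jid] = state
        _cache["states"] = states
        _cache["ts"] = now
    return _cache["states"].get(jobid, "UNKNOWN")
```

In the profile, runner could be something like `lambda: subprocess.run(["squeue", "--noheader", "-o", "%i,%T"], capture_output=True, text=True).stdout`, so that checking fifty jobs within one TTL window costs a single squeue call.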

holtgrewe commented 2 years ago

Addressed by #85. This will need Snakemake v7 and the PR mentioned in the ticket.

As an alternative, administrators can set up a central slurmrestd behind a caching proxy, and slurm-status.py could query that. As I have not set up slurmrestd centrally yet and it requires administrator intervention, I exclude this from the scope here.
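For completeness, a hedged sketch of what such a query could look like. The endpoint path, API version, header names, and response fields are my reading of the slurmrestd documentation and should be checked against the actual deployment:

```python
import json
import urllib.request

def job_request(base_url, jobid, user, token, api="v0.0.37"):
    """Build a GET request for one job record from slurmrestd.
    A caching proxy in front of base_url would absorb repeated
    identical queries from many Snakemake instances."""
    req = urllib.request.Request(f"{base_url}/slurm/{api}/job/{jobid}")
    req.add_header("X-SLURM-USER-NAME", user)
    req.add_header("X-SLURM-USER-TOKEN", token)
    return req

def state_from_response(body):
    """Extract the first job's state from a slurmrestd JSON response
    (field layout assumed from the v0.0.37 schema)."""
    jobs = json.loads(body).get("jobs", [])
    return jobs[0].get("job_state", "UNKNOWN") if jobs else "UNKNOWN"
```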