fractal-analytics-platform / fractal-server

Fractal backend
https://fractal-analytics-platform.github.io/fractal-server/
BSD 3-Clause "New" or "Revised" License

slurm runner: Reduce number of necessary sudo ls calls #884

Closed jluethi closed 11 months ago

jluethi commented 11 months ago

Our current slurm runner makes many sudo calls to check for the out pickle files. Those commands look like this:

ls /path/to/.fractal-cache/proj_00000x_wf_00000x_job_0000x_20231001_183800/2_par_Plate_zarr_C_6_0__out_Aoa3zrkVVTnNwyFmLAIUZiCCZ8ypVth9.pickle

(and often for more than one pickle at once)

When users scale Fractal usage to processing tens of plates with hundreds of wells at once, this leads to sudo ls calls every 1-2 seconds. Given that some institutions keep stricter logs of sudo usage, this results in millions of lines of sudo ls *.pickle in the logs, overflowing the sudo log storage.

Apparently, the issue is the sudo ls calls, not the cat/copy calls (there are probably quite a few of those?) or the user impersonation used to submit the SLURM jobs (there are relatively few of those thanks to job batching).

@tcompa What's the current strategy of using sudo ls to check for the existence of the pickle files? Can we evaluate how sudo calls currently scale with a) job duration (are we continuously checking for the output until it comes up?) and b) workflow size (as in number of wells to be processed => workflow tasks that are submitted. Not in slurm jobs, those are under control)

jluethi commented 11 months ago

One thing I'm currently following up on is whether the sudo calls occurred more during the run of 15 plates that finished very fast, or during a Cellpose task that ran way longer than expected (we're not sure yet what caused that; it will be a separate thing to look into).

tcompa commented 11 months ago

Here is a first quick comment, and I'll follow up later.

We are using the interval attribute of the SLURM executor defined in clusterfutures, which defaults to 1 second. That's the interval between two subsequent calls of commands like

sudo -u user ls \
   /somewhere/2.well_1.out.pickle \
   /somewhere/2.well_2.out.pickle \
   /somewhere/2.well_3.out.pickle

A very obvious mitigation strategy is to expose this variable as a fractal-server configuration variable and then set it to a larger interval (e.g. 5 seconds). This would immediately reduce the number of sudo-ls calls, at the price of possibly increasing task execution time by that same interval (e.g. 5 seconds).

Note that we already do this with https://fractal-analytics-platform.github.io/fractal-server/configuration/#fractal_server.config.Settings.FRACTAL_SLURM_POLL_INTERVAL, to avoid calling squeue too frequently.
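
For concreteness, here is a minimal sketch of what exposing the interval could look like, assuming a pydantic-v1 BaseSettings-style Settings class like the one linked above; the variable name FRACTAL_SLURM_LS_INTERVAL and the default values are only illustrative, not actual fractal-server code:

from pydantic import BaseSettings

class Settings(BaseSettings):
    # Existing variable: interval (in seconds) between two squeue calls
    FRACTAL_SLURM_POLL_INTERVAL: int = 5  # illustrative default
    # Hypothetical new variable: interval between two sudo-ls existence checks,
    # replacing the 1-second default inherited from clusterfutures
    FRACTAL_SLURM_LS_INTERVAL: int = 5  # illustrative default

settings = Settings()
# The executor would then be constructed with interval=settings.FRACTAL_SLURM_LS_INTERVAL
# instead of relying on the clusterfutures default.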

Also note that this mitigation strategy clearly does not affect the scaling with the number of parallel components (e.g. wells).


Some clarifications about how many sudo calls are made.

  1. Each workflow execution in Fractal (i.e. each call to the apply endpoint) results in the creation of a "disposable" FractalSlurmExecutor object, which is used until the job succeeds/fails and is then discarded.
  2. The same FractalSlurmExecutor is used for all tasks that are part of a given job, including all parallel instances of a per-well or per-image task.
  3. Each non-parallel task is stored in a "waiting list" and associated with one file.
  4. For a parallel task with N components, what matters is the number of SLURM jobs (determined based on the fractal-server JSON SLURM configuration file). If N=100 components are split into M=5 SLURM jobs (with 20 components each), then M=5 entries are added to the waiting list (each one associated with a list of 20 files).
  5. For each item in the waiting list (which can correspond to one or more files), the FractalSlurmExecutor checks for the existence of the associated files at regular intervals. Checking for the existence of one file corresponds to a single sudo-ls call; checking for the existence of, say, 20 files also corresponds to a single sudo-ls call (like sudo -u user ls file1 file2 file3 ...). A rough sketch of this loop follows the list.
  6. The time interval is constant over time, meaning that the number of sudo-ls calls for a given (running) workflow increases linearly with the task duration.
  7. Whenever a SLURM job is over (for any reason), the corresponding item of the waiting list is removed.
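
To make the mechanism above more concrete, here is a heavily simplified sketch of the polling loop; function and variable names are illustrative and do not correspond to the actual clusterfutures/fractal-server internals:

import subprocess
import time

def all_files_exist(user: str, filenames: list[str]) -> bool:
    # One sudo-ls call per waiting-list item, no matter how many files it tracks
    cmd = ["sudo", "-u", user, "ls", *filenames]
    res = subprocess.run(cmd, capture_output=True)
    return res.returncode == 0  # ls fails if at least one file is still missing

def wait_for_outputs(waiting_list: dict[str, list[str]], user: str, interval: float = 1.0):
    # waiting_list maps a SLURM job id to the output pickle files it should produce
    while waiting_list:
        for job_id, files in list(waiting_list.items()):
            if all_files_exist(user, files):
                # Point 7: the job is over, drop the corresponding item
                waiting_list.pop(job_id)
        time.sleep(interval)  # the clusterfutures interval, defaulting to 1 second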

Then my first answers (let's also re-discuss this in more detail) would be:

Can we evaluate how sudo calls currently scale with a) job duration (are we continuously checking for the output until it comes up?)

Linearly: the longer the task, the more sudo-ls calls we do

and b) workflow size (as in number of wells to be processed => workflow tasks that are submitted. Not in slurm jobs, those are under control)

For now I'd say that the number of sudo-ls calls scales with the number of SLURM jobs. If all N=100 wells are processed as part of a single SLURM job, I'd expect the number of sudo-ls calls to be equivalent to the case of a non-parallel task (apart from the fact that the total runtime will be longer, so there will be more sudo-ls calls).

enricotagliavini commented 11 months ago

A quick comment / idea to improve on the polling design: calling sudo every few seconds is, generally speaking, a quite bad idea. It pollutes the logs and makes auditing very hard.

Could you instead execute a shell (sudo bash, for example) or a persistent custom-made program to "listen" on the other side and give you the results? Also note that you can use file notification ( https://en.wikipedia.org/wiki/Inotify ) to monitor for file changes. There is a limit to how many watches you can set up, but you can use that rather than polling.
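
For reference, a minimal sketch of the inotify idea, using the third-party inotify_simple package; the path and filenames are placeholders, and the watcher needs read access to the user's directory, so in the Fractal setup it would probably have to run inside the persistent per-user helper suggested above:

from inotify_simple import INotify, flags

inotify = INotify()
# Watch the job's cache directory for newly created or moved-in files
inotify.add_watch("/path/to/.fractal-cache/some_job_dir", flags.CREATE | flags.MOVED_TO)

# Hypothetical set of output pickles we are waiting for
pending = {"2.well_1.out.pickle", "2.well_2.out.pickle", "2.well_3.out.pickle"}
while pending:
    for event in inotify.read():  # blocks until at least one event arrives
        pending.discard(event.name)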

Otherwise you can simply run squeue every few seconds, without the need for sudo at all. That is probably even simpler, and SLURM should be able to handle that number of RPCs (assuming there are several tens of active users at a given time, this would only result in a few requests per second).

I assume the SLURM jobs terminate immediately as soon as the work is done, and the delay is simply in submitting the next step, right? Even a 30-second delay (which is not needed; we can stay at 5/10 seconds) would not be a big problem considering the overall length of the processing, which is in the hours range.

Thank you. Kind regards.

tcompa commented 11 months ago

Our basic assumption here was that calls to ls (even with sudo, and even in large numbers) would be "cheaper" than calls to squeue, which is why we set the former to be very frequent and the latter to be relatively rare. If each sudo-ls call has side effects (i.e. logging), then the assumption no longer holds.

In #885, I'm proposing to fully remove the sudo-ls mechanism for checking whether a task is over, and to rely only on the presence/absence of the job in the squeue output. This also has the advantage that a single squeue command is needed even when a parallel task is using N SLURM jobs, while the file-existence-based checks would need N sudo-ls calls.
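
For illustration, here is a sketch of the kind of check this boils down to; the exact flags and parsing are my own assumption, not the code in #885:

import subprocess

def jobs_still_in_squeue(user: str) -> set[str]:
    # A single squeue call is enough even when a parallel task spans many SLURM jobs;
    # -h drops the header and -o "%i" prints only the job ids
    cmd = ["squeue", "-h", "-o", "%i", "--user", user]
    res = subprocess.run(cmd, capture_output=True, text=True)
    return {line.strip() for line in res.stdout.splitlines() if line.strip()}

# A tracked job is considered over as soon as it disappears from the squeue output
tracked_jobs = {"1234", "1235", "1236"}
finished = tracked_jobs - jobs_still_in_squeue("someuser")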

If we set the squeue-polling interval to e.g. 5 seconds and take the example of 20 active users with 5 running workflows each at a given time, then fractal-server will make approximately 20 squeue calls per second (100 polling loops, each hitting squeue once every 5 seconds), which seems reasonable - unless the SLURM API has specific limitations.

A few additional notes, for the record:

  1. This change may introduce (in the worst case) a 5-second overhead on each task of a workflow (compared to the 1-second overhead we have now). That is, a 5-task workflow may have (in the worst case) 25 seconds of overhead. To me this looks OK.
  2. We also introduce an additional overhead for when something fails, via the https://fractal-analytics-platform.github.io/fractal-server/configuration/#fractal_server.config.Settings.FRACTAL_SLURM_OUTPUT_FILE_GRACE_TIME variable. When the SLURM job is over but the output file is missing, we wait for this grace time (e.g. another 5 seconds) and re-check whether the file is there (see the sketch after this list). This was introduced to handle the case of a very slow filesystem, and it is probably not very relevant at the moment. Still, we can keep it (and set it to a small value, like 5 seconds), because it also covers the time it may take for the SLURM stdout/stderr files to be written to disk. We should probably also rename it to something like FRACTAL_SLURM_ERROR_HANDLING_INTERVAL, as it is not only related to output files.
  3. As part of PR #885, we are also removing an additional waiting interval (https://fractal-analytics-platform.github.io/fractal-server/configuration/#fractal_server.config.Settings.FRACTAL_SLURM_KILLWAIT_INTERVAL), which was related to the killwait SLURM parameter (i.e. the time SLURM waits between sending a SIGTERM and then a SIGKILL). Upon reviewing this, we don't think it matters for fractal-server whether a killwait interval is in progress; we only need to know the SLURM job status.
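
Here is a small sketch of the grace-time logic of point 2 as I understand it (ignoring the sudo/impersonation aspect and using illustrative names):

import os
import time

def output_file_appeared(path: str, grace_time: float) -> bool:
    # The SLURM job is already over; if the output file is not there yet, wait
    # once for FRACTAL_SLURM_OUTPUT_FILE_GRACE_TIME seconds and check again,
    # to tolerate a slow filesystem or late stdout/stderr writes
    if os.path.exists(path):
        return True
    time.sleep(grace_time)
    return os.path.exists(path)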

enricotagliavini commented 11 months ago

If you want to check the status of a job you can also use scontrol show job JOBID. On some SLURM configs this information will not always be present, but normally it's kept for a while (>minutes; 5 days in FMI's config):

# scontrol show job JOBID
JobId=JOBID JobName=Convert_to_OME-Zarr
   UserId=johndoe(1000) GroupId=johndoe(1000) MCS_label=N/A
   Priority=303 Nice=0 Account=fractal QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
...

Very recent versions of SLURM also support printing the information in JSON or YAML format.

This allows you not only to check if a job is completed, but also whether it's stuck, whether it failed and, sometimes, why (e.g. OUT_OF_MEMORY).
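
For completeness, a small sketch of how the key=value output above could be parsed; this is a hypothetical helper, not something fractal-server currently does:

import subprocess
from typing import Optional

def scontrol_job_state(job_id: str) -> Optional[str]:
    # Parse the "Key=Value" pairs printed by `scontrol show job JOBID`
    res = subprocess.run(
        ["scontrol", "show", "job", job_id], capture_output=True, text=True
    )
    if res.returncode != 0:
        return None  # the job is no longer known to scontrol
    # Naive whitespace split: good enough for JobState, not for values with spaces
    fields = dict(
        item.split("=", 1) for item in res.stdout.split() if "=" in item
    )
    return fields.get("JobState")  # e.g. RUNNING, COMPLETED, FAILED, OUT_OF_MEMORY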

Maybe an improvement for the future.

Thank you for the quick change.