jluethi closed this issue 11 months ago.
One thing I'm currently following up on is whether the sudo calls occurred mostly while running 15 plates that finished very fast, or while running a Cellpose task that took much longer than expected (we're not yet sure what caused that; it will be a separate thing to look into).
Here is a first quick comment, and I'll follow up later.
We are using the `interval` attribute of the SLURM executor as defined in clusterfutures, which defaults to 1 second. That's the interval between two subsequent calls of commands like
```
sudo -u user ls \
    /somewhere/2.well_1.out.pickle \
    /somewhere/2.well_2.out.pickle \
    /somewhere/2.well_3.out.pickle
```
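For illustration, here is a minimal sketch of how such a batched existence check could look. The function name `files_exist` and the `impersonate_user` parameter are hypothetical (not fractal-server's actual API); the key point from the thread is that `ls` with M paths costs one subprocess call, and its exit code tells you whether all M files exist. The demo runs plain `ls` without `sudo`.

```python
import subprocess
import tempfile
from pathlib import Path

def files_exist(paths, impersonate_user=None):
    """Check existence of many files with a single ls call.

    ls exits non-zero if any listed file is missing, so one subprocess
    covers M files. With impersonation, the command would be prefixed
    with `sudo -u <user>` (hypothetical parameter, for illustration).
    """
    cmd = ["ls"] + [str(p) for p in paths]
    if impersonate_user is not None:
        cmd = ["sudo", "-u", impersonate_user] + cmd
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0

# Demo without sudo: two files that exist, then a batch with one missing file.
with tempfile.TemporaryDirectory() as d:
    a = Path(d) / "2.well_1.out.pickle"
    b = Path(d) / "2.well_2.out.pickle"
    a.touch()
    b.touch()
    print(files_exist([a, b]))                       # True
    print(files_exist([a, b, Path(d) / "missing"]))  # False
```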
A very obvious mitigation strategy is that we can expose this variable as a fractal-server configuration variable, and then set it to a larger interval (e.g. 5 seconds). This will immediately reduce the number of sudo-ls calls, at the price of possibly increasing task execution time by up to that same interval (e.g. 5 seconds).
Note that we already do this with https://fractal-analytics-platform.github.io/fractal-server/configuration/#fractal_server.config.Settings.FRACTAL_SLURM_POLL_INTERVAL, to avoid calling `squeue` too frequently.
Also note that this mitigation strategy clearly does not affect the scaling with the number of parallel components (e.g. wells).
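The mitigation above can be sketched as a generic polling loop with a configurable interval. The environment-variable name `FRACTAL_SLURM_LS_INTERVAL` and the function `wait_for` are hypothetical, chosen only to illustrate the trade-off: a larger interval means fewer checks, at the price of detecting completion up to one interval late.

```python
import os
import time

# Hypothetical configuration variable name, for illustration only.
POLL_INTERVAL = float(os.environ.get("FRACTAL_SLURM_LS_INTERVAL", "5.0"))

def wait_for(predicate, interval=POLL_INTERVAL, timeout=3600.0):
    """Poll `predicate` every `interval` seconds until it returns True.

    Fewer calls per unit time (larger interval) directly reduces the
    number of (sudo-)ls invocations, but completion may be noticed up
    to `interval` seconds late.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# Example: a predicate that becomes true on the third check.
calls = {"n": 0}
def ready():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_for(ready, interval=0.01))  # True
```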
Some clarifications about how many sudo calls are made:

- For each job, a `FractalSlurmExecutor` object is created, which is used until the job succeeds/fails and is then discarded.
- The same `FractalSlurmExecutor` is used for all tasks that are part of a given job, including all parallel instances of a per-well or per-image task.
- The `FractalSlurmExecutor` performs existence checks of the associated files at regular intervals. Checking for the existence of one file corresponds to a single sudo-ls call; checking for the existence of M=20 files also corresponds to a single sudo-ls call (like `sudo -u user ls file1 file2 file3 ...`).

Then my first answers (let's also re-discuss this more in detail) would be:
> Can we evaluate how sudo calls currently scale with a) job duration (are we continuously checking for the output until it comes up?)

Linearly: the longer the task runs, the more sudo-ls calls we make.
> and b) workflow size (as in number of wells to be processed => workflow tasks that are submitted. Not in slurm jobs, those are under control)

For now I'd say that the number of sudo-ls calls scales with the number of SLURM jobs. If all N=100 wells are processed as part of a single SLURM job, I'd expect the number of sudo-ls calls to be equivalent to the case of a non-parallel task (apart from the fact that the total runtime will be longer, and then there will be more sudo-ls calls).
A quick comment / idea to improve on the poll design: calling sudo every few seconds is, generally speaking, quite a bad idea. It pollutes the logs and makes auditing very hard.
Can you instead execute a shell (`sudo bash`, for example) or a persistent custom-made program to "listen" on the other side and give you the results? Also: you can use file notifications (https://en.wikipedia.org/wiki/Inotify) to monitor for file changes. There is a limit to how many watches you can register, but you can use that rather than polling.
Otherwise you can simply run `squeue` every few seconds, without the need for sudo at all. That is probably even simpler, and SLURM should be able to handle that number of RPCs (assuming there are several tens of active users at a given time, this would only result in a few requests per second).
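A minimal sketch of the `squeue`-based approach: ask `squeue` for job IDs and states (the real flags `--noheader` and `--format='%i %T'` produce that shape), then treat any submitted job that is absent from the output as finished. The helper names `parse_squeue` and `finished_jobs` are hypothetical, and the caveat from later in the thread applies: jobs may also disappear from `squeue` after failing.

```python
def parse_squeue(output):
    """Parse `squeue --noheader --format='%i %T'` output into {job_id: state}."""
    states = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            states[parts[0]] = parts[1]
    return states

def finished_jobs(submitted_ids, squeue_output):
    """Jobs absent from the squeue output are assumed finished (or purged)."""
    active = parse_squeue(squeue_output)
    return {job_id for job_id in submitted_ids if job_id not in active}

# Example with canned squeue output: job 1003 no longer shows up.
sample = "1001 RUNNING\n1002 PENDING\n"
print(finished_jobs({"1001", "1002", "1003"}, sample))  # {'1003'}
```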
I assume the SLURM jobs terminate immediately as soon as the work is done; the delay is simply in submitting the next step, right? Even a 30-second delay (which is not needed, we can stay at 5/10 seconds) would not be a big problem considering the overall length of the processing, which is in the hours range.
Thank you. Kind regards.
Our basic assumption here was that calls to `ls` (even with `sudo`, and even in large numbers) would be "cheaper" than calls to `squeue`, and that's why we set the former to be very frequent and the latter to be relatively rare. If each sudo-ls call has side effects (i.e. logging), then the assumption is not valid any more.
In #885, I'm proposing to fully remove the sudo-ls mechanism for checking whether a task is over, and to rely only on the presence/absence of the job in the `squeue` output. This also has the advantage that a single `squeue` command is needed even when a parallel task is using N SLURM jobs, while the file-existence-based checks would need N sudo-ls calls.
If we set the squeue-polling interval to e.g. 5 seconds and take the example of 20 active users with 5 workflows running at a given time, then `fractal-server` will make approximately 20 `squeue` calls per second, which seems reasonable - unless the SLURM API has specific limitations.
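The arithmetic behind that estimate, spelled out: 20 users times 5 workflows gives 100 concurrent pollers, each issuing one `squeue` call per 5-second interval.

```python
users = 20
workflows_per_user = 5
poll_interval_s = 5.0

# 100 concurrent workflows, each polling squeue once per interval.
active_pollers = users * workflows_per_user
calls_per_second = active_pollers / poll_interval_s
print(calls_per_second)  # 20.0
```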
A few additional notes, for the record:

- `FRACTAL_SLURM_ERROR_HANDLING_INTERVAL` - as it's not only related to output files.

If you want to check the status of a job, you can also use `scontrol show job JOBID`; on some SLURM configs this will not always be present, but normally it's kept for a while (>minutes; 5 days in FMI's config):
```
# scontrol show job JOBID
JobId=JOBID JobName=Convert_to_OME-Zarr
   UserId=johndoe(1000) GroupId=johndoe(1000) MCS_label=N/A
   Priority=303 Nice=0 Account=fractal QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   ...
```
Very recent versions of SLURM also support printing this information in JSON or YAML format.
This allows you not only to check if a job is completed, but also if it's stuck, if it failed and, sometimes, why (e.g. OUT_OF_MEMORY).
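On older SLURM versions without JSON/YAML output, the classic `key=value` text can be parsed with a few lines. This is a hedged sketch (the function name is made up, and it assumes no values contain spaces, which fields like `JobName` can violate); it shows how `JobState` and `ExitCode` could be extracted to detect completion or failure.

```python
def parse_scontrol(output):
    """Parse classic `scontrol show job` key=value output into a flat dict.

    Simplifying assumption for this sketch: values contain no spaces
    (not always true, e.g. a JobName with spaces would be truncated).
    """
    info = {}
    for token in output.split():
        if "=" in token:
            key, _, value = token.partition("=")
            info[key] = value
    return info

# Example using the output shape shown above.
sample = (
    "JobId=1234 JobName=Convert_to_OME-Zarr\n"
    "JobState=RUNNING Reason=None Dependency=(null)\n"
    "Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0\n"
)
job = parse_scontrol(sample)
print(job["JobState"], job["ExitCode"])  # RUNNING 0:0
```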
Maybe an improvement for the future.
Thank you for the quick change.
Our current SLURM runner makes many sudo calls to check for the out-pickle files (often for more than one pickle at once). Those commands look like `sudo ls *.pickle`.
When users scale Fractal usage to processing tens of plates with hundreds of wells at once, this leads to `sudo ls` calls every 1-2 seconds. Given that some institutions keep stricter logs of sudo usage, that then leads to millions of lines of `sudo ls *.pickle` in the logs, thus overflowing sudo log storage. Apparently, the issue is the `sudo ls` calls, not the cat/copy operations (there are probably quite a few of those?) or the user impersonation used to submit the SLURM jobs (there are relatively few of those, due to job batching).
@tcompa What's the current strategy of using sudo ls to check for the existence of the pickle files? Can we evaluate how sudo calls currently scale with a) job duration (are we continuously checking for the output until it comes up?) and b) workflow size (as in number of wells to be processed => workflow tasks that are submitted. Not in slurm jobs, those are under control)