DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0
894 stars 241 forks

Implement *real* per-node cleanup for Slurm specifically #4882

Open adamnovak opened 5 months ago

adamnovak commented 5 months ago

Building on #4775, Slurm currently implements worker cleanup via the fallback base-class behavior, in which the last job still running on a node performs the cleanup.

If we use this with caching on our Slurm cluster, we won't get a good result for workflows that run one job at a time on a node. Each job will launch, download files into the cache, finish its work, see that it is the last job on the node, and clean up. Then the next job scheduled on that node will arrive to an empty cache and have to fill it again.

We should implement real cleanup for Slurm, rather than using the behavior it inherits from AbstractGridEngineBatchSystem. The Slurm batch system should keep a set of all the Slurm node names that workflow jobs have run on, and at shutdown it should issue special cleanup jobs pre-assigned to each of those nodes to do the cleanup work.

Since Slurm doesn't schedule based on disk usage, we don't need to keep an active Slurm job around to own the cache, at the Slurm level.

We'll still have to deal with Slurm sometimes not sending the next job to the node where the previous job just cached files. Eventually we might want data gravity, but that's going to need #3071 and will probably be a whole separate system.
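The proposed design could be sketched roughly as follows. This is a minimal illustration, not Toil's actual API: the class and method names here (`SlurmNodeCleanup`, `record_job_node`, etc.) are invented for the example, and it assumes cleanup jobs can be pinned to individual nodes with `sbatch --nodelist`.

```python
# Hypothetical sketch of per-node Slurm cleanup, NOT Toil's real implementation.
# Assumes a cleanup script exists and that sbatch's --nodelist flag is usable
# to pin a job to a specific node.
import subprocess
from typing import List, Set


class SlurmNodeCleanup:
    """Track which Slurm nodes workflow jobs ran on; clean each at shutdown."""

    def __init__(self) -> None:
        self.used_nodes: Set[str] = set()

    def record_job_node(self, node_name: str) -> None:
        # Call this whenever a workflow job is observed running on a node
        # (e.g. from squeue/sacct NodeList output).
        self.used_nodes.add(node_name)

    def cleanup_command(self, node_name: str, cleanup_script: str) -> List[str]:
        # Build an sbatch command pinned to exactly one node:
        # --nodelist restricts the allocation to the named node, -N1 asks
        # for a single node, and --wait blocks until the job finishes.
        return [
            "sbatch", "--wait", "-N1",
            f"--nodelist={node_name}",
            cleanup_script,
        ]

    def shutdown(self, cleanup_script: str) -> None:
        # At batch-system shutdown, submit one cleanup job per used node.
        for node in sorted(self.used_nodes):
            subprocess.run(self.cleanup_command(node, cleanup_script), check=True)
```

One design question this sketch leaves open is what happens when a used node has since been drained or removed from the partition; the real implementation would presumably need a timeout or a best-effort policy so shutdown doesn't hang on unreachable nodes.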

Issue is synchronized with this Jira Story. Issue Number: TOIL-1546

adamnovak commented 5 hours ago

This might be related to or maybe required for #5084.