DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.

Slurm toil.Job: update _cores, _memory, _disk with the actually allocated amounts #2395

Open mr-c opened 6 years ago

mr-c commented 6 years ago

`_cores`, `_memory`, and `_disk` should be updated prior to calling `run()`.

For example, Slurm sets the environment variables `SLURM_CPUS_ON_NODE` and `SLURM_MEM_PER_NODE`.
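
A minimal sketch of the idea, assuming a worker-side hook and the `_cores`/`_memory` attributes named in the title; the helper name is hypothetical, and `SLURM_MEM_PER_NODE` is treated as a plain megabyte count per Slurm's `--mem` convention:

```python
import os

def update_job_from_slurm_env(job):
    # Hypothetical helper: overwrite the job's recorded requirements with what
    # Slurm actually allocated, just before the worker calls job.run().
    cpus = os.environ.get('SLURM_CPUS_ON_NODE')
    if cpus is not None:
        job._cores = int(cpus)
    mem = os.environ.get('SLURM_MEM_PER_NODE')
    if mem is not None:
        # Assumes a bare megabyte value, as with --mem; Toil tracks bytes.
        job._memory = int(mem) * 1024 * 1024
```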

Issue is synchronized with this Jira Story (friendlyId: TOIL-281)

cricketsloan commented 5 years ago

➤ Adam Novak commented:

Aren’t those values supposed to be the amount of cores/memory/disk that the job in particular is allowed to use, and not the resources present on the node? Because multiple jobs can be running on a node at the same time and sharing the node’s resources.

Unless the underlying scheduler has a way to say something like “hey, I know you only asked for 12 GB of disk, but actually I have 100 GB of disk here that I can guarantee you nobody else will want to use while you are running, so you can make use of it to speed yourself up”, then I don’t see a real use case for passing any requirements info to the job other than what it was actually scheduled with in Toil.

mr-c commented 5 years ago

As in the original post, Slurm does provide this information:

`SLURM_CPUS_ON_NODE`: Number of CPUs on the allocated node.

adamnovak commented 5 years ago

OK, I found some documentation for SLURM_CPUS_ON_NODE that supports interpreting that variable not as the CPUs that the machine has but as the CPUs that the job has been allocated. So it would make sense to expose that to Toil jobs, so they can opportunistically use more cores when available.

I don't see any evidence that SLURM_MEM_PER_NODE has the same opportunistic semantics, though. The documentation says that it is "Same as --mem", which suggests to me that it won't change even if more memory than was requested is available. And I don't see any variables for disk or --tmp at all.

So we do want a system to expose the actual available resources to Toil jobs, in the case where the batch system provides more than requested, but it seems like for Slurm we would only actually be able to do CPUs. On the other hand, we could also use it when doing chaining, and make the whole parent job's allocation visible to the chained jobs.
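
A sketch of what the CPUs-only case could look like, assuming `SLURM_CPUS_ON_NODE` is the only opportunistic signal Slurm gives us; the function and the max-with-requested policy are illustrative, not existing Toil API:

```python
import os

def effective_cores(requested_cores):
    # Only CPUs have an opportunistic signal from Slurm; memory and disk stay
    # at whatever the job was scheduled with.
    allocated = os.environ.get('SLURM_CPUS_ON_NODE')
    if allocated is None:
        return requested_cores
    # Never shrink below the request; only expose extra cores if Slurm allocated more.
    return max(requested_cores, int(allocated))
```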

Do we want to just clobber the record of the requirements the job was initially created with? Workflows might be using that information to schedule more jobs, in which case one job being opportunistically given more resources would result in later jobs claiming to require that larger allocation, when they don't really need it.

On the other hand, not just clobbering the requirements the job was initially created with would be more complex, because we'd have to introduce some new way to get the actually-available resources.

adamnovak commented 3 years ago

We could do this for chaining too: a chained job can use the resources of the job it was chained after, if they were greater, so we could update cores, memory, and disk on the job so that it sees the larger allocation available.

We would add some code in the worker to update the fields on jobs, just before they run, if these environment variables are set or if they are being chained.

We might want to have the cores, memory, and disk accessors consult some non-serialized temporary override fields to do this, so that we don't somehow persist this job embiggening back to the job store.
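
A rough sketch of that last idea, assuming a requirements object whose accessors prefer a transient override and whose pickling drops it, so the enlargement never reaches the job store; all names here (`set_resource_overrides`, `_overrides`) are invented for illustration:

```python
class JobRequirements:
    """Illustrative stand-in for the requirement-tracking part of toil.job.Job."""

    def __init__(self, cores, memory, disk):
        # Persisted values: what the job was originally scheduled with.
        self._cores, self._memory, self._disk = cores, memory, disk
        # Transient overrides: set by the worker, never written back.
        self._overrides = {}

    @property
    def cores(self):
        return self._overrides.get('cores', self._cores)

    @property
    def memory(self):
        return self._overrides.get('memory', self._memory)

    @property
    def disk(self):
        return self._overrides.get('disk', self._disk)

    def set_resource_overrides(self, cores=None, memory=None, disk=None):
        # Called by the worker just before run(), from Slurm's environment or
        # from a chained predecessor's larger allocation. Only ever grows.
        if cores is not None:
            self._overrides['cores'] = max(self._cores, cores)
        if memory is not None:
            self._overrides['memory'] = max(self._memory, memory)
        if disk is not None:
            self._overrides['disk'] = max(self._disk, disk)

    def __getstate__(self):
        # Drop the overrides when serializing, so the opportunistic bump is not
        # persisted back to the job store.
        state = self.__dict__.copy()
        state['_overrides'] = {}
        return state
```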