DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Grab GPU partition information from Slurm + Support for `gpu` field in WDL #4951

Closed: stxue1 closed this 3 months ago

stxue1 commented 4 months ago

Closes #4833 and #4949

Changelog Entry

To be copied to the draft changelog by merger:

Reviewer Checklist

Merger Checklist

adamnovak commented 3 months ago

It looks like this might break Slurm in general, though:

________ SlurmTest.test_coalesce_job_exit_codes_sacct_raises_job_exists ________
[gw1] linux -- Python 3.12.0 /builds/databiosphere/toil/venv/bin/python
Traceback (most recent call last):
  File "/builds/databiosphere/toil/src/toil/batchSystems/slurm.py", line 236, in _get_job_details
    status_dict = self._getJobDetailsFromSacct(job_id_list)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/builds/databiosphere/toil/src/toil/test/batchSystems/test_slurm.py", line 159, in call_sacct_raises
    raise CalledProcessErrorStderr(1, "sacct: error: Problem talking to the database: "
toil.lib.misc.CalledProcessErrorStderr: <super: <class 'CalledProcessErrorStderr'>, <CalledProcessErrorStderr object>>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/usr/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/builds/databiosphere/toil/src/toil/test/batchSystems/test_slurm.py", line 389, in test_coalesce_job_exit_codes_sacct_raises_job_exists
    result = self.worker.coalesce_job_exit_codes(job_ids)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/builds/databiosphere/toil/src/toil/batchSystems/slurm.py", line 208, in coalesce_job_exit_codes
    status_dict = self._get_job_details(job_id_list)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/builds/databiosphere/toil/src/toil/batchSystems/slurm.py", line 238, in _get_job_details
    status_dict = self._getJobDetailsFromScontrol(job_id_list)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/builds/databiosphere/toil/src/toil/batchSystems/slurm.py", line 400, in _getJobDetailsFromScontrol
    job[key] += ' ' + bits[0]
    ~~~^^^^^
KeyError: 'NtasksPerTRES:0'

I think the problem is https://github.com/DataBiosphere/toil/pull/4951/files#diff-62547683dfb9de5a967fd3260b78612c1e4fb08470399f06c452bd2c690c526eR399

That loop actually depends on `key` being set by one line and then used by later lines; you can't use the same line's token as both the key and the value, the way the code does now.
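To illustrate the pattern (a hedged sketch, not Toil's actual `_getJobDetailsFromScontrol` code): `scontrol show job` emits `Key=value` tokens, and a token without `=` is a continuation of the previous value, so the parser has to carry `key` across tokens instead of deriving it from the token being appended. The sample tokens below are illustrative:

```python
# Sketch of the scontrol key/value parsing pattern described above.

def parse_scontrol_tokens(tokens: list[str]) -> dict[str, str]:
    job: dict[str, str] = {}
    key = None
    for token in tokens:
        bits = token.split('=', 1)
        if len(bits) == 2:
            # A real Key=value pair: record it and remember the key so
            # later bare tokens can be appended to this value.
            key = bits[0]
            job[key] = bits[1]
        elif key is not None:
            # No '=' here: this token continues the previous value.
            # Using bits[0] itself as the key (the bug) raises KeyError
            # on tokens like 'NtasksPerTRES:0'.
            job[key] += ' ' + bits[0]
    return job

print(parse_scontrol_tokens(
    ["JobId=42", "TresPerNode=gres:gpu:2", "NtasksPerTRES:0"]))
# → {'JobId': '42', 'TresPerNode': 'gres:gpu:2 NtasksPerTRES:0'}
```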

mr-c commented 3 months ago

Will this work for CWL as well?

stxue1 commented 3 months ago

> Will this work for CWL as well?

The Slurm changes should work on the CWL side. The original intention of this PR was to remove the need for a user to supply a partition via `TOIL_SLURM_ARGS` when a job needs a GPU:

https://github.com/DataBiosphere/toil/blob/92a90ee8e69acd520e7da52ec8a7559b44b7d4c9/src/toil/batchSystems/slurm.py#L470-L476

This PR allows Toil to switch between the default partition and a GPU partition by itself, as long as no partition is specified.
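As a rough sketch of what "grabbing GPU partition information from Slurm" can look like (the `sinfo` invocation and sample output here are illustrative assumptions, not Toil's implementation): `sinfo -a -o "%P %G"` prints each partition name next to its generic resources (GRES), which is enough to spot GPU-capable partitions.

```python
# Sketch: find partitions whose GRES advertises GPUs, from `sinfo -a -o "%P %G"` output.

def gpu_partitions(sinfo_output: str) -> set[str]:
    found = set()
    for line in sinfo_output.splitlines()[1:]:  # skip the header row
        partition, _, gres = line.partition(' ')
        # A '*' suffix marks the cluster's default partition; strip it.
        partition = partition.rstrip('*')
        if gres.startswith('gpu'):
            found.add(partition)
    return found

sample = "PARTITION GRES\nmain* (null)\ngpu gpu:a100:4\n"
print(gpu_partitions(sample))
# → {'gpu'}
```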

So as long as toil-cwl-runner has the machinery to handle GPUs, this should work.

mr-c commented 3 months ago

> > Will this work for CWL as well?
>
> The Slurm changes should work on the CWL side. The original intention of this PR was to remove the need for a user to supply a partition via `TOIL_SLURM_ARGS` when a job needs a GPU:
>
> https://github.com/DataBiosphere/toil/blob/92a90ee8e69acd520e7da52ec8a7559b44b7d4c9/src/toil/batchSystems/slurm.py#L470-L476
>
> This PR allows Toil to switch between the default partition and a GPU partition by itself, as long as no partition is specified.
>
> So as long as toil-cwl-runner has the machinery to handle GPUs, this should work.

Great, thank you!

Yes, the CWL side has been setting the `gpu` option for a while when the `CUDARequirement` CWL extension is used:

https://github.com/DataBiosphere/toil/blob/92a90ee8e69acd520e7da52ec8a7559b44b7d4c9/src/toil/cwl/cwltoil.py#L2437-L2447

https://doc.arvados.org/user/cwl/cwl-extensions.html#CUDARequirement
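For reference, a minimal CWL tool using the `cwltool:CUDARequirement` extension might look like this; a sketch based on the extension docs linked above, with arbitrary version and capability values:

```yaml
cwlVersion: v1.2
class: CommandLineTool
$namespaces:
  cwltool: "http://commonwl.org/cwltool#"
requirements:
  cwltool:CUDARequirement:
    cudaVersionMin: "11.4"
    cudaComputeCapability: "8.0"
    cudaDeviceCountMin: 1
baseCommand: nvidia-smi
inputs: []
outputs: []
```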

I updated the changelog to make this clear.