It looks like this might break Slurm in general, though:
________ SlurmTest.test_coalesce_job_exit_codes_sacct_raises_job_exists ________
[gw1] linux -- Python 3.12.0 /builds/databiosphere/toil/venv/bin/python
Traceback (most recent call last):
File "/builds/databiosphere/toil/src/toil/batchSystems/slurm.py", line 236, in _get_job_details
status_dict = self._getJobDetailsFromSacct(job_id_list)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/builds/databiosphere/toil/src/toil/test/batchSystems/test_slurm.py", line 159, in call_sacct_raises
raise CalledProcessErrorStderr(1, "sacct: error: Problem talking to the database: "
toil.lib.misc.CalledProcessErrorStderr: <super: <class 'CalledProcessErrorStderr'>, <CalledProcessErrorStderr object>>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
yield
File "/usr/lib/python3.12/unittest/case.py", line 634, in run
self._callTestMethod(testMethod)
File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
if method() is not None:
^^^^^^^^
File "/builds/databiosphere/toil/src/toil/test/batchSystems/test_slurm.py", line 389, in test_coalesce_job_exit_codes_sacct_raises_job_exists
result = self.worker.coalesce_job_exit_codes(job_ids)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/builds/databiosphere/toil/src/toil/batchSystems/slurm.py", line 208, in coalesce_job_exit_codes
status_dict = self._get_job_details(job_id_list)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/builds/databiosphere/toil/src/toil/batchSystems/slurm.py", line 238, in _get_job_details
status_dict = self._getJobDetailsFromScontrol(job_id_list)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/builds/databiosphere/toil/src/toil/batchSystems/slurm.py", line 400, in _getJobDetailsFromScontrol
job[key] += ' ' + bits[0]
~~~^^^^^
KeyError: 'NtasksPerTRES:0'
I think the problem is https://github.com/DataBiosphere/toil/pull/4951/files#diff-62547683dfb9de5a967fd3260b78612c1e4fb08470399f06c452bd2c690c526eR399
That loop actually depends on key being set by one line and then reused for the lines that follow; you can't take the same line's text as both the key and the value, which is what this change does.
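For illustration, here is a minimal sketch of continuation-aware parsing (not the actual slurm.py code; the helper name and sample lines are made up): a token without an = sign has to be appended to the previously seen key rather than being treated as a key itself.

```python
# Minimal sketch (not the actual slurm.py code): parse "Key=Value" tokens from
# scontrol-style output, appending tokens without '=' to the *previous* key.
def parse_scontrol_lines(lines):
    job = {}
    key = None
    for line in lines:
        for token in line.split():
            if '=' in token:
                # A new "Key=Value" pair starts here; remember the key so later
                # tokens without '=' can be appended to its value.
                key, _, value = token.partition('=')
                job[key] = value
            elif key is not None:
                # Continuation text belongs to the most recently seen key;
                # using the token itself as a key would raise KeyError,
                # as with 'NtasksPerTRES:0' above.
                job[key] += ' ' + token
    return job

# Hypothetical usage:
print(parse_scontrol_lines(["JobName=my job", "NtasksPerTRES:0"]))
```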
Will this work for CWL as well?
The Slurm changes should work on the CWL side. The original intention of this PR was to remove the need for a user to supply a partition through TOIL_SLURM_ARGS when a job needs a GPU:
https://github.com/DataBiosphere/toil/blob/92a90ee8e69acd520e7da52ec8a7559b44b7d4c9/src/toil/batchSystems/slurm.py#L470-L476
This PR allows Toil to switch between the default partition and a GPU partition by itself, as long as no partition is specified. So as long as toil-cwl-runner has the machinery to handle GPUs, this should work.
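For context, the workaround this removes looks roughly like the following (the partition name is site-specific and purely illustrative):

```sh
# Before: force GPU jobs onto a GPU partition by hand (partition name is site-specific).
export TOIL_SLURM_ARGS="--partition=gpu"

# After this PR: leave the partition out of TOIL_SLURM_ARGS and Toil chooses
# between the default partition and a GPU partition on its own.
unset TOIL_SLURM_ARGS
```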
Great, thank you!
Yes, the CWL side has been setting the gpu option for a while, if the CUDARequirement CWL extension is used:
https://doc.arvados.org/user/cwl/cwl-extensions.html#CUDARequirement
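For reference, a tool that needs a GPU declares it roughly like this (a minimal sketch based on the linked extension docs; the version and capability values are only illustrative):

```yaml
#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
$namespaces:
  cwltool: "http://commonwl.org/cwltool#"
requirements:
  cwltool:CUDARequirement:
    cudaVersionMin: "11.4"        # minimum CUDA version the tool needs (illustrative)
    cudaComputeCapability: "3.0"  # minimum compute capability (illustrative)
    cudaDeviceCountMin: 1         # ask for at least one GPU
baseCommand: nvidia-smi
inputs: []
outputs: []
```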
I updated the changelog to make this clear
Closes #4833 and #4949
Changelog Entry
To be copied to the draft changelog by merger:
- gpu field in WDL