Open nils-werner opened 2 years ago
I just realized that for my provider, also setting gpus_per_task fixes the issue:
# @package hydra.launcher
defaults:
- submitit_slurm
gpus_per_task: 1
gres: gpu
But the question of whether this parameter is really required remains.
The config on Hydra's side is based on the API exposed by the submitit library.
Looking at the submitit codebase, it seems to me that tasks_per_node must be an integer (i.e. cannot be null/None). In that file, the same default of tasks_per_node==1 is used.
It makes sense to me that, in the BaseQueueConf file you've linked above, we should have tasks_per_node typed as int rather than Optional[int] because this is what seems to be required by submitit.
> It makes sense to me that, in the BaseQueueConf file you've linked above, we should have tasks_per_node typed as int rather than Optional[int] because this is what seems to be required by submitit.
@Jasha10 not sure what you meant here - it seems the typing is already int, not Optional[int].
@nils-werner thanks for the issue. Since the launcher is built on top of structured config, we'd need tasks_per_node to be present in the conf dataclass. But, as you mentioned, you can customize your launcher config however you want, so you don't necessarily need to inherit the launcher config.
Sorry, I think my wording was unclear.
Yes, the typing is currently int (not Optional[int]).
What I meant to say is that I think we should keep it that way (because that seems to be what submitit accepts).
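To make the int versus Optional[int] distinction concrete, here is a minimal standard-library sketch. It is not Hydra's or submitit's actual code: the classes IntQueueConf and OptionalQueueConf and the helpers accepts_none and apply_override are hypothetical names, and the check only imitates the kind of validation a typed config performs when an override sets a field to null.

```python
from dataclasses import dataclass
from typing import Optional, get_args, get_type_hints


@dataclass
class IntQueueConf:
    # Hypothetical stand-in for the current typing: null is not a valid value.
    tasks_per_node: int = 1


@dataclass
class OptionalQueueConf:
    # Hypothetical alternative typing: null could mean "do not pass the option".
    tasks_per_node: Optional[int] = 1


def accepts_none(conf_cls, name):
    """Return True if the annotation of field `name` admits None."""
    hint = get_type_hints(conf_cls)[name]
    return type(None) in get_args(hint)


def apply_override(conf_cls, name, value):
    """Imitate a type-checked override: reject None unless the field is Optional."""
    if value is None and not accepts_none(conf_cls, name):
        raise TypeError(f"{name} is not Optional and cannot be set to None")
    return conf_cls(**{name: value})
```

With the int typing, apply_override(IntQueueConf, "tasks_per_node", None) raises TypeError, which mirrors the report that overriding the value with null is not possible. With the Optional[int] typing the same override is accepted, but whatever consumes the config would then have to handle None itself.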
🐛 Bug
Description
Submitit currently has the parameter tasks_per_node = 1 set by default and no way of removing it. Overriding it with null does not seem to be possible either.
For slurm this parameter is optional, so I am not sure if forcing its existence makes sense here.
The reason I have stumbled across this is that my HPC provider only allows this parameter to be either unset (which is not the same as --ntasks-per-node=1), or --ntasks-per-node=8 (which will result in repeated runs of the script).
I have also contacted the provider to check whether this is the intended behaviour.
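The difference between leaving the option unset and passing --ntasks-per-node=1 can be sketched as follows. This is an illustration only and not how submitit generates its batch script: sbatch_directives is a hypothetical helper that renders a dict of options as #SBATCH lines and skips any option whose value is None.

```python
def sbatch_directives(params):
    """Render a dict of options as #SBATCH lines, skipping options set to None."""
    lines = []
    for key, value in params.items():
        if value is None:
            continue  # unset: emit nothing and leave the choice to the cluster
        flag = key.replace("_", "-")
        lines.append(f"#SBATCH --{flag}={value}")
    return lines


# An explicit value of 1 produces a directive ...
with_default = sbatch_directives({"ntasks_per_node": 1, "gres": "gpu"})
# ... while None produces no --ntasks-per-node line at all.
unset = sbatch_directives({"ntasks_per_node": None, "gres": "gpu"})
```

Here with_default contains the line "#SBATCH --ntasks-per-node=1", while unset contains only "#SBATCH --gres=gpu". A scheduler configuration that rejects the explicit value could therefore still accept the second script.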
To reproduce
Minimal Code/Config snippet to reproduce
Does not work on my provider:
Stack trace/error message
Removing the parameter is not possible
Stack trace/error message
Setting it to 8 works, but results in superfluous runs
Stack trace/error message
Not inheriting from submitit_slurm works
Stack trace/error message
Expected Behavior
The key hydra.launcher.submitit_slurm.tasks_per_node should not be set by default, as it does not appear to be required by slurm.
System information