facebookresearch / hydra

Hydra is a framework for elegantly configuring complex applications
https://hydra.cc
MIT License
8.81k stars 634 forks source link

[Bug] Submitit has optional parameter tasks_per_node set #1898

Open nils-werner opened 2 years ago

nils-werner commented 2 years ago

🐛 Bug

Description

Submitit currently has the parameter tasks_per_node = 1 set by default and no way of removing it. Also overloading it with null does not seem to be possible.

For slurm this parameter is optional, so I am not sure if forcing its existence makes sense here.

The reason I have stumbled across this is because my HPC provider only allows this parameter

I have also contacted the provided to check whether this is the intended behaviour.

Checklist

To reproduce

Minimal Code/Config snippet to reproduce

Does not work on my provider:

# @package hydra.launcher
defaults:
  - submitit_slurm

gres: gpu

Stack trace/error message

[2021-11-22 09:13:46,006][HYDRA] Submitit 'slurm' sweep output dir : multirun/2021-11-22/09-13-44
[2021-11-22 09:13:46,009][HYDRA]        #0 : 
sbatch: error: Batch job submission failed: Requested node configuration is not available
subprocess.CalledProcessError: Command '['sbatch', '/home/me/mlops/repro/multirun/2021-11-22/09-13-44/.submitit/submission_file_bb628e0ca6884ebfaa8781633a607b52.sh']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/me/.conda/envs/mlops/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/home/me/.conda/envs/mlops/lib/python3.8/site-packages/hydra/_internal/utils.py", line 386, in <lambda>
    lambda: hydra.multirun(
  File "/home/me/.conda/envs/mlops/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 140, in multirun
    ret = sweeper.sweep(arguments=task_overrides)
  File "/home/me/.conda/envs/mlops/lib/python3.8/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 157, in sweep
    results = self.launcher.launch(batch, initial_job_idx=initial_job_idx)
  File "/home/me/.conda/envs/mlops/lib/python3.8/site-packages/hydra_plugins/hydra_submitit_launcher/submitit_launcher.py", line 145, in launch
    jobs = executor.map_array(self, *zip(*job_params))
  File "/home/me/.conda/envs/mlops/lib/python3.8/site-packages/submitit/core/core.py", line 700, in map_array
    return self._internal_process_submissions(submissions)
  File "/home/me/.conda/envs/mlops/lib/python3.8/site-packages/submitit/auto/auto.py", line 218, in _internal_process_submissions
    return self._executor._internal_process_submissions(delayed_submissions)
  File "/home/me/.conda/envs/mlops/lib/python3.8/site-packages/submitit/slurm/slurm.py", line 313, in _internal_process_submissions
    return super()._internal_process_submissions(delayed_submissions)
  File "/home/me/.conda/envs/mlops/lib/python3.8/site-packages/submitit/core/core.py", line 822, in _internal_process_submissions
    job = self._submit_command(self._submitit_command_str)
  File "/home/me/.conda/envs/mlops/lib/python3.8/site-packages/submitit/core/core.py", line 863, in _submit_command
    output = utils.CommandFunction(command_list, verbose=False)()  # explicit errors
  File "/home/me/.conda/envs/mlops/lib/python3.8/site-packages/submitit/core/utils.py", line 350, in __call__
    raise FailedJobError(stderr) from subprocess_error
submitit.core.utils.FailedJobError: sbatch: error: Batch job submission failed: Requested node configuration is not available

Removing the parameter is not possible

# @package hydra.launcher
defaults:
  - submitit_slurm

tasks_per_node: null
gres: gpu

Stack trace/error message

In 'hydra/launcher/hpc': Validation error while composing config:
Non optional field cannot be assigned None
    full_key: hydra.launcher.tasks_per_node
    object_type=SlurmQueueConf

Setting it to 8 works, but results in superfluous runs

# @package hydra.launcher
defaults:
  - submitit_slurm

tasks_per_node: 8
gres: gpu

Stack trace/error message

[2021-11-22 09:14:51,968][HYDRA] Submitit 'slurm' sweep output dir : multirun/2021-11-22/09-14-50
[2021-11-22 09:14:51,972][HYDRA]        #0 :
cat multirun/2021-11-22/09-14-50/0/train.log
[2021-11-22 09:14:56,383][submitit][INFO] - Job completed successfully
[2021-11-22 09:14:56,383][submitit][INFO] - Job completed successfully
[2021-11-22 09:14:56,385][submitit][INFO] - Job completed successfully
[2021-11-22 09:14:56,387][submitit][INFO] - Job completed successfully
[2021-11-22 09:14:56,387][submitit][INFO] - Job completed successfully
[2021-11-22 09:14:56,388][submitit][INFO] - Job completed successfully
[2021-11-22 09:14:56,389][submitit][INFO] - Job completed successfully
[2021-11-22 09:14:56,390][submitit][INFO] - Job completed successfully

Not inheriting from submitit_slurm works

# @package hydra.launcher
submitit_folder: $&#123;hydra.sweep.dir/.submitit/%j
name: ${hydra.job.name}
_target_: hydra_plugins.hydra_submitit_launcher.submitit_launcher.SlurmLauncher
gres: gpu

Stack trace/error message

[2021-11-22 09:33:48,155][HYDRA] Submitit 'slurm' sweep output dir : multirun/2021-11-22/09-33-47
[2021-11-22 09:33:48,159][HYDRA]        #0 :

Expected Behavior

The key hydra.launcher.submitit_slurm.tasks_per_node should not be set by default, as it is does not appear to be required by slurm.

System information

nils-werner commented 2 years ago

I just realized that for my provider, also setting gpus_per_task fixes the issue

# @package hydra.launcher
defaults:
  - submitit_slurm

gpus_per_task: 1
gres: gpu

But the question whether this parameter is really required remains.

Jasha10 commented 2 years ago

The config on Hydra's side is based on the API exposed by the submitit library.

Looking at the submitit codebase, it seems to me that tasks_per_node must be an integer (i.e. cannot be null/None). In that file, the same default of tasks_per_node==1 is used.

It makes sense to me that, in the BaseQueueConf file you've linked above, we should have tasks_per_node typed as int rather than Optional[int] because this is what seems to be required by submitit.

jieru-hu commented 2 years ago

It makes sense to me that, in the BaseQueueConf file you've linked above, we should have tasks_per_node typed as int rather than Optional[int] because this is what seems to be required by submitit.

@Jasha10 not sure what you meant here - it seems the typing is already int not Optional[int]

@nils-werner thanks for the issue. Since the launcher is built on top of structured config, we'd need the tasks_per_node present in the conf dataclass. but, as you mentioned, you can customize your launcher config however you want so you don't necessarily need to inherit the launcher config.

Jasha10 commented 2 years ago

Sorry, I think my wording was unclear. Yes, the typing is currently int (not Optional[int]). What I meant to say is that I think we should keep it that way (because that seems to be what submitit accepts).