facebookresearch / hydra

Hydra is a framework for elegantly configuring complex applications
https://hydra.cc
MIT License

Hydra + Submitit Help #2786

Open sparisi opened 10 months ago

sparisi commented 10 months ago

I need to submit thousands of short runs in parallel. I currently do it with the Joblib plugin because it's very easy to use and fast, but I have to submit the runs in chunks. E.g., first I submit 500 runs that execute in parallel over 32 CPUs. When they are done, I submit another 500, and so on.

I would like to parallelize everything with the Slurm plugin like this:

I am having trouble understanding the right parameters to pass to the Slurm launcher. For example, say my current config is the following:

# @package hydra.launcher
_target_: hydra_plugins.hydra_submitit_launcher.submitit_launcher.SlurmLauncher
submitit_folder: ${hydra.sweep.dir}/.submitit/%j
name: ${hydra.job.name}
partition: default
timeout_min: 10
cpus_per_task: 32
tasks_per_node: 1
mem_gb: 4
nodes: 1
max_num_timeout: 100
array_parallelism: 256

How do I set tasks, nodes, and array parameters to get, for instance, N Slurm jobs over N machines, each with 32 CPUs, where each machine executes 500 runs in parallel over those 32 CPUs? (N should be determined automatically from the total number of runs in the sweep.)

odelalleau commented 10 months ago

The first step would be to be sure you can achieve this with SLURM. Once you figure out the SLURM settings to use, then you can look into how to translate it into submitit options.

If it's not doable with SLURM, you could use a 2-step approach where a first script launches the N jobs, then each job uses a local launcher (e.g. joblib) to run the desired jobs on the machine.
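A minimal sketch of the first step of that 2-step approach, assuming the sweep is just a list of run parameters and each machine should handle at most 500 of them. The names `split_into_chunks`, `runs_per_job`, and `seeds` are hypothetical and are not part of Hydra or submitit; the point is only that N follows from the sweep size.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_chunks(runs: Sequence[T], runs_per_job: int) -> List[List[T]]:
    """Split the sweep into N chunks, where N is derived from the sweep size."""
    if runs_per_job <= 0:
        raise ValueError("runs_per_job must be positive")
    n_jobs = math.ceil(len(runs) / runs_per_job)
    return [
        list(runs[i * runs_per_job:(i + 1) * runs_per_job])
        for i in range(n_jobs)
    ]


# Hypothetical sweep: 1200 seeds with 500 runs per machine gives 3 Slurm jobs.
seeds = list(range(1200))
chunks = split_into_chunks(seeds, runs_per_job=500)

# Each chunk would be handed to one Slurm job; inside that job, a local
# launcher (e.g. joblib) would fan the chunk out over the machine's CPUs.
print(len(chunks), [len(c) for c in chunks])  # 3 [500, 500, 200]
```

The last chunk is allowed to be smaller, so no run is dropped when the sweep size is not a multiple of the chunk size.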

sparisi commented 10 months ago

@odelalleau I don't see why it should not be doable, but I don't really know how the submitit plugin submits jobs. If I request n nodes, x tasks per node, and y CPUs per task, how many machines am I going to request? And how does array_parallelism work? Does it execute multiple runs in parallel over the CPUs? I cannot find documentation about this.
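The arithmetic behind this question can be sketched as follows. This is not confirmed anywhere in this thread: it assumes that the launcher submits each run of the sweep as one element of a Slurm job array, that every element gets its own allocation of `nodes` machines, and that `array_parallelism` only caps how many elements run at the same time. The `LauncherResources` class is hypothetical and only mirrors the field names from the config above.

```python
from dataclasses import dataclass


@dataclass
class LauncherResources:
    # Field names mirror the launcher config; the class itself is illustrative.
    nodes: int
    tasks_per_node: int
    cpus_per_task: int
    array_parallelism: int

    def cpus_per_job(self) -> int:
        """CPUs allocated to a single run, under the assumptions stated above."""
        return self.nodes * self.tasks_per_node * self.cpus_per_task

    def max_concurrent_jobs(self, total_runs: int) -> int:
        """Upper bound on runs executing at once (assumed array throttle)."""
        return min(total_runs, self.array_parallelism)


cfg = LauncherResources(
    nodes=1, tasks_per_node=1, cpus_per_task=32, array_parallelism=256
)
print(cfg.cpus_per_job())             # 32
print(cfg.max_concurrent_jobs(1500))  # 256
```

Under these assumptions, the config above would give every single run its own 32-CPU allocation, with up to 256 such runs active at once, rather than packing 500 runs onto one 32-CPU machine.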

I am already doing the 2-step approach but I would like to automate the process.

odelalleau commented 10 months ago

I'm not familiar enough with SLURM to answer this question -- my point was, before trying to make it work with submitit, first make sure that SLURM supports this kind of parallelization (it's not obvious to me).

What I had in mind for the 2-step approach would still be fully automated (the second step is launched automatically on each machine by SLURM).

sparisi commented 10 months ago

I see. Yes, I am writing a script to automate the 2-step option. Still, having to use just 1 YAML file with the submitit launcher config would be nicer / simpler.