Open sparisi opened 10 months ago
The first step would be to make sure you can achieve this with SLURM. Once you figure out which SLURM settings to use, you can look into how to translate them into submitit options.
If it's not doable with SLURM, you could use a 2-step approach where a first script launches the N jobs, then each job uses a local launcher (e.g. joblib) to run the desired jobs on the machine.
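A minimal sketch of what the second step could look like on each machine, assuming every SLURM job receives a list of run parameters and fans them out over its own CPUs. It uses the standard-library `concurrent.futures` in place of joblib to stay dependency-free. `run_one`, `local_workers` and `run_chunk` are hypothetical names, and `run_one` is only a placeholder for the real experiment. `SLURM_CPUS_PER_TASK` is the variable SLURM sets when `--cpus-per-task` is given.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def run_one(seed: int) -> int:
    """Placeholder for one short run (hypothetical workload)."""
    return seed * seed


def local_workers() -> int:
    """CPUs available to this job: SLURM_CPUS_PER_TASK if set, else all cores."""
    return int(os.environ.get("SLURM_CPUS_PER_TASK", os.cpu_count() or 1))


def run_chunk(seeds, n_workers=None):
    """Run every seed on this machine, at most n_workers at a time."""
    n_workers = n_workers or local_workers()
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(run_one, seeds))
```

Calling `run_chunk(range(500))` from the job's entry point (under an `if __name__ == "__main__":` guard) inside a job that was given 32 CPUs would keep at most 32 runs in flight at a time.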
@odelalleau I don't see why it should not be doable, but I don't really know how the submitit plugin submits jobs. If I request n nodes, x tasks per node, and y CPUs per task, how many machines am I requesting? And how does array_parallelism work? Does it run multiple runs in parallel over CPUs? I cannot find documentation about this.
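For the arithmetic part of this question, here is a rough sketch under standard SLURM semantics. It assumes the launcher's `nodes`, `tasks_per_node` and `cpus_per_task` values are forwarded unchanged to `--nodes`, `--ntasks-per-node` and `--cpus-per-task`, which is an assumption and not something confirmed in this thread. `job_footprint` is a hypothetical helper, and it says nothing about `array_parallelism`.

```python
def job_footprint(nodes: int, tasks_per_node: int, cpus_per_task: int) -> dict:
    """Resources a single SLURM job receives under standard sbatch semantics."""
    cpus_per_machine = tasks_per_node * cpus_per_task
    return {
        "machines": nodes,                      # --nodes
        "tasks": nodes * tasks_per_node,        # --ntasks-per-node on each node
        "cpus_per_machine": cpus_per_machine,   # tasks on a node x CPUs per task
        "total_cpus": nodes * cpus_per_machine,
    }


# One machine, one task, 32 CPUs for that task.
print(job_footprint(nodes=1, tasks_per_node=1, cpus_per_task=32))
```

Under that assumption, one job with 1 node, 1 task per node and 32 CPUs per task occupies a single machine with 32 CPUs.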
I am already doing the 2-step approach but I would like to automate the process.
I'm not familiar enough with SLURM to answer this question -- my point was, before trying to make it work with submitit, first make sure that SLURM supports this kind of parallelization (it's not obvious to me).
What I had in mind for the 2-step approach would still be fully automated (the second step is launched automatically on each machine by SLURM).
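One way the second step could pick up its share of the work automatically, assuming the N jobs are submitted as a SLURM job array with indices starting at 0: SLURM sets `SLURM_ARRAY_TASK_ID` in each array task, and the script uses it to slice the full run list. `chunk_for_task`, `current_task_id`, the chunk size and the sweep size are illustrative, not taken from the thread.

```python
import os


def chunk_for_task(runs, chunk_size, task_id):
    """Return the slice of `runs` that array task `task_id` should execute."""
    start = task_id * chunk_size
    return runs[start:start + chunk_size]


def current_task_id(default=0):
    """Array index assigned by SLURM; falls back to `default` outside an array job."""
    return int(os.environ.get("SLURM_ARRAY_TASK_ID", default))


all_runs = list(range(1200))  # hypothetical sweep of 1200 runs
my_runs = chunk_for_task(all_runs, chunk_size=500, task_id=current_task_id())
print(len(my_runs))
```

The last task simply gets a shorter slice when the sweep size is not a multiple of the chunk size.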
I see. Yes, I am writing a script to automate the 2-step option. Still, needing just one YAML file with the submitit launcher config would be nicer and simpler.
I need to submit thousands of parallel short runs. I currently do it with the Joblib plugin because it's very easy to use and fast, but I submit runs in chunks. E.g., first I submit 500 runs running in parallel over 32 CPUs. When they are done, I submit another 500, and so on.
I would like to parallelize everything with the Slurm plugin like this:
- N Slurm jobs, each requesting X CPUs from 1 machine.
- Y jobs per Slurm job, of which X run in parallel.

I am having trouble understanding the right parameters to pass to the Slurm launcher. For example, say my current config is the following
How do I set tasks, nodes, and array to have, for instance, N Slurm jobs over N machines, each with 32 CPUs, where each machine runs 500 runs in parallel over those 32 CPUs? (N should be determined automatically from the total number of runs in the sweep.)
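To make the numbers in this question concrete, here is a small sketch of the arithmetic. It assumes 500 runs per Slurm job and 32 CPUs per machine as in the example above, plus a hypothetical sweep of 3000 runs. N follows from the sweep size, and since only 32 runs fit on the CPUs at once, each job works through its 500 runs in successive waves (assuming runs of roughly equal length). `plan_sweep` is a hypothetical helper.

```python
import math


def plan_sweep(total_runs, runs_per_job=500, cpus_per_job=32):
    """Return (number of Slurm jobs, waves of parallel runs inside one full job)."""
    n_jobs = math.ceil(total_runs / runs_per_job)
    # A job never holds more runs than the sweep contains.
    runs_in_job = min(runs_per_job, total_runs)
    waves = math.ceil(runs_in_job / cpus_per_job)
    return n_jobs, waves


n_jobs, waves = plan_sweep(3000)
print(n_jobs, waves)  # 6 jobs, 16 waves of up to 32 runs each
```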