ALRhub / clusterduck

clusterduck is a hydra launcher plugin for running jobs in batches on a SLURM cluster. It is intended for small tasks on clusters where jobs have exclusive access to a node, such that submitting a single task to a node would be wasteful.

rename "per_node" fields #15

Open pbecker93 opened 1 year ago

pbecker93 commented 1 year ago

Should we rename the "per_node" fields (i.e. parallel_runs_per_node, total_runs_per_node), since they are not per node but per slurm job? I get the issue of everything being called a "job", so maybe never use it standalone but always say "slurm_job" or "hydra_job"? Or maybe someone has a better solution.
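For context, a hypothetical sketch of the fields in question (not the actual clusterduck config schema), just to show where the naming gets confusing:

```python
from dataclasses import dataclass

# Hypothetical illustration only; the real clusterduck launcher config
# may be structured differently. Despite the "per_node" suffix, both
# values apply to a single slurm job, not to a physical node.
@dataclass
class LauncherConfig:
    parallel_runs_per_node: int = 4   # hydra jobs run concurrently within one slurm job
    total_runs_per_node: int = 10     # hydra jobs assigned to one slurm job in total
```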

ScheiklP commented 1 year ago

On the contrary, I would prefer changing the implementation so that it is "per node" and not "per slurm job". :D I think it is very convenient to do it on a per-node basis, because ain't nobody got time for figuring out how many parallel things can be started in a job if there are n nodes with m GPUs.

balazsgyenes commented 1 year ago

@pbecker93, how do you feel about parallel_runs_per_slurm?

@ScheiklP, but clusterduck doesn't control how many nodes there are, just the number of slurm jobs that are started. And in theory you can have slurm jobs that use multiple nodes. I'm not quite sure I understand what needs to be "figured out" in your use case; maybe you can explain.

ScheiklP commented 1 year ago

I guess my expected / desired behavior for num_nodes > 1 would be that clusterduck just scales stuff up.

So for total_runs_per_node = 10, parallel_runs_per_node = 4, num_nodes = 2 -> 8 parallel runs per job, with 20 runs in total (sketched below).

But as you said, there is only ever one node :D
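In numbers, the scaling I had in mind would have been the following (a toy illustration only, not current clusterduck behavior):

```python
# Toy illustration of the "scale up with num_nodes" behavior described above;
# not what clusterduck actually does.
total_runs_per_node = 10
parallel_runs_per_node = 4
num_nodes = 2

parallel_runs_per_job = parallel_runs_per_node * num_nodes  # 8
total_runs_per_job = total_runs_per_node * num_nodes        # 20
```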

pbecker93 commented 1 year ago

@balazsgyenes: better, but parallel_runs_per_slurm_job might be even more explicit?

@ScheiklP I am with Balazs here: on any normal cluster (i.e. not Horeka), the same number of jobs might not even end up on the same number of physical nodes at different times, and there is no control over how many nodes you get.

ScheiklP commented 1 year ago

@pbecker93 Sure you can. That's what the nodes parameter of SLURM is for. A job with nodes=2 will be like one machine.

balazsgyenes commented 1 year ago

Paul, I'm still not totally sure what you mean by "there is only ever one node".

If I understood you correctly, I would have nothing against num_parallel_slurm_jobs, where a user can specify either that or total_runs_per_slurm_job (but not both), and the hydra jobs would be distributed among the slurm jobs accordingly. I think it might be a bit confusing for a first-time user, but maybe worth it. Is this what you want?
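To make the proposal concrete, roughly this kind of arithmetic (hypothetical names and behavior, purely illustrative, not actual clusterduck code):

```python
import math

# Hypothetical sketch of the mutually exclusive options proposed above;
# names and behavior are illustrative, not actual clusterduck code.
def plan_slurm_jobs(n_hydra_jobs, num_parallel_slurm_jobs=None,
                    total_runs_per_slurm_job=None):
    if (num_parallel_slurm_jobs is None) == (total_runs_per_slurm_job is None):
        raise ValueError("specify exactly one of the two options")
    if num_parallel_slurm_jobs is not None:
        # fixed number of slurm jobs, hydra jobs split evenly among them
        runs_per_job = math.ceil(n_hydra_jobs / num_parallel_slurm_jobs)
        return num_parallel_slurm_jobs, runs_per_job
    # fixed batch size per slurm job, number of slurm jobs follows from it
    n_slurm_jobs = math.ceil(n_hydra_jobs / total_runs_per_slurm_job)
    return n_slurm_jobs, total_runs_per_slurm_job
```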

ScheiklP commented 1 year ago

The jobs that clusterduck submits only request 1 node per job, right?

balazsgyenes commented 1 year ago

It's configurable. The intended use is to request all the resources that you know a node has, and then request one node, but slurm isn't actually supposed to work like that. If I say num_nodes=2, that means that each job is spread over 2 nodes, but doesn't necessarily have exclusive access to them.

ScheiklP commented 1 year ago

Exclusivity of resources on a node is defined by the cluster maintainer. For Alex, for example, nodes are not exclusive, but GPUs are. For Horeka, all resources on a node are exclusive. So I am not sure what you mean.

So the current behavior is total_runs_per_node = 10, parallel_runs_per_node = 4, num_nodes = 2 -> 4 parallel runs per job with 10 runs in total, i.e. 2 parallel runs per node?

balazsgyenes commented 1 year ago

num_nodes is a slurm parameter that is very different from the rest. If you specify n_tasks=4 and num_nodes=2, your four tasks will be spread across 2 nodes, requiring inter-process communication to synchronize them. With a single task and multiple nodes, I think it duplicates the task across nodes, but I'm not 100% certain.

So with total_runs_per_node = 10, parallel_runs_per_node = 4 -> 4 parallel runs per slurm job with up to 10 hydra jobs each, each hydra job might get run twice (I'm not sure), and you have no control over which nodes your slurm jobs run on.
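In code, the batching within a single slurm job roughly amounts to this (a simplified sketch of the arithmetic described above, not the real launcher implementation):

```python
from concurrent.futures import ProcessPoolExecutor

def run_hydra_job(job):
    """Stand-in for executing a single hydra job."""
    return job

# Simplified sketch of the batching inside one slurm job; the actual
# clusterduck launcher code differs.
def run_batch(hydra_jobs, parallel_runs_per_node=4, total_runs_per_node=10):
    # one slurm job receives at most `total_runs_per_node` hydra jobs...
    batch = hydra_jobs[:total_runs_per_node]
    # ...and executes at most `parallel_runs_per_node` of them at a time
    with ProcessPoolExecutor(max_workers=parallel_runs_per_node) as pool:
        return list(pool.map(run_hydra_job, batch))
```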

ScheiklP commented 1 year ago

I have a very vague memory that with n_tasks = m, the same thing you run will just be executed m times in parallel on each node.

So if your script says python train.py and you have num_nodes = 2 and n_tasks = 4, it will run python train.py a total of 8 times: 4 in parallel per node.