Hi :) The `--workers-per-alloc` flag specifies the maximum number of nodes that will be requested for a Slurm/PBS allocation. Unless you want to do something with multiple nodes, such as running MPI code (which is not yet properly supported in HQ), or there is some other reason why you want multi-node allocations, you shouldn't really need to use this flag.

That being said, if you do use it, it should enqueue a Slurm allocation with 2 nodes and then create a worker on each of those nodes. It seems that in your situation the second step (creating the workers) has failed. I'm not really sure why; I will try to investigate the error message that you have posted.
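For reference, such a multi-node allocation queue is created roughly like this (a sketch; the time limit, partition and account below are placeholders):

```bash
# Allocation queue whose allocations request up to 2 nodes,
# with one HQ worker started on each node (placeholder Slurm arguments).
hq alloc add slurm --time-limit 1h --workers-per-alloc 2 -- --partition=normal --account=myproject
```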
> Hi :) The `--workers-per-alloc` flag specifies the maximum number of nodes that will be requested for a Slurm/PBS allocation. Unless you want to do something with multiple nodes, such as running MPI code (which is not yet properly supported in HQ), or there is some other reason why you want multi-node allocations, you shouldn't really need to use this flag.
Hey @Kobzol, just for clarification: do you mean that running MPI is not supported across multiple nodes (i.e. worker communication), or is it not supported on multiple cores of a single node either? I have been trying to use it for this and came across some problems (like cancelling jobs not cancelling the run of the underlying script-wrapped executable), but I'm not sure if it would make sense to report them if you don't support the use case yet.
I was talking about multi-node MPI specifically. While there's nothing stopping you from using multi-node MPI, currently you can run each task on a single node only. So HyperQueue cannot guarantee that you will get multiple nodes available for the task at the same time. There is support for multi-node tasks, but it's currently heavily experimental and WIP.
Using MPI on a single node should be completely fine, you can just say how many cores your task needs.
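For example, something like the following should work (illustrative only; the core count, script name and program are placeholders):

```bash
# Submit a task that reserves 8 cores of a single worker (node).
# run_mpi.sh would then call e.g. `mpirun -np 8 ./my_mpi_program`.
hq submit --cpus=8 ./run_mpi.sh
```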
Regarding the issue that tasks may not be killed properly when a task is cancelled or a worker is killed, we are aware of it (https://github.com/It4innovations/hyperqueue/issues/431).
In general, please report any issues that you find (unless they are a strict duplicate of some existing issue) :)
Hello @Kobzol , thank you for the explanation.
I wouldn't need to use this flag normally. But at the Swiss supercomputing centre CSCS, the scheduling priority is the same whether a job uses 1 node or 10. So launching 50 jobs on 10 nodes will be a much more efficient use of computational resources. Each of these jobs, and consequently the calculations, will only run on one node; I was simply trying to bundle them together.
In this case, the workers are created and hence the nodes are occupied, and the jobs are also shown as running, but the code which I use to run the calculation outputs the error I posted and somehow fails to recognise the way HQ allocates resources. From what I understood, the code shouldn't even know whether there is one node or several, since HQ is handling the connection with the cluster.
Does this specific use case make sense or is it something that HQ is not designed to achieve? Thanks!
> So launching 50 jobs on 10 nodes will be a much more efficient use of computational resources.
Is there any specific motivation for this? Unless you overload Slurm with thousands of allocations, or you hit some Slurm permission limit for the number of allocations in the queue, it should be fine to have 50 allocations, each with a single node. To HyperQueue, it won't be any different than e.g. having just 5 allocations, each with 10 nodes.
> Does this specific use case make sense or is it something that HQ is not designed to achieve?
The use case seems perfectly fine :) Could you maybe share the code that you're running (i.e. the script that you're submitting to HyperQueue)? By any chance, does your code invoke `srun`?
The reason for running those jobs together is that they don't scale well with the number of CPU cores in the cluster. Hence running 4 or 5 of those calculations bundled together on one node is a more efficient use of resources. Now I could run 50 calculations on 10 nodes using 10 separate allocations, but then I run 10 Slurm jobs, which would be prioritised lower than a single Slurm job running on 10 nodes. Hence the motivation to further bundle jobs not only on one node but across nodes.
Yes, the code indeed uses the `srun` command. This is the code/binary (https://github.com/lekah/pinball) and the following is the script I use:
```bash
#!/bin/bash
#SBATCH -A mr0 # account
#SBATCH -C mc # constraint
#SBATCH --hint=nomultithread

module load PrgEnv-intel
module use /users/tthakur/easybuild/eiger/modules/all/Toolchain/cpeIntel/21.08/
module load QuantumESPRESSO/5.2-PINBALL-eiger-202109

export OMP_NUM_THREADS="1"

srun --cpu-bind=map_cpu:$HQ_CPUS '-s' '-n' '32' '--mem' '50000' '/users/tthakur/easybuild/eiger/software/QuantumESPRESSO/5.2-PINBALL-eiger-202109-cpeIntel-21.08/pw.x' '-in' 'aiida.in' > 'aiida.out'
```
And I use this command to submit the script above:

```bash
hq submit --name="aiida-489044" --stdout=_scheduler-stdout.txt --stderr=_scheduler-stderr.txt --time-request=43100s --time-limit=43100s --cpus=32 ./_submit_script.sh
```
> The reason for running those jobs together is that they don't scale well with the number of CPU cores in the cluster. Hence running 4 or 5 of those calculations bundled together on one node is a more efficient use of resources. Now I could run 50 calculations on 10 nodes using 10 separate allocations, but then I run 10 Slurm jobs, which would be prioritised lower than a single Slurm job running on 10 nodes. Hence the motivation to further bundle jobs not only on one node but across nodes.
I see, now I understand. That is indeed a valid use case for multi-node allocations.
The fact that your script uses `srun` is causing the problem. It only happens with multi-node allocations, because for these HQ itself uses `srun` to start the HQ worker on each allocated node.

Now, I suppose that it would be possible to make these two `srun`s compatible, and we should probably change our usage of `srun` in a way that makes it possible to also use `srun` inside the submitted script. On the other hand, I would like to understand more about your use case, because running `srun` inside a script submitted to HyperQueue seems like an anti-pattern to me, but maybe it's just a use case that we haven't encountered yet.

I'm not that familiar with Slurm, but it seems to me that you execute your program 32 times with `srun`, and each copy uses a single core? A few comments:

- If you really want to run many independent copies of the program, you could consider using a HQ task array instead of a single task.
- You could use `mpirun` or something similar to execute your program, instead of `srun`. Running single-node MPI code inside a HQ task is definitely a valid use case and we would like to learn more about how HQ users do this, to make HQ better prepared for it.
> The fact that your script uses `srun` is causing the problem. It only happens with multi-node allocations, because for these HQ itself uses `srun` to start the HQ worker on each allocated node.
Yes, this does sound like a probable cause.
> I'm not that familiar with Slurm, but it seems to me that you execute your program 32 times with `srun`, and each copy uses a single core?
Actually no, there is a single instance running on 32 cores using a single invocation of the `srun` command. So I think using a task array is probably not required.
> Now, I suppose that it would be possible to make these two `srun`s compatible, and we should probably change our usage of `srun` in a way that makes it possible to also use `srun` inside the submitted script.
This would be super helpful. I will also try to find out whether we could use some other command like `mpirun` in our scripts to check if that solves the problem.
> On the other hand, I would like to understand more about your use case, because running `srun` inside a script submitted to HyperQueue seems like an anti-pattern to me, but maybe it's just a use case that we haven't encountered yet.
So we normally run our calculations using `srun` and I didn't change anything in my scripts. I wanted to bundle these calculations together. Since I only have a basic understanding of HQ from the documentation and I don't know how it works under the hood, I didn't think of making any changes to my submission script to make it more compatible with HQ.
> So we normally run our calculations using `srun` and I didn't change anything in my scripts. I wanted to bundle these calculations together. Since I only have a basic understanding of HQ from the documentation and I don't know how it works under the hood, I didn't think of making any changes to my submission script to make it more compatible with HQ.
For simple use cases, it shouldn't be necessary to alter the script that is submitted to HQ when switching e.g. from PBS/Slurm. However, the usage of `srun` within the script here is a bit problematic. I will try to change how `srun` is used by HQ to resolve the `srun` clash, but I'm still unsure why you need `srun` at all.
> Actually no, there is a single instance running on 32 cores using a single invocation of the `srun` command. So I think using a task array is probably not required.
This description does not correspond to your script though. If you run `srun -n 32 <program>`, Slurm will execute your program 32 times.
I tried it on a Slurm cluster:
```
$ srun -n4 hostname
r250n04
r250n04
r250n03
r250n03
```
Using `-n4` executed four copies of `hostname`.
So what your script basically does is that it runs 32 copies of `pw.x`, with each copy bound to a single core. I assume that `pw.x` uses MPI internally (?) so that it looks like a single program execution to you. Furthermore, I think that `srun` will run these 32 copies interspersed on all nodes allocated by the current Slurm allocation, which might not be what you want (I suppose that you only want to run them on a single node).
I experimented with this a bit more on a Slurm cluster. Here's what I found:
1) If I run some "background" process using `srun` (which is what HQ does currently), and then I run something like what your script does (`srun -s ...`), then the second `srun` works, but it also prints the `step creation temporarily disabled, retrying (Requested nodes are busy)` warning. I'm not sure if this is what happens for you, maybe your code is not launched at all (?).
2) If I run some background process using `srun --overlap` and then I run `srun -s`, then it runs and does not print any warning. `--overlap` should make sure that the thing that we run (HQ workers in our case) will share resources with all other tasks. Hopefully this should solve your issue.
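In shell terms, the difference between the two cases looks roughly like this (simplified; the real worker command that HQ uses has more options):

```bash
# Case 1: worker started in the background without --overlap (what HQ did here);
# a later `srun -s ...` from a task then waits with
# "step creation temporarily disabled, retrying (Requested nodes are busy)".
srun hq worker start &

# Case 2: worker started with --overlap; a later `srun -s ...` from a task
# can create its job step immediately, without the warning.
srun --overlap hq worker start &
```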
I created https://github.com/It4innovations/hyperqueue/pull/448 which should hopefully help with this issue. @tsthakur Can I send you a modified version of HQ with this patch applied, so that you could test if it helps? What's the CPU architecture of your cluster, x86?
> This description does not correspond to your script though. If you run `srun -n 32`, Slurm will execute your program 32 times.
You are probably right; I am not sure how `pw.x` works internally, but it is possible that there are 32 copies running, each bound to a single core. I will get back to you on that soon.
> I'm not sure if this is what happens for you, maybe your code is not launched at all (?).

Yes, for me the code does not launch at all.
> Can I send you a modified version of HQ with this patch applied, so that you could test if it helps? What's the CPU architecture of your cluster, x86?

Yes please, that would be very helpful. The architecture is x64.
Hi @Kobzol
So on the CSCS cluster where I am running, they ask us to use the `srun` command: https://confluence.cscs.ch/pages/viewpage.action?pageId=284426490#Alps(Eiger)UserGuide-RunningJobs
On other clusters, people would normally use `mpirun -np 32 pw.x`. So for now, we are stuck with using `srun`. I will try using the `--overlap` option and see if that helps.
> So on the CSCS cluster where I am running, they ask us to use the `srun` command
I see.
If you're not scared of running random binaries :) I built a version of HQ which uses `--overlap` for starting the workers. I tried to use an old version of glibc for the build to make it as compatible as possible.

hq.tar.gz
So I tried the new binary, and I still get the same error:

```
srun: Job 771737 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 771737 step creation still disabled, retrying (Requested nodes are busy)
```
And I still see that nothing is running on the occupied nodes.
Does this also happen if you run your own `srun` with `--overlap`?
Yes. I tried running with the `--overlap` flag on my `srun` with the old v0.11 and the new binary you sent. In both cases, I got the same error.
I see. Looks like the Slurm on your cluster is set up in a way that even tasks started with `srun --overlap` take resources (?). I'm not very familiar with Slurm, so I'm not sure how to resolve this (if someone knows a better way, I'd be glad to know!). I'll try to think of other ways of starting the HQ workers in Slurm allocations so that we don't have to use `srun`.
> Looks like the Slurm on your cluster is set up in a way that even tasks started with `srun --overlap` take resources
Yes, that seems a likely scenario. Is there a way to confirm this, like by using error logs or something? I see that the `hq log` commands require a log file as input? I do not fully understand how that works. And in the `stdout` and `stderr` logs I don't see any conclusive information.
I suppose that the warning that you get (`step creation temporarily disabled`) is enough of a confirmation that Slurm doesn't allow running additional `srun`, even with the `--overlap` flag.
`hq log` serves for working with HQ streaming data (maybe we should rename it, as it seems to produce some confusion :)).
I did encounter the same issue on the same machine (I am from the same group as @tsthakur and @ramirezfranciscof). I work around it by using the `--overlap` flag when submitting to the hq worker. Meanwhile, I put `-s` (oversubscribing) on the `hq alloc` queue for the worker.

I just submitted ~1000 HQ jobs and will report back tomorrow.
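Roughly, the workaround looks like this (a sketch only; the partition name and the `srun` arguments below are placeholders, not the exact commands we use):

```bash
# 1) Pass -s (--oversubscribe) through to sbatch when creating the allocation queue
#    (everything after `--` is forwarded to sbatch; partition is a placeholder):
hq alloc add slurm --time-limit 1h -- -s --partition=normal

# 2) Add --overlap to the srun call inside the script submitted to HQ:
srun --overlap --cpu-bind=map_cpu:$HQ_CPUS -n 32 pw.x -in aiida.in > aiida.out
```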
@tsthakur if you still use `aiida-hyperqueue`, you can try with my branch at https://github.com/aiidateam/aiida-hyperqueue/pull/18
What version of HyperQueue are you using? For about two years now, HQ has deployed workers in Slurm multi-node allocations using `srun --overlap`.
I use 0.19.0 now. Meanwhile, I encounter the issue not with multiple workers per allocation (sorry for chiming in on this issue and bringing confusion): I have 1 worker from 1 allocation with 1 node, which I assume is the standard use case of HQ.
In general I think using `srun` to run MPI inside `srun` might be the source of the problem, as you @Kobzol pointed out before. We use cray-MPICH and it only supports `srun` for launching MPI in parallel. If I understand correctly, the first `srun` of `hq alloc` (called by `sbatch`, I assume?) is a standard Slurm `srun` that asks for nodes, while the `srun` inside the job script is an alternative to the regular `mpirun`.
If you configure `hq alloc` to use single-node allocations (the default), then no `srun` will be used. HQ will use `sbatch` to submit a Slurm allocation, and then simply run `hq worker start` (without `srun`) in the submitted bash script.
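Schematically, the submitted script looks something like this (a simplification; the actual script generated by `hq alloc` contains more directives and bookkeeping):

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=01:00:00
# (the time above comes from the queue's --time-limit)

# The worker is started directly in the allocation, without srun:
hq worker start
```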
If you're encountering a problem that is unrelated to multiple workers per allocation, please open a new issue to avoid piggybacking on this one :) Thanks!
So from what I understood, `--workers-per-alloc <worker_count>` is used to run multiple workers on multiple nodes, but it doesn't seem to behave that way. For example, if I want to run 5 calculations on one node, I use an automatic allocation with 1 worker per allocation, which then launches 5 jobs (1 task per job) on 1 node. So this is working as expected. Now if I launch an allocation with 2 workers per allocation, I was expecting that my 10 calculations would run on 2 nodes, with each node having its own worker. But what happens is that 2 workers occupy 2 nodes, yet none of the 10 calculations launches, giving the following error:

```
srun: Job 748536 step creation temporarily disabled, retrying (Requested nodes are busy)
```

Please note that the nodes are not in fact busy, as there is only one 'big' Slurm job running on these 2 nodes, and there is no process running on either node if I check with `top` or `htop`.

I may be completely misunderstanding the purpose of this `--workers-per-alloc <worker_count>` option. In which case, would it be possible to do what I am trying to do in some other way? Or is it a use case that is not planned to be supported?