Hi :) The `--workers-per-alloc` flag specifies the maximum number of nodes that will be requested for a Slurm/PBS allocation. Unless you want to do something with multiple nodes, such as running MPI code (which is not yet properly supported in HQ), or there is some other reason why you want multi-node allocations, you shouldn't really need to use this flag.

That being said, if you do use it, it should enqueue a Slurm allocation with 2 nodes and then create a worker on each of those nodes. It seems that in your situation the second step (creating the workers) has failed. I'm not really sure why; I will try to investigate the error message that you have posted.
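For reference, such a multi-node allocation queue is created roughly like this (a sketch; the time limit, partition and account below are placeholders):

```bash
# Allocation queue whose allocations request up to 2 nodes,
# with one HQ worker started on each node (placeholder Slurm arguments).
hq alloc add slurm --time-limit 1h --workers-per-alloc 2 -- --partition=normal --account=myproject
```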
> Hi :) The `--workers-per-alloc` flag specifies the maximum number of nodes that will be requested for a Slurm/PBS allocation. Unless you want to do something with multiple nodes, such as running MPI code (which is not yet properly supported in HQ), or there is some other reason why you want multi-node allocations, you shouldn't really need to use this flag.
Hey @Kobzol, just for clarification: do you mean that running MPI is not supported across multiple nodes (i.e. worker communication), or is it not supported on multiple cores of a single node either? I have been trying to use it for this and came across some problems (like cancelling jobs not cancelling the run of the underlying script-wrapped executable), but I'm not sure if it would make sense to report them if you don't support the use case yet.
I was talking about multi-node MPI specifically. While there's nothing stopping you from using multi-node MPI, currently you can run each task on a single node only. So HyperQueue cannot guarantee that you will get multiple nodes available for the task at the same time. There is support for multi-node tasks, but it's currently heavily experimental and WIP.
Using MPI on a single node should be completely fine, you can just say how many cores your task needs.
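For example, something like the following should work (illustrative only; the core count, script name and program are placeholders):

```bash
# Submit a task that reserves 8 cores of a single worker (node).
# run_mpi.sh would then call e.g. `mpirun -np 8 ./my_mpi_program`.
hq submit --cpus=8 ./run_mpi.sh
```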
Regarding the issue that tasks may not be killed properly when a task is cancelled or a worker is killed, we are aware of it (https://github.com/It4innovations/hyperqueue/issues/431).
In general, please report any issues that you find (unless they are a strict duplicate of some existing issue) :)
Hello @Kobzol , thank you for the explanation.
I wouldn't need to use this flag normally. But at the Swiss supercomputing centre CSCS, the scheduling priority is the same whether a job uses 1 node or 10. So launching 50 jobs on 10 nodes will be a much more efficient use of computational resources. Each of these jobs, and consequently the calculations, will only run on one node; I was simply trying to bundle them together.
In this case, the workers are created and hence the nodes are occupied, and the jobs are also shown as running, but the code which I use to run the calculation outputs the error I posted and somehow fails to recognise the way HQ allocates resources. From what I understood, the code shouldn't even know whether there is one node or several, since HQ is handling the connection with the cluster.
Does this specific use case make sense or is it something that HQ is not designed to achieve? Thanks!
> So launching 50 jobs on 10 nodes will be a much more efficient use of computational resources.
Is there any specific motivation for this? Unless you overload Slurm with thousands of allocations, or you hit some Slurm permission limit for the number of allocations in the queue, it should be fine to have 50 allocations, each with a single node. To HyperQueue, it won't be any different than e.g. having just 5 allocations, each with 10 nodes.
> Does this specific use case make sense or is it something that HQ is not designed to achieve?
The use case seems perfectly fine :) Could you maybe share the code that you're running (i.e. the script that you're submitting to HyperQueue)? By any chance, does your code invoke `srun`?
The reason for running those jobs together is that they don't scale well with the number of CPU cores in the cluster. Hence running 4 or 5 of those calculations bundled together on one node is a more efficient use of resources. Now I could run 50 calculations on 10 nodes using 10 separate allocations, but then I run 10 Slurm jobs, which would be prioritised lower than a single Slurm job running on 10 nodes. Hence the motivation to further bundle jobs not only on one node but across nodes.
Yes, the code indeed uses the `srun` command. This is the code/binary (https://github.com/lekah/pinball) and the following is the script I use:
```bash
#!/bin/bash
#SBATCH -A mr0 # account
#SBATCH -C mc # constraint
#SBATCH --hint=nomultithread

module load PrgEnv-intel
module use /users/tthakur/easybuild/eiger/modules/all/Toolchain/cpeIntel/21.08/
module load QuantumESPRESSO/5.2-PINBALL-eiger-202109

export OMP_NUM_THREADS="1"

srun --cpu-bind=map_cpu:$HQ_CPUS '-s' '-n' '32' '--mem' '50000' '/users/tthakur/easybuild/eiger/software/QuantumESPRESSO/5.2-PINBALL-eiger-202109-cpeIntel-21.08/pw.x' '-in' 'aiida.in' > 'aiida.out'
```
And I use this command to submit the script above:

```bash
hq submit --name="aiida-489044" --stdout=_scheduler-stdout.txt --stderr=_scheduler-stderr.txt --time-request=43100s --time-limit=43100s --cpus=32 ./_submit_script.sh
```
> The reason for running those jobs together is that they don't scale well with the number of CPU cores in the cluster. Hence running 4 or 5 of those calculations bundled together on one node is a more efficient use of resources. Now I could run 50 calculations on 10 nodes using 10 separate allocations, but then I run 10 Slurm jobs, which would be prioritised lower than a single Slurm job running on 10 nodes. Hence the motivation to further bundle jobs not only on one node but across nodes.
I see, now I understand. That is indeed a valid use case for multi-node allocations.
The fact that your script uses `srun` is causing the problem. It only happens with multi-node allocations, because for these HQ itself uses `srun` to start the HQ worker on each allocated node.

Now, I suppose that it would be possible to make these two `srun`s compatible, and we should probably change our usage of `srun` in a way that makes it possible to also use `srun` inside the submitted script. On the other hand, I would like to understand more about your use case, because running `srun` inside a script submitted to HyperQueue seems like an anti-pattern to me, but maybe it's just a use case that we haven't encountered yet.

I'm not that familiar with Slurm, but it seems to me that you execute your program 32 times with `srun`, and each copy uses a single core? A few comments:

- If you really want to run many independent copies of the program, you could consider using a HQ task array instead of a single task.
- You could use `mpirun` or something similar to execute your program, instead of `srun`. Running single-node MPI code inside a HQ task is definitely a valid use case and we would like to learn more about how HQ users do this, to make HQ better prepared for it.
> The fact that your script uses `srun` is causing the problem. It only happens with multi-node allocations, because for these HQ itself uses `srun` to start the HQ worker on each allocated node.
Yes, this does sound like a probable cause.
> I'm not that familiar with Slurm, but it seems to me that you execute your program 32 times with `srun`, and each copy uses a single core?
Actually no, there is a single instance running on 32 cores using a single invocation of the `srun` command. So I think using a task array is probably not required.
> Now, I suppose that it would be possible to make these two `srun`s compatible, and we should probably change our usage of `srun` in a way that makes it possible to also use `srun` inside the submitted script.
This would be super helpful. I will also try to find out whether we could use some other command like `mpirun` in our scripts to check if that solves the problem.
> On the other hand, I would like to understand more about your use case, because running `srun` inside a script submitted to HyperQueue seems like an anti-pattern to me, but maybe it's just a use case that we haven't encountered yet.
So we normally run our calculations using `srun` and I didn't change anything in my scripts. I wanted to bundle these calculations together. Since I only have a basic understanding of HQ from the documentation and I don't know how it works under the hood, I didn't think of making any changes to my submission script to make it more compatible with HQ.
> So we normally run our calculations using `srun` and I didn't change anything in my scripts. I wanted to bundle these calculations together. Since I only have a basic understanding of HQ from the documentation and I don't know how it works under the hood, I didn't think of making any changes to my submission script to make it more compatible with HQ.
For simple use cases, it shouldn't be necessary to alter the script that is submitted to HQ when switching e.g. from PBS/Slurm. However, the usage of `srun` within the script here is a bit problematic. I will try to change how `srun` is used by HQ to resolve the `srun` clash, but I'm still unsure why you need `srun` at all.
> Actually no, there is a single instance running on 32 cores using a single invocation of the `srun` command. So I think using a task array is probably not required.
This description does not correspond to your script though. If you run `srun -n 32 <program>`, Slurm will execute your program 32 times.
I tried it on a Slurm cluster:
```
$ srun -n4 hostname
r250n04
r250n04
r250n03
r250n03
```
Using `-n4` executed four copies of `hostname`.
So what your script basically does is that it runs 32 copies of `pw.x`, with each copy bound to a single core. I assume that `pw.x` uses MPI internally (?) so that it looks like a single program execution to you. Furthermore, I think that `srun` will run these 32 copies interspersed on all nodes allocated by the current Slurm allocation, which might not be what you want (I suppose that you only want to run them on a single node).
I experimented with this a bit more on a Slurm cluster. Here's what I found:
1) If I run some "background" process using `srun` (which is what HQ does currently), and then I run something like what your script does (`srun -s ...`), then the second `srun` works, but it also prints the `step creation temporarily disabled, retrying (Requested nodes are busy)` warning. I'm not sure if this is what happens for you, maybe your code is not launched at all (?).
2) If I run some background process using `srun --overlap` and then I run `srun -s`, then it runs and does not print any warning. `--overlap` should make sure that the thing that we run (HQ workers in our case) will share resources with all other tasks. Hopefully this should solve your issue.
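In shell terms, the difference between the two cases looks roughly like this (simplified; the real worker command that HQ uses has more options):

```bash
# Case 1: worker started in the background without --overlap (what HQ did here);
# a later `srun -s ...` from a task then waits with
# "step creation temporarily disabled, retrying (Requested nodes are busy)".
srun hq worker start &

# Case 2: worker started with --overlap; a later `srun -s ...` from a task
# can create its job step immediately, without the warning.
srun --overlap hq worker start &
```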
I created https://github.com/It4innovations/hyperqueue/pull/448 which should hopefully help with this issue. @tsthakur Can I send you a modified version of HQ with this patch applied, so that you could test if it helps? What's the CPU architecture of your cluster, x86?
> This description does not correspond to your script though. If you run `srun -n 32`, Slurm will execute your program 32 times.
You are probably right; I am not sure how `pw.x` works internally, but it is possible that there are 32 copies running, each bound to a single core. I will get back to you on that soon.
> I'm not sure if this is what happens for you, maybe your code is not launched at all (?).

Yes, for me the code does not launch at all.
> Can I send you a modified version of HQ with this patch applied, so that you could test if it helps? What's the CPU architecture of your cluster, x86?

Yes please, that would be very helpful. The architecture is x64.
Hi @Kobzol
So on the CSCS cluster where I am running, they ask us to use the `srun` command: https://confluence.cscs.ch/pages/viewpage.action?pageId=284426490#Alps(Eiger)UserGuide-RunningJobs
On other clusters, people would normally use `mpirun -np 32 pw.x`. So for now, we are stuck with using `srun`. I will try using the `--overlap` option and see if that helps.
> So on the CSCS cluster where I am running, they ask us to use the `srun` command
I see.
If you're not scared of running random binaries :) I built a version of HQ which uses `--overlap` for starting the workers. I tried to use an old version of glibc for the build to make it as compatible as possible.

hq.tar.gz
So I tried the new binary, and I still get the same error:

```
srun: Job 771737 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 771737 step creation still disabled, retrying (Requested nodes are busy)
```
And I still see that nothing is running on the occupied nodes.
Does this also happen if you run your own `srun` with `--overlap`?
Yes. I tried running with the `--overlap` flag on my `srun` with the old v0.11 and the new binary you sent. In both cases, I got the same error.
I see. Looks like the Slurm on your cluster is set up in a way that even tasks started with `srun --overlap` take resources (?). I'm not very familiar with Slurm, so I'm not sure how to resolve this (if someone knows a better way, I'd be glad to know!). I'll try to think of other ways of starting the HQ workers in Slurm allocations so that we don't have to use `srun`.
> Looks like the Slurm on your cluster is set up in a way that even tasks started with `srun --overlap` take resources
Yes, that seems a likely scenario. Is there a way to confirm this, like by using error logs or something? I see that the `hq log` commands require a log file as input? I do not fully understand how that works. And in the `stdout` and `stderr` logs I don't see any conclusive information.
I suppose that the warning that you get (`step creation temporarily disabled`) is enough of a confirmation that Slurm doesn't allow running additional `srun`, even with the `--overlap` flag.
`hq log` serves for working with HQ streaming data (maybe we should rename it, as it seems to produce some confusion :)).
I did encounter the same issue on the same machine (I am from the same group as @tsthakur and @ramirezfranciscof). I work around it by using the `--overlap` flag when submitting to the hq worker. Meanwhile, I put `-s` (oversubscribing) on the `hq alloc` queue for the worker.

I just submitted ~1000 HQ jobs and will report back tomorrow.
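Roughly, the workaround looks like this (a sketch only; the partition name and the `srun` arguments below are placeholders, not the exact commands we use):

```bash
# 1) Pass -s (--oversubscribe) through to sbatch when creating the allocation queue
#    (everything after `--` is forwarded to sbatch; partition is a placeholder):
hq alloc add slurm --time-limit 1h -- -s --partition=normal

# 2) Add --overlap to the srun call inside the script submitted to HQ:
srun --overlap --cpu-bind=map_cpu:$HQ_CPUS -n 32 pw.x -in aiida.in > aiida.out
```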
@tsthakur if you still use `aiida-hyperqueue`, you can try with my branch at https://github.com/aiidateam/aiida-hyperqueue/pull/18
What version of HyperQueue are you using? For about two years now, HQ has deployed workers in Slurm multi-node allocations using `srun --overlap`.
I use 0.19.0 now. Meanwhile, I encounter the issue not with multiple workers per allocation (sorry for chiming in on this issue and bringing confusion): I have 1 worker from 1 allocation with 1 node, which I assume is the standard use case of HQ.
In general I think using `srun` to run MPI inside `srun` might be the source of the problem, as you @Kobzol pointed out before. We use cray-MPICH and it only supports `srun` for launching MPI in parallel. If I understand correctly, the first `srun` of `hq alloc` (called by `sbatch`, I assume?) is a standard Slurm `srun` that asks for nodes, while the `srun` inside the job script is an alternative to the regular `mpirun`.
If you configure `hq alloc` to use single-node allocations (the default), then no `srun` will be used. HQ will use `sbatch` to submit a Slurm allocation, and then simply run `hq worker start` (without `srun`) in the submitted bash script.
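Schematically, the submitted script looks something like this (a simplification; the actual script generated by `hq alloc` contains more directives and bookkeeping):

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=01:00:00
# (the time above comes from the queue's --time-limit)

# The worker is started directly in the allocation, without srun:
hq worker start
```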
If you're encountering a problem that is unrelated to multiple workers per allocation, please open a new issue to avoid piggybacking on this one :) Thanks!
So from what I understood, `--workers-per-alloc <worker_count>` is used to run multiple workers on multiple nodes, but it doesn't seem to behave that way. For example, if I want to run 5 calculations on one node, I use an automatic allocation with 1 worker per allocation, which then launches 5 jobs (1 task per job) on 1 node. So this is working as expected. Now if I launch an allocation with 2 workers per allocation, I was expecting that my 10 calculations would run on 2 nodes, with each node having its own worker. But what happens is that 2 workers occupy 2 nodes, yet none of the 10 calculations launches, giving the following error:

```
srun: Job 748536 step creation temporarily disabled, retrying (Requested nodes are busy)
```

Please note that the nodes are not in fact busy, as there is only one 'big' Slurm job running on these 2 nodes, and there is no process running on either node if I check with `top` or `htop`.

I may be completely misunderstanding the purpose of this `--workers-per-alloc <worker_count>` option. In which case, would it be possible to do what I am trying to do in some other way? Or is it a use case that is not planned to be supported?