NoelAraujo closed this issue 3 years ago
Not certain, but I think this might be a limitation of Singularity / SLURM, and not of ClusterManagers.jl.
I recently tried to use a Python package that submits SLURM jobs from inside a Singularity container. Long story short, the cluster environment was set up in such a way that the container could not see the components it needed in order to actually call sbatch or srun.
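One way to check this on a given cluster is to look for the scheduler commands on PATH, first on the host and then inside the container. The helper below is only a sketch: the function name check_commands is made up for illustration, and it only tests PATH visibility, not whether the commands can reach the SLURM daemons.

```shell
#!/bin/sh
# Report which of the given commands are visible on PATH in the current
# environment. Prints one line per command and returns non-zero if any
# of them is missing. (check_commands is a hypothetical helper name.)
check_commands() {
    missing=0
    for cmd in "$@"; do
        if location=$(command -v "$cmd" 2>/dev/null); then
            printf '%s: %s\n' "$cmd" "$location"
        else
            printf '%s: not found\n' "$cmd"
            missing=1
        fi
    done
    return "$missing"
}
```

On the host you would call check_commands sbatch srun. For the container, something like singularity exec work.simg sh -c 'command -v sbatch; command -v srun' performs the same lookup from inside the image.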
(...)
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --partition=slow
(...)
srun singularity exec \
--bind=/scratch:/scratch \
--bind=/var/spool/slurm:/var/spool/slurm \
work.simg /opt/julia/bin/julia julia_parallel_test.jl
Wait, are you calling srun from inside your sbatch script? That seems a bit weird - typically, your script would have the #SBATCH lines and then be normal shell, so e.g. you'd have my_script.sh:
(...)
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --partition=slow
singularity exec \
--bind=/scratch:/scratch \
--bind=/var/spool/slurm:/var/spool/slurm \
work.simg /opt/julia/bin/julia julia_parallel_test.jl
and then run sbatch my_script.sh
Or alternatively,
srun --nodes 1 --ntasks-per-node=1 --cpus-per-task=2 --partition=slow \
singularity exec \
--bind=/scratch:/scratch \
--bind=/var/spool/slurm:/var/spool/slurm \
work.simg /opt/julia/bin/julia julia_parallel_test.jl
OK, no easy solution. It looks like we have to sit and wait until the Singularity community improves its features.
Regarding srun singularity exec \, it was not my idea; I just used an example provided by the cluster staff.
> I just used an example provided by the cluster staff.
Interesting. Is your julia script (julia_parallel_test.jl) also trying to spawn srun commands? Seems like nesting calls to the scheduler could be part of the problem, but it's hard to know without knowing more about the cluster architecture.
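To see how deeply a process is already nested in the scheduler, one can inspect the environment variables SLURM sets: SLURM_JOB_ID inside an allocation and SLURM_STEP_ID inside a job step started by srun. The function below is a minimal sketch (slurm_context is a made-up name); it assumes these variables reach the process, which may not hold if the container is started with a cleaned environment.

```shell
#!/bin/sh
# Describe the SLURM context of the current process based on the
# environment variables SLURM_JOB_ID and SLURM_STEP_ID.
# (slurm_context is a hypothetical helper name.)
slurm_context() {
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "outside any SLURM allocation"
    elif [ -z "${SLURM_STEP_ID:-}" ]; then
        echo "inside allocation ${SLURM_JOB_ID}, not in an srun step"
    else
        echo "inside allocation ${SLURM_JOB_ID}, srun step ${SLURM_STEP_ID}"
    fi
}
```

Calling slurm_context at the top of the batch script, and again from a shell inside the container, shows whether the Julia process would be issuing srun from within an existing srun step.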
But based on my experience, I think this is a limitation of Singularity rather than of ClusterManagers.jl, so I'll close this for now. I'm happy to keep the conversation going on Discourse, Slack or Zulip; if you want to start a thread, feel free to tag me.
I tried to run a Singularity container on an HPC cluster; my goal is to run across many nodes. However, I cannot even create one process. The lines below are how I submit my job:
work.simg is my Singularity image; on my own computer I am sure that everything works fine.
/opt/julia/bin/julia is the executable path inside the image.
julia_parallel_test.jl is the simplest code I could come up with:
The message that concerns us is here:
The complete error is at the end.
What I think is happening is that ClusterManagers.jl wants to run /opt/julia/bin/julia on the node, and not in my Singularity image. First of all, am I right? Second, does anybody have a simple solution?
Thank you for your attention.
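If that hypothesis is right, one workaround to try is a small host-side wrapper that forwards its arguments to the julia binary inside the image, so that whatever launches workers ends up inside the container. The sketch below is untested on a real cluster and rests on several assumptions: the file name julia_in_container.sh, the IMAGE, JULIA_IN_IMAGE and DRY_RUN variables are made up for illustration, the bind paths are copied from the submission script in this thread, and it assumes the worker launch accepts a custom executable (for example through the exename keyword of Julia's addprocs). It does not help if srun itself is not visible from where the master process runs.

```shell
#!/bin/sh
# Hypothetical wrapper, e.g. saved as julia_in_container.sh on a filesystem
# that every node can see. It runs the julia binary inside the image and
# forwards all arguments it receives.

# Assumed defaults, taken from the submission script in this thread.
IMAGE="${IMAGE:-work.simg}"
JULIA_IN_IMAGE="${JULIA_IN_IMAGE:-/opt/julia/bin/julia}"

run_julia_in_container() {
    # Build the full command line in the positional parameters.
    set -- singularity exec \
        --bind=/scratch:/scratch \
        --bind=/var/spool/slurm:/var/spool/slurm \
        "$IMAGE" "$JULIA_IN_IMAGE" "$@"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Dry run: print the command instead of executing it.
        echo "$@"
    else
        # Replace this shell with the containerized julia process.
        exec "$@"
    fi
}

# In the actual wrapper file, the last line would be:
#   run_julia_in_container "$@"
```

Running the function with DRY_RUN=1 prints the command it would execute, which is a cheap way to check the paths before submitting a job.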