wtraylor opened this issue 4 years ago
Thanks for opening this issue @wtraylor. This helps us understand different Slurm environments better. We can change the default to building images on the login node and give an option to use a compute node instead (if Singularity is not present on the login node). What do you think @wtraylor and @ivotron?
We could also make use of the `--skip-pull` flag to indicate that the Slurm runner should not pull images when it executes.
My hunch is that it is more common to prepare everything for an experiment on the login node, while the actual execution happens on the computing nodes. For example, I would compile my application on the login node. So from my perspective it would be more intuitive to also prepare the container image on the login node.
But I am not a very seasoned HPC user.
Another pattern I've heard from people working in HPC scenarios is that they have no network connectivity to the outside world from the Slurm cluster at all, not even on the login node, so they need to `scp` images to the login node and then run from there. In these scenarios, `--skip-pull` and `--skip-clone` would help. Maybe we also need a `--skip-build`?
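As a rough sketch of that air-gapped pattern (the image, paths, and workflow file name below are made up; only `--skip-pull` and `--skip-clone` are the flags being discussed):

```sh
# On a machine that does have internet access: pull the image once.
singularity pull alpine_3.9.sif docker://alpine:3.9

# Copy it into the cluster's shared filesystem.
scp alpine_3.9.sif user@cluster-login:/shared/project/images/

# On the login node: run without fetching anything from the outside world.
popper run -f wf.yml --skip-pull --skip-clone
```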
So for this issue, do we agree that we can do this:

> We can change to building images on the login node by default and give an option to use the compute node (if singularity is not present on the login node). What do you think @wtraylor and @ivotron?

Given what @wtraylor mentioned above, this would address the issue, right?
> Maybe we also need a --skip-build?

Since building an image typically involves installing software within that image (which itself needs network access), I think skipping the build would be required, too.
Sorry, I didn't explain well. What I had in mind was running the entire workflow first on the frontend node once, so that it builds the containers, but in single-node mode (i.e. just doing `popper run` with no `-r` flag), and then subsequently running `popper run --skip-pull -r slurm`. But I agree that it would be better to build containers locally rather than on each node, and to control that behavior via the config file.
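As a sketch, that two-step approach would look something like this (the workflow file name is only a placeholder):

```sh
# Step 1: run once on the login/frontend node in single-node mode, so the
# container images get built/pulled there.
popper run -f wf.yml

# Step 2: run through Slurm, reusing the images from step 1 instead of
# pulling them again.
popper run -f wf.yml --skip-pull -r slurm
```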
Also, since the folder one uses on the login node is shared with all the nodes, building the same image again on each node is redundant.
Yeah, building redundantly on multiple nodes is wrong. We can make changes to have two modes controlled through the config: i) build on the login node (like in local mode, without `srun`); ii) build on a single compute node (through `srun`). What do you think @ivotron?
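For example, the config file passed to `popper run` could grow an option along these lines (this is only a sketch: `build_on_login_node` is a hypothetical key name, and the `engine`/`resman` layout is assumed to stay as it is today):

```yaml
# Hypothetical sketch of a Popper config file with the proposed option.
engine:
  name: singularity
resman:
  name: slurm
  options:
    build_on_login_node: true  # mode i); false would build on one compute node via srun
```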
Yeah, that sounds good. I'd go further and say to not implement ii) until users request it.
On the computer cluster I am using (Goethe HLR), only the login node has internet access, not the computing nodes. Therefore, building or pulling a Singularity image on the computing node does not work.
I run the workflow like this:
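Roughly like this, with the Singularity engine configured for the runs (the file name below is just a placeholder):

```sh
# Execute the workflow through the Slurm resource manager.
popper run -f wf.yml -r slurm
```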
The details of the workflow are not relevant, but what's important is that Popper (version 2020.09.01) now tries to execute `singularity pull` through SLURM, using `srun`.
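Illustratively, that amounts to something like the following being launched on a compute node (the exact options and image are not the point; the compute node simply has no route to the registry):

```sh
# Illustration only: a `singularity pull` wrapped in srun lands on a compute
# node, which has no internet access here, so the pull fails.
srun singularity pull docker://alpine:3.9
```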
That happens in src/popper/runner_slurm.py. Apparently, this behavior was introduced in pull request #912.

I don't have a good suggestion. It seems like some people want their Singularity images built on the computing node, and others (like me) on the login node.