Open Flamefire opened 8 months ago
Note that not all sites allow SSH between nodes in a Slurm job (ours doesn't).
That was just an example. Both methods seem to work; I had the 2nd in use while our nodes were being updated and didn't allow SSH yet. And only the 2nd can reasonably be done in EasyBuild.
Makes sense to me, and it probably makes sense to implement this in EasyBuild 5.0 (although it shouldn't actually break anything, only fix builds that break/hang because they're running in a Slurm environment which may cause trouble with MPI).
We briefly discussed this during the EasyBuild conf call today, and the general consensus seemed to be that this should be made opt-in rather than opt-out (which makes sense to me).
Some software tests fail when run inside a SLURM job, e.g. those of OpenMPI, which run `mpirun`: it picks up the SLURM job it is running in and fails as the resulting configuration doesn't match what the job is expecting.

I have 2 workarounds in my EB wrapper script:

1. Restart the current script via `ssh` if run from inside a SLURM job, assuming only 1 node.
2. Remove all `SLURM_*` variables from the current environment.

As the issue is a common pitfall with EB, and given how easy the 2nd variant is to implement in EB via `os.environ`, I'd suggest doing this by default, possibly with a `--no-cleanup-slurm-env` option to opt out.
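
For illustration, the 2nd variant could look roughly like the sketch below. This is only a minimal sketch and not EasyBuild's actual code: the function name `strip_slurm_env` is hypothetical, and it assumes that dropping every variable whose name starts with `SLURM_` is enough to stop `mpirun` from detecting the surrounding job.

```python
import os


def strip_slurm_env(env=None):
    """Remove all SLURM_* variables from env (defaults to os.environ).

    Hypothetical helper, not part of EasyBuild. Returns the removed
    variables so the caller can log them or restore them later.
    """
    if env is None:
        env = os.environ
    removed = {}
    # Copy the keys first: the mapping must not change size while we iterate it.
    for name in list(env):
        if name.startswith("SLURM_"):
            removed[name] = env.pop(name)
    return removed
```

Calling this once before any build or test command is started should be sufficient, since child processes launched afterwards inherit the cleaned `os.environ` by default.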