easybuilders / easybuild-framework

EasyBuild is a software installation framework in Python that allows you to install software in a structured and robust way.
https://easybuild.io
GNU General Public License v2.0

Clean up SLURM/Batchsystem environment before doing builds #4434

Open Flamefire opened 8 months ago

Flamefire commented 8 months ago

Some software tests fail when run inside a SLURM job. For example, the OpenMPI tests call mpirun, which picks up the SLURM job it is running in and fails because the resulting configuration doesn't match what the job expects.

I have 2 workarounds in my EB wrapper script:

  # Inside a SLURM job: re-run this script in a fresh login shell via ssh
  if [[ -n "${SLURM_NODELIST:-}" ]]; then
    ssh "$SLURM_NODELIST" bash -l "$0"
    exit $?
  fi

This basically restarts the current script via ssh if it is run from inside a SLURM job, assuming the job has only 1 node.

  # Unset every SLURM_* variable in the current shell
  for i in $(env | grep '^SLURM_' | cut -f1 -d=); do
    unset "$i"
  done

This removes all SLURM_* variables from the current environment.

As this issue is a common pitfall with EB, and given how easy the 2nd variant is to implement in EB via os.environ, I'd suggest doing this by default, possibly with a --no-cleanup-slurm-env option to opt out.
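
For illustration, the 2nd workaround could look roughly like this in Python. This is only a sketch, not EasyBuild code: the function name `remove_slurm_vars` and its parameters are made up for this example, and it only assumes that the variables to drop all start with `SLURM_`, as in the shell loop above.

```python
import os


def remove_slurm_vars(environ=None, prefix="SLURM_"):
    """Delete every variable whose name starts with `prefix`.

    Hypothetical helper (not part of EasyBuild). Operates on os.environ
    unless another mapping is passed in; returns the removed names, sorted.
    """
    if environ is None:
        environ = os.environ
    # Collect the names first so the mapping is not modified while iterating.
    removed = sorted(name for name in environ if name.startswith(prefix))
    for name in removed:
        del environ[name]
    return removed


if __name__ == "__main__":
    # Example on a plain dict, so the real environment is left untouched.
    env = {"SLURM_JOB_ID": "1", "SLURM_NODELIST": "n1", "PATH": "/bin"}
    print(remove_slurm_vars(env))
    print(env)
```

Deleting keys from os.environ also removes them from the process environment, so child processes such as mpirun started afterwards would no longer see them.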

akesandgren commented 8 months ago

Note that not all sites allow ssh between nodes in a Slurm job; ours doesn't.

Flamefire commented 8 months ago

That was just an example. Both methods seem to work; I had the 2nd in use while our nodes were updated and didn't allow SSH yet. And only the 2nd can reasonably be done in EasyBuild.

boegel commented 8 months ago

Makes sense to me, and it probably makes sense to implement this in EasyBuild 5.0 (although it shouldn't actually break anything, only fix builds that break/hang because they're running in a Slurm environment which may cause trouble with MPI).

boegel commented 7 months ago

We briefly discussed this during the EasyBuild conf call today, and the general consensus seemed to be that this should be made opt-in rather than opt-out (which makes sense to me).
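
A rough sketch of what the opt-in behaviour might look like, using plain argparse rather than EasyBuild's actual option handling. The flag name `--cleanup-slurm-env` and the helpers `parse_args` and `prepare_env` are assumptions for illustration only; nothing here is an existing EasyBuild option.

```python
import argparse


def parse_args(argv):
    """Parse a hypothetical opt-in flag (off unless explicitly given)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--cleanup-slurm-env",
        action="store_true",
        help="remove SLURM_* variables from the environment before building",
    )
    return parser.parse_args(argv)


def prepare_env(environ, cleanup):
    """Drop SLURM_* entries from `environ` only when cleanup was requested."""
    if cleanup:
        # Build the list of names first, then delete, to avoid mutating
        # the mapping while iterating over it.
        for name in [n for n in environ if n.startswith("SLURM_")]:
            del environ[name]
    return environ


if __name__ == "__main__":
    opts = parse_args(["--cleanup-slurm-env"])
    env = {"SLURM_JOB_ID": "1", "HOME": "/home/user"}
    print(prepare_env(env, opts.cleanup_slurm_env))
```

With `store_true`, the environment is left untouched unless the flag is passed, which matches the opt-in direction from the conf call.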