AgnostiqHQ / covalent-slurm-plugin

Executor plugin interfacing Covalent with Slurm
https://covalent.xyz
Apache License 2.0

Add an option, `use_srun: bool`, that can run the Python function without `srun` #62

Closed Andrew-S-Rosen closed 1 year ago

Andrew-S-Rosen commented 1 year ago

What should we add?

Currently, the Python file is always run with `srun` (potentially with some `srun` options). An example of the generated submission script might look like the following:

#!/bin/bash
#SBATCH --partition=debug
#SBATCH --account=matgen
#SBATCH --constraint=cpu
#SBATCH --job-name=covalent
#SBATCH --time=00:10:00
#SBATCH --parsable
#SBATCH --output=/global/cfs/cdirs/matgen/arosen/covalent10/stdout-7d947ff3-c38a-452c-a5b5-b03555510d5b-0.log
#SBATCH --error=/global/cfs/cdirs/matgen/arosen/covalent10/stderr-7d947ff3-c38a-452c-a5b5-b03555510d5b-0.log

source $HOME/.bashrc
conda activate covalent
retval=$?
if [ $retval -ne 0 ] ; then
  >&2 echo "Conda environment covalent is not present on the compute node. "  "Please create the environment and try again."
  exit 99
fi

remote_py_version=$(python -c "print('.'.join(map(str, __import__('sys').version_info[:2])))")
if [[ "3.9" != $remote_py_version ]] ; then
  >&2 echo "Python version mismatch. Please install Python 3.9 in the compute environment."
  exit 199
fi

srun \
python /global/cfs/cdirs/matgen/arosen/covalent10/script-7d947ff3-c38a-452c-a5b5-b03555510d5b-0.py

wait

However, as originally noted by my colleague @rwexler and confirmed by me, this can be problematic depending on the use case. For instance, in computational materials science, a common usage pattern is to use SLURM to submit a Python job. That Python job then does several things, including launching external codes. This is how the Atomic Simulation Environment (ASE), a tool highlighted in one of the tutorials, works. You can see an example of how that's done in practice here, if you're curious.
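To make that usage pattern concrete, here is a minimal sketch of a Python job that launches an external parallel code on its own. The function names and the `my_dft_code` executable are hypothetical placeholders, not part of ASE or Covalent; the only real pieces are the `srun -n` and `mpirun -np` launcher flags and the standard library calls.

```python
import shlex
import subprocess


def build_launch_command(executable, n_tasks, launcher="srun"):
    """Return the argv list used to start an external parallel code."""
    if launcher == "srun":
        return ["srun", "-n", str(n_tasks), executable]
    if launcher == "mpirun":
        return ["mpirun", "-np", str(n_tasks), executable]
    raise ValueError(f"unknown launcher: {launcher}")


def run_external_code(executable, n_tasks, launcher="srun", dry_run=True):
    """Launch the external code from inside the Python job.

    With dry_run=True the command is only returned as a string,
    so the sketch can be tried without a Slurm allocation.
    """
    argv = build_launch_command(executable, n_tasks, launcher)
    if not dry_run:
        # The Python process itself starts the parallel job step.
        subprocess.run(argv, check=True)
    return shlex.join(argv)


# "my_dft_code" is a hypothetical executable name.
print(run_external_code("my_dft_code", 4))
```

If the Python script that contains this call is itself started with `srun`, the inner `srun` or `mpirun` runs inside an existing job step, which is the nesting at issue here.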

The problem with always using `srun` is that if the pickled Python function at some point launches an external, parallel code with `srun` or `mpirun`, you end up with nested parallelism, which can lead to unintended behavior when running on multiple cores or nodes. We want to give the user a way to have Covalent simply run `python <NameOfJob.py>` without `srun`, while still submitting the workflow as a SLURM job so that the proper resources are reserved. I propose introducing a new kwarg named `use_srun: bool`, which defaults to `True` but can be set to `False`.
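As a rough illustration of the proposal, the sketch below shows how a `use_srun` flag could switch the final command of the submission script. The helper `build_python_command` and its `srun_options` parameter are hypothetical and are not taken from the plugin's code; the actual implementation may look quite different.

```python
import shlex


def build_python_command(script_path, use_srun=True, srun_options=None):
    """Build the line that runs the pickled-function script.

    Hypothetical helper: with use_srun=True the script is wrapped in srun
    (the current behavior); with use_srun=False it is run with plain python.
    """
    argv = ["python", script_path]
    if use_srun:
        argv = ["srun", *(srun_options or []), *argv]
    return shlex.join(argv)


# Current behavior: the script is launched through srun.
print(build_python_command("script.py"))
# Proposed opt-out: plain python inside the same Slurm allocation.
print(build_python_command("script.py", use_srun=False))
```

Keeping `True` as the default would leave existing workflows unchanged, and only users whose functions launch their own parallel job steps would need to opt out.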

I understand that this usage pattern might be difficult to visualize, so I can definitely explain further. It's a subtle but incredibly important usage pattern though!

Describe alternatives you've considered.

The only alternatives are to clone the plugin and modify the relevant line, or to avoid using entire software ecosystems like ASE with Covalent altogether.