Currently, the plugin always runs the Python file with srun (potentially with some additional srun options). An example submit script might look like the following:
#!/bin/bash
#SBATCH --partition=debug
#SBATCH --account=matgen
#SBATCH --constraint=cpu
#SBATCH --job-name=covalent
#SBATCH --time=00:10:00
#SBATCH --parsable
#SBATCH --output=/global/cfs/cdirs/matgen/arosen/covalent10/stdout-7d947ff3-c38a-452c-a5b5-b03555510d5b-0.log
#SBATCH --error=/global/cfs/cdirs/matgen/arosen/covalent10/stderr-7d947ff3-c38a-452c-a5b5-b03555510d5b-0.log
source $HOME/.bashrc
conda activate covalent
retval=$?
if [ $retval -ne 0 ] ; then
    >&2 echo "Conda environment covalent is not present on the compute node. " "Please create the environment and try again."
    exit 99
fi
remote_py_version=$(python -c "print('.'.join(map(str, __import__('sys').version_info[:2])))")
if [[ "3.9" != $remote_py_version ]] ; then
    >&2 echo "Python version mismatch. Please install Python 3.9 in the compute environment."
    exit 199
fi
srun \
python /global/cfs/cdirs/matgen/arosen/covalent10/script-7d947ff3-c38a-452c-a5b5-b03555510d5b-0.py
wait
However, as originally noted by my colleague @rwexler and confirmed by me, this can be problematic depending on the use case. For instance, in computational materials science, a common usage pattern is to use SLURM to submit a Python job. That Python job then does several things, including making calls to and running external codes. This is how the Atomic Simulation Environment (ASE) works, which is a tool highlighted in one of the tutorials. You can see an example of how that's done in practice here, if you're curious.
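As a rough illustration of the pattern described above (this is not ASE's actual code), here is a minimal sketch of a Python job that hands work off to an external program. The function name run_external_code is hypothetical. In a real workflow the command would be something like an srun or mpirun invocation of a simulation binary; the usage below substitutes a harmless Python one-liner so the sketch runs anywhere.

```python
import subprocess
import sys
from typing import Sequence


def run_external_code(command: Sequence[str]) -> str:
    """Run an external program from inside the Python job and return its stdout.

    Hypothetical helper for illustration only. In practice `command` would
    often start with a parallel launcher such as srun or mpirun.
    """
    result = subprocess.run(
        list(command),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


if __name__ == "__main__":
    # Placeholder for the external code: a Python one-liner.
    placeholder = [sys.executable, "-c", "print('external code finished')"]
    print(run_external_code(placeholder))
```

If the Python job itself is started with srun, any launcher inside the command would then be running underneath that outer srun.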
The problem with always using srun is that if the pickled Python function at some point launches an external, parallel code with srun or mpirun, you end up with nested parallelism, which can lead to unintended behavior when running on multiple cores/nodes. We want to give the user a way to have Covalent simply run python <NameOfJob.py> without srun, while still submitting the workflow as a SLURM job so that the proper resources are reserved. I propose introducing a new kwarg named use_srun: bool, which defaults to True but can be set to False.
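To make the proposal concrete, here is a minimal sketch of how a use_srun switch could select the launch line of the submit script. The function build_run_command and its srun_options parameter are hypothetical names for illustration; they are not taken from the plugin's code, and the real implementation may be structured quite differently.

```python
import shlex
from typing import Optional, Sequence


def build_run_command(
    script_path: str,
    use_srun: bool = True,
    srun_options: Optional[Sequence[str]] = None,
) -> str:
    """Return the shell line that launches the pickled-function script.

    Hypothetical helper: with use_srun=True the script is wrapped in srun
    (plus any extra options); with use_srun=False it is run with plain python.
    """
    python_cmd = f"python {shlex.quote(script_path)}"
    if not use_srun:
        # The job still runs inside the SLURM allocation; only the srun wrapper is dropped.
        return python_cmd
    parts = ["srun", *(srun_options or []), python_cmd]
    return " ".join(parts)


if __name__ == "__main__":
    print(build_run_command("/path/to/script.py"))
    print(build_run_command("/path/to/script.py", use_srun=False))
```

Keeping the default at True would leave existing behavior unchanged, so only users whose functions launch their own parallel codes would need to opt out.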
I understand that this usage pattern might be difficult to visualize, so I can definitely explain further. It's a subtle but incredibly important usage pattern though!
Describe alternatives you've considered.
The only alternatives are to clone the plugin and modify the line that launches the script, or to avoid using entire software ecosystems like ASE with Covalent altogether.