esm-tools / esm_tools

Simple Infrastructure for Earth System Simulations
https://esm-tools.github.io/
GNU General Public License v2.0

multiple postprocessing scripts screw script generation #554

Closed: seb-wahl closed this issue 2 years ago

seb-wahl commented 2 years ago

Describe the bug: If I add a postprocessing section for two components (echam and nemo in foci.yaml) of a coupled setup (see https://github.com/esm-tools/esm_tools/blob/ab2e86ead81a83e6f043ff8d403056378cf8c9d1/configs/setups/foci/foci.yaml#L281-L299 and https://github.com/esm-tools/esm_tools/blob/ab2e86ead81a83e6f043ff8d403056378cf8c9d1/configs/setups/foci/foci.yaml#L385-L403), then a single postprocessing script is generated that contains the calls to both postprocessing scripts:

#!/bin/bash
#SBATCH --partition=standard96
#SBATCH --time=06:00:00
#SBATCH --ntasks=2
#SBATCH --output=/home/shkifmsw/esm/esm-experiments//viking10_initial_pp2/log/viking10_initial_pp2_foci_postprocessing_18510101-18511231.log --error=/home/shkifmsw/esm/esm-experiments//viking10_initial_pp2/log/viking10_initial_pp2_foci_postprocessing_18510101-18511231.log
#SBATCH --job-name=viking10_initial_pp2
#SBATCH --account=shk00018
#SBATCH --mail-type=NONE
#SBATCH --ntasks-per-core=1
#SBATCH --exclusive

module purge
module load slurm
module load HLRNenv
module load sw.skl
module load cmake
module load cdo nco
module load intel/19.0.5
module load impi/2019.5
source $I_MPI_ROOT/intel64/bin/mpivars.sh release_mt

export LC_ALL=en_US.UTF-8
export FC=mpiifort
export F77=mpiifort
export MPIFC=mpiifort
export FCFLAGS=-free
export CC=mpiicc
export CXX=mpiicpc
export MPIROOT="$(mpiifort -show | perl -lne 'm{ -I(.*?)/include } and print $1')"
export MPI_LIB="$(mpiifort -show |sed -e 's/^[^ ]*//' -e 's/-[I][^ ]*//g')"
export IO_LIB_ROOT=/home/shkifmsw/sw/HPC_libraries/intel2019.0.5_impi2019.5_20200811
export PATH=$IO_LIB_ROOT/bin:$PATH
export LD_LIBRARY_PATH=$IO_LIB_ROOT/lib:$LD_LIBRARY_PATH
export SZIPROOT=$IO_LIB_ROOT
export HDF5ROOT=$IO_LIB_ROOT
export HDF5_ROOT=$HDF5ROOT
export NETCDFROOT=$IO_LIB_ROOT
export NETCDFFROOT=$IO_LIB_ROOT
export ECCODESROOT=$IO_LIB_ROOT
export HDF5_C_INCLUDE_DIRECTORIES=$HDF5_ROOT/include
export NETCDF_Fortran_INCLUDE_DIRECTORIES=$NETCDFFROOT/include
export NETCDF_C_INCLUDE_DIRECTORIES=$NETCDFROOT/include
export NETCDF_CXX_INCLUDE_DIRECTORIES=$NETCDFROOT/include
export OASIS3MCT_FC_LIB="-L$NETCDFFROOT/lib -lnetcdff"
export HOME=/home/shkifmsw
export ENVIRONMENT_SET_BY_ESMTOOLS=TRUE

# Set stack size to unlimited
ulimit -s unlimited
# 3...2...1...Liftoff!

echo $(date +"%a %b  %e %T %Y") : postprocessing_echam 2 1851-01-01T00:00:00 3403556 - start >> /home/shkifmsw/esm/esm-experiments//viking10_initial_pp2/log//viking10_initial_pp2_foci.log

cd /home/shkifmsw/esm/esm-experiments//viking10_initial_pp2/run_18510101-18511231/work/
time srun --mpi=pmi2 -l --kill-on-bad-exit=1 --cpu_bind=cores --distribution=cyclic:cyclic --export=ALL /home/shkifmsw/esm/esm_tools/configs/.//setups/foci/echam_postprocessing.sh -r viking10_initial_pp2 -s 1851 -e 1851 2>&1 &
module purge
module load slurm
module load HLRNenv
module load sw.skl
module load cmake
module load cdo nco
module load intel/19.0.5
module load impi/2019.5
source $I_MPI_ROOT/intel64/bin/mpivars.sh release_mt

export LC_ALL=en_US.UTF-8
export FC=mpiifort
export F77=mpiifort
export MPIFC=mpiifort
export FCFLAGS=-free
export CC=mpiicc
export CXX=mpiicpc
export MPIROOT="$(mpiifort -show | perl -lne 'm{ -I(.*?)/include } and print $1')"
export MPI_LIB="$(mpiifort -show |sed -e 's/^[^ ]*//' -e 's/-[I][^ ]*//g')"
export IO_LIB_ROOT=/home/shkifmsw/sw/HPC_libraries/intel2019.0.5_impi2019.5_20200811
export PATH=$IO_LIB_ROOT/bin:$PATH
export LD_LIBRARY_PATH=$IO_LIB_ROOT/lib:$LD_LIBRARY_PATH
export SZIPROOT=$IO_LIB_ROOT
export HDF5ROOT=$IO_LIB_ROOT
export HDF5_ROOT=$HDF5ROOT
export NETCDFROOT=$IO_LIB_ROOT
export NETCDFFROOT=$IO_LIB_ROOT
export ECCODESROOT=$IO_LIB_ROOT
export HDF5_C_INCLUDE_DIRECTORIES=$HDF5_ROOT/include
export NETCDF_Fortran_INCLUDE_DIRECTORIES=$NETCDFFROOT/include
export NETCDF_C_INCLUDE_DIRECTORIES=$NETCDFROOT/include
export NETCDF_CXX_INCLUDE_DIRECTORIES=$NETCDFROOT/include
export OASIS3MCT_FC_LIB="-L$NETCDFFROOT/lib -lnetcdff"
export HOME=/home/shkifmsw
export ENVIRONMENT_SET_BY_ESMTOOLS=TRUE

# Set stack size to unlimited
ulimit -s unlimited
# 3...2...1...Liftoff!

echo $(date +"%a %b  %e %T %Y") : postprocessing_nemo 2 1851-01-01T00:00:00 3403556 - start >> /home/shkifmsw/esm/esm-experiments//viking10_initial_pp2/log//viking10_initial_pp2_foci.log

cd /home/shkifmsw/esm/esm-experiments//viking10_initial_pp2/run_18510101-18511231/work/
time srun --mpi=pmi2 -l --kill-on-bad-exit=1 --cpu_bind=cores --distribution=cyclic:cyclic --export=ALL /home/shkifmsw/esm/esm_tools/configs/.//setups/foci/nemo_postprocessing.sh -m -r viking10_initial_pp2 -s 1851 -e 1851 2>&1 &

wait

This script is called TWICE at exactly the same time by the workflow manager. To Reproduce: see above.

Expected behavior: Two scripts that are executed independently of each other as separate jobs. I went through the code a bit and there are things such as subjob_clusters etc., but I couldn't find any documentation on how to use those settings.


mandresm commented 2 years ago

Hi @seb-wahl ,

This is indeed weird. I would have expected that giving the two subjobs the same name (postprocessing) would result in one or the other being run, not both in the same script. Can you try changing the names of the subjobs to postprocessing_echam and postprocessing_nemo, or something along those lines?

Once we understand how this really works, I'll add some error handling and documentation so that no one else runs into the same trap.
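A minimal sketch of what such error handling could look like, assuming each component's configuration is available as a plain dict with a `workflow` / `subjobs` mapping. The function name `find_duplicate_subjobs` and the config layout are hypothetical and are not taken from the esm_tools code base:

```python
from collections import defaultdict


def find_duplicate_subjobs(config):
    """Return {subjob_name: [components]} for names used by more than one component.

    Assumed (hypothetical) layout:
    {component: {"workflow": {"subjobs": {subjob_name: {...}}}}}
    """
    owners = defaultdict(list)
    for component, settings in config.items():
        subjobs = settings.get("workflow", {}).get("subjobs", {})
        for name in subjobs:
            owners[name].append(component)
    # Keep only the names that are claimed by several components.
    return {name: comps for name, comps in owners.items() if len(comps) > 1}


# Both components call their subjob "postprocessing", as in this report.
config = {
    "echam": {"workflow": {"subjobs": {"postprocessing": {"script": "echam_postprocessing.sh"}}}},
    "nemo": {"workflow": {"subjobs": {"postprocessing": {"script": "nemo_postprocessing.sh"}}}},
}

duplicates = find_duplicate_subjobs(config)
for name, comps in duplicates.items():
    print(f"subjob '{name}' is defined by several components: {', '.join(comps)}")
```

A check along these lines could run before any script is written, so that a name clash is reported early instead of producing a merged script.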

seb-wahl commented 2 years ago

If I change the name of the subjobs it works as expected. I guess somewhere in the code of the workflow manager all scripts behind a name are concatenated.
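The guess above can be illustrated with a toy model. This is not the esm_tools implementation; `collect_scripts` and the tuple-based input are hypothetical and only show how grouping script calls by subjob name would merge same-named subjobs into one entry while uniquely named ones stay separate:

```python
def collect_scripts(subjob_definitions):
    """Group script calls by subjob name; same-named entries share one list."""
    scripts_by_name = {}
    for name, script in subjob_definitions:
        scripts_by_name.setdefault(name, []).append(script)
    return scripts_by_name


# Same subjob name in both components: everything lands under one key.
same_name = [
    ("postprocessing", "echam_postprocessing.sh"),
    ("postprocessing", "nemo_postprocessing.sh"),
]
# Unique subjob names: one key, and therefore one script, per component.
unique_names = [
    ("postprocessing_echam", "echam_postprocessing.sh"),
    ("postprocessing_nemo", "nemo_postprocessing.sh"),
]

merged = collect_scripts(same_name)
separate = collect_scripts(unique_names)
print(len(merged), "script(s) when the names are identical")
print(len(separate), "script(s) when the names are unique")
```

If the workflow manager behaves like this model, one generated script per key would explain why both calls appear in a single file when the names match.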