Closed by @craigwarner-ufastro 3 months ago
@craigwarner-ufastro Submitting the following Slurm script
```
#!/bin/bash -l
#SBATCH --account=desi
#SBATCH --qos=debug
#SBATCH --constraint=gpu
#SBATCH --mail-user=jmoustakas@siena.edu
#SBATCH --mail-type=ALL
#SBATCH --nodes=4
#SBATCH --time=00:30:00
#SBATCH --output=/pscratch/sd/i/ioannis/Y3-templates/scripts_and_logs/run-redrock-iron-cumulative-vi-main-Y3-0.1-zscan01-%j.log

source /global/common/software/desi/desi_environment.sh main
export PATH=${HOME}/code/desihub/redrock/bin:${PATH}
export PYTHONPATH=${HOME}/code/desihub/redrock/py:${PYTHONPATH}
export PATH=${HOME}/code/desihub/desispec/bin:${PATH}
export PYTHONPATH=${HOME}/code/desihub/desispec/py:${PYTHONPATH}
export RR_TEMPLATE_DIR=/pscratch/sd/i/ioannis/Y3-templates/rrtemplates/Y3-0.1

cmd="srun --ntasks=16 --cpus-per-task=2 --gpu-bind=map_gpu:3,2,1,0 wrap_rrdesi --input=/pscratch/sd/i/ioannis/Y3-templates/redrock/iron-cumulative-vi-main/Y3-0.1-zscan01/coadd-filelist.txt --output=/pscratch/sd/i/ioannis/Y3-templates/redrock/iron-cumulative-vi-main/Y3-0.1-zscan01 --rrdetails --gpu"
echo $cmd
$cmd
```
leads to the following (catastrophic) output:
```
srun --ntasks=16 --cpus-per-task=2 --gpu-bind=map_gpu:3,2,1,0 wrap_rrdesi --input=/pscratch/sd/i/ioannis/Y3-templates/redrock/iron-cumulative-vi-main/Y3-0.1-zscan01/coadd-filelist.txt --output=/pscratch/sd/i/ioannis/Y3-templates/redrock/iron-cumulative-vi-main/Y3-0.1-zscan01 --rrdetails --gpu
Running 73 input files on 16 GPUs and 16 total procs...
ERROR: cupy or GPU not available
MPICH Notice [Rank 0] [job id 23712748.0] [Sat Mar 30 18:11:36 2024] [nid001733] - Abort(0) (rank 0 in comm 496): application called MPI_Abort(comm=0x84000002, 0) - process 0
ERROR: cupy or GPU not available
MPICH Notice [Rank 1] [job id 23712748.0] [Sat Mar 30 18:11:36 2024] [nid001733] - Abort(0) (rank 0 in comm 496): application called MPI_Abort(comm=0x84000001, 0) - process 0
ERROR: cupy or GPU not available
MPICH Notice [Rank 2] [job id 23712748.0] [Sat Mar 30 18:11:36 2024] [nid001733] - Abort(0) (rank 0 in comm 496): application called MPI_Abort(comm=0x84000001, 0) - process 0
ERROR: cupy or GPU not available
MPICH Notice [Rank 3] [job id 23712748.0] [Sat Mar 30 18:11:36 2024] [nid001733] - Abort(0) (rank 0 in comm 496): application called MPI_Abort(comm=0x84000001, 0) - process 0
[snip]
```
Can you help diagnose?
Thanks to @sbailey for the fix; just needed

```
#SBATCH --gpus-per-node=4
```

to actually get GPUs.
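For reference, a minimal sketch of a corrected job preamble on a Perlmutter-style GPU partition (account, QOS, and node count are illustrative, copied from the script above):

```shell
#!/bin/bash -l
#SBATCH --account=desi
#SBATCH --qos=debug
#SBATCH --constraint=gpu
#SBATCH --gpus-per-node=4    # without this line, srun sees no GPUs and cupy fails
#SBATCH --nodes=4
#SBATCH --time=00:30:00
```

The `--constraint=gpu` flag only routes the job to GPU nodes; `--gpus-per-node` is what actually allocates the devices to the job step.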
@akremin I made a change to a bit of code that you wrote which is only tangentially related to this PR. Can you please review? https://github.com/desihub/desispec/pull/2196/commits/4a48d4418db2b7ab87fafbdbb441a7fad4d3f9bd
In essence, if `coadd` is in the directory filepath to the input coadd file to QuasarNet, then the code in `main` crashes because the `replace` doesn't take into account this corner case. I made a (trivial) change to handle this situation, which impacts my ongoing template work.
@moustakas we've tripped on that "change the file prefix" corner-case bug multiple times elsewhere; `desispec.io.util.replace_prefix(filepath, oldprefix, newprefix)` is the intended standard solution.
https://github.com/desihub/desispec/blob/main/py/desispec/io/util.py#L913
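To illustrate the corner case: a naive `str.replace` rewrites every occurrence of the old prefix, including matches inside directory names. A sketch of the prefix-only approach (the idea behind `replace_prefix`; not the actual desispec code, and the example path is made up):

```python
import os

def replace_prefix(filepath, oldprefix, newprefix):
    """Replace oldprefix only at the start of the basename, leaving the
    directory path untouched (sketch, not the desispec implementation)."""
    dirname, basename = os.path.split(filepath)
    if not basename.startswith(oldprefix):
        raise ValueError(f"{basename!r} does not start with {oldprefix!r}")
    return os.path.join(dirname, newprefix + basename[len(oldprefix):])

path = "/data/coadd-files/coadd-main-1234.fits"

# Naive replace corrupts the directory name too:
print(path.replace("coadd", "qso_qn"))
# Prefix-only replace changes just the filename:
print(replace_prefix(path, "coadd", "qso_qn"))
```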
Thanks @sbailey I knew that functionality existed somewhere, but I couldn't find it.
At today's data telecon, John described this new script as "I think I'm in love", so I'll take that as approval and merge this. If there are additional features / bugfixes needed, please open new tickets/PRs.
`wrap_rrdesi` will read the input file list, run Redrock on 1 GPU per file, and write output to the specified output directory. Extra non-GPU tasks will be used to run additional files in CPU mode (default up to 32 CPUs per task) unless otherwise specified.
E.g. on 2 nodes with 4 GPUs each and a list of 50 input files, `srun -N 2 -n 64 -c 2 --gpu-bind=map_gpu:3,2,1,0 wrap_rrdesi -i input_files.txt -o outputdir --gpu` will run 1 file on each of the 8 GPUs and also run two additional CPU-only tasks with 28 CPUs per task, so 10 files will be run simultaneously and the loop will iterate 5 times.
`srun -N 2 -n 8 -c 2 --gpu-bind=map_gpu:3,2,1,0 wrap_rrdesi -i input_files.txt -o outputdir --gpu` will result in 1 file run on each of the 8 GPUs with no additional CPU-only tasks available, so 8 files will be run simultaneously and the loop will iterate 7 times, with the last iteration using only 2 of the 8 GPUs.
`srun -N 2 -n 64 -c 2 --gpu-bind=map_gpu:3,2,1,0 wrap_rrdesi -i input_files.txt -o outputdir --gpu --gpuonly` is functionally equivalent to the above because only GPU nodes will be used with the `--gpuonly` argument.
`--cpu-per-task` defaults to 32 if not given and controls the maximum number of CPUs to use in CPU-only tasks.
If an output file exists, it will not be overwritten unless `--overwrite` is specified.
`--inputdir` may optionally be specified and is prepended to every file in the input file list.
To run in CPU-only mode without any GPUs, omit `--gpu`.
`--gpu-bind=map_gpu:3,2,1,0` is a required argument to `srun` when using GPUs.
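The batching arithmetic in the examples above can be sketched as follows (an assumed model of the scheduling, not the actual `wrap_rrdesi` code: each iteration runs one file per GPU plus one file per CPU-only task):

```python
import math

def wrap_rrdesi_schedule(nfiles, ngpus, ncpu_tasks):
    """Return (files per iteration, number of iterations) under the
    simple model that each loop iteration processes one file per GPU
    and one file per CPU-only task."""
    per_iter = ngpus + ncpu_tasks
    iterations = math.ceil(nfiles / per_iter)
    return per_iter, iterations

# 2 nodes x 4 GPUs plus 2 CPU-only tasks, 50 files: 10 at a time, 5 iterations
print(wrap_rrdesi_schedule(50, 8, 2))   # (10, 5)
# GPUs only: 8 at a time, 7 iterations (the last uses 2 of the 8 GPUs)
print(wrap_rrdesi_schedule(50, 8, 0))   # (8, 7)
```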