desihub / desispec

DESI spectral pipeline
BSD 3-Clause "New" or "Revised" License

Added new script wrap_rrdesi that will take an input ASCII list of files #2196

Closed craigwarner-ufastro closed 3 months ago

craigwarner-ufastro commented 4 months ago

The new `wrap_rrdesi` script takes an input ASCII list of files, runs redrock on 1 GPU per file, and writes results to the specified output directory. Extra non-GPU ranks are used to run additional files in CPU mode (by default up to 32 CPUs per task) unless otherwise specified.

E.g., on 2 nodes with 4 GPUs each and a list of 50 input files, `srun -N 2 -n 64 -c 2 --gpu-bind=map_gpu:3,2,1,0 wrap_rrdesi -i input_files.txt -o outputdir --gpu` will run 1 file on each of the 8 GPUs plus two additional CPU-only tasks with 28 CPUs per task, so 10 files run simultaneously and the loop iterates 5 times.

`srun -N 2 -n 8 -c 2 --gpu-bind=map_gpu:3,2,1,0 wrap_rrdesi -i input_files.txt -o outputdir --gpu` will run 1 file on each of the 8 GPUs; with no ranks left over for CPU-only tasks, 8 files run simultaneously and the loop iterates 7 times, the last iteration using only 2 of the 8 GPUs.

`srun -N 2 -n 64 -c 2 --gpu-bind=map_gpu:3,2,1,0 wrap_rrdesi -i input_files.txt -o outputdir --gpu --gpuonly` is functionally equivalent to the above, because with `--gpuonly` only the GPU ranks are used.

`--cpu-per-task` defaults to 32 if not given and controls the maximum number of CPUs to use per CPU-only task.

If an output file already exists, it will not be overwritten unless `--overwrite` is specified.

`--inputdir` may optionally be specified; it is prepended to every file in the input file list.

To run in CPU-only mode without any GPUs, omit `--gpu`.

`--gpu-bind=map_gpu:3,2,1,0` is a required argument to `srun` when using GPUs.
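The per-iteration arithmetic described above can be sketched as follows. This is a hypothetical helper written for illustration, not the actual `wrap_rrdesi` code: one file runs per GPU per iteration, and leftover MPI ranks are grouped into CPU-only tasks of at most `cpus_per_task` ranks each.

```python
import math

def plan_iterations(nfiles, ntasks, ngpus, cpus_per_task=32):
    """Sketch of the scheduling arithmetic (hypothetical, for illustration).

    Each GPU processes one file per iteration; ranks not bound to a GPU
    are grouped into CPU-only tasks of up to cpus_per_task ranks each.
    """
    spare = max(ntasks - ngpus, 0)
    cpu_tasks = math.ceil(spare / cpus_per_task) if spare else 0
    files_per_iter = ngpus + cpu_tasks
    iterations = math.ceil(nfiles / files_per_iter)
    return files_per_iter, iterations

# 2 nodes x 4 GPUs, 64 ranks, 50 input files:
print(plan_iterations(50, 64, 8))  # -> (10, 5): 8 GPU files + 2 CPU tasks, 5 passes
# Only 8 ranks, all bound to GPUs:
print(plan_iterations(50, 8, 8))   # -> (8, 7): GPUs alone, last pass underfilled
```

Both calls reproduce the counts quoted in the examples above (10 files over 5 iterations, and 8 files over 7 iterations).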

moustakas commented 3 months ago

@craigwarner-ufastro submitting the following slurm script

```bash
#!/bin/bash -l

#SBATCH --account=desi
#SBATCH --qos=debug
#SBATCH --constraint=gpu
#SBATCH --mail-user=jmoustakas@siena.edu
#SBATCH --mail-type=ALL
#SBATCH --nodes=4
#SBATCH --time=00:30:00
#SBATCH --output=/pscratch/sd/i/ioannis/Y3-templates/scripts_and_logs/run-redrock-iron-cumulative-vi-main-Y3-0.1-zscan01-%j.log

source /global/common/software/desi/desi_environment.sh main
export PATH=${HOME}/code/desihub/redrock/bin:${PATH}
export PYTHONPATH=${HOME}/code/desihub/redrock/py:${PYTHONPATH}
export PATH=${HOME}/code/desihub/desispec/bin:${PATH}
export PYTHONPATH=${HOME}/code/desihub/desispec/py:${PYTHONPATH}
export RR_TEMPLATE_DIR=/pscratch/sd/i/ioannis/Y3-templates/rrtemplates/Y3-0.1

cmd="srun --ntasks=16 --cpus-per-task=2 --gpu-bind=map_gpu:3,2,1,0 wrap_rrdesi --input=/pscratch/sd/i/ioannis/Y3-templates/redrock/iron-cumulative-vi-main/Y3-0.1-zscan01/coadd-filelist.txt --output=/pscratch/sd/i/ioannis/Y3-templates/redrock/iron-cumulative-vi-main/Y3-0.1-zscan01 --rrdetails --gpu"
echo $cmd
$cmd
```

leads to the following (catastrophic) output:

```
srun --ntasks=16 --cpus-per-task=2 --gpu-bind=map_gpu:3,2,1,0 wrap_rrdesi --input=/pscratch/sd/i/ioannis/Y3-templates/redrock/iron-cumulative-vi-main/Y3-0.1-zscan01/coadd-filelist.txt --output=/pscratch/sd/i/ioannis/Y3-templates/redrock/iron-cumulative-vi-main/Y3-0.1-zscan01 --rrdetails --gpu
Running 73 input files on 16 GPUs and 16 total procs...
ERROR: cupy or GPU not available
MPICH Notice [Rank 0] [job id 23712748.0] [Sat Mar 30 18:11:36 2024] [nid001733] - Abort(0) (rank 0 in comm 496): application called MPI_Abort(comm=0x84000002, 0) - process 0

ERROR: cupy or GPU not available
MPICH Notice [Rank 1] [job id 23712748.0] [Sat Mar 30 18:11:36 2024] [nid001733] - Abort(0) (rank 0 in comm 496): application called MPI_Abort(comm=0x84000001, 0) - process 0

ERROR: cupy or GPU not available
MPICH Notice [Rank 2] [job id 23712748.0] [Sat Mar 30 18:11:36 2024] [nid001733] - Abort(0) (rank 0 in comm 496): application called MPI_Abort(comm=0x84000001, 0) - process 0

ERROR: cupy or GPU not available
MPICH Notice [Rank 3] [job id 23712748.0] [Sat Mar 30 18:11:36 2024] [nid001733] - Abort(0) (rank 0 in comm 496): application called MPI_Abort(comm=0x84000001, 0) - process 0
[snip]
```

Can you help diagnose?

moustakas commented 3 months ago

Thanks to @sbailey for the fix; just needed

`#SBATCH --gpus-per-node=4`

to actually get GPUs.

moustakas commented 3 months ago

@akremin I made a change to a bit of code that you wrote which is only tangentially related to this PR. Can you please review? https://github.com/desihub/desispec/pull/2196/commits/4a48d4418db2b7ab87fafbdbb441a7fad4d3f9bd

In essence, if `coadd` appears in the directory path of the input coadd file passed to QuasarNet, the code in `main` crashes because the string `replace` doesn't account for this corner case. I made a (trivial) change to handle this situation, which impacts my ongoing template work.

sbailey commented 3 months ago

@moustakas we've tripped on that "change the file prefix" corner case bug multiple times elsewhere; `desispec.io.util.replace_prefix(filepath, oldprefix, newprefix)` is the intended standard solution.

https://github.com/desihub/desispec/blob/main/py/desispec/io/util.py#L913
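The failure mode and the fix can be illustrated with a minimal sketch of the same idea. This is a simplified stand-in for illustration, not the actual desispec implementation: only replace `oldprefix` where it starts the basename, leaving any occurrence of the same string in the directory path untouched.

```python
import os

def replace_prefix(filepath, oldprefix, newprefix):
    """Sketch of the replace_prefix idea (simplified stand-in, not the
    actual desispec code): swap the prefix of the basename only."""
    dirname, basename = os.path.split(filepath)
    if basename.startswith(oldprefix):
        basename = newprefix + basename[len(oldprefix):]
    return os.path.join(dirname, basename)

path = "/data/coadd/coadd-sv1-dark-123.fits"
# A naive str.replace also corrupts the directory component:
print(path.replace("coadd", "redrock"))       # /data/redrock/redrock-sv1-dark-123.fits
# The prefix-aware version leaves the directory alone:
print(replace_prefix(path, "coadd", "redrock"))  # /data/coadd/redrock-sv1-dark-123.fits
```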

moustakas commented 3 months ago

Thanks @sbailey I knew that functionality existed somewhere, but I couldn't find it.

sbailey commented 3 months ago

At today's data telecon, John described this new script as "I think I'm in love", so I'll take that as approval and merge this. If there are additional features / bugfixes needed, please open new tickets/PRs.