desihub / desispec

DESI spectral pipeline
BSD 3-Clause "New" or "Revised" License

Largest healpix jobs in Jura crash without logging an error message #2277

Open akremin opened 3 weeks ago

akremin commented 3 weeks ago

When tiles 27256 and 27258 were run in Jura, they failed because they had over 2000 and over 4000 input files, respectively. The original scripts:

/global/cfs/cdirs/desi/spectro/redux/jura/run/scripts/healpix/special/other/272/zpix-special-other-27256.slurm
/global/cfs/cdirs/desi/spectro/redux/jura/run/scripts/healpix/special/other/272/zpix-special-other-27258.slurm

failed, as did others in Jura with large memory footprints.

We tried reducing the number of MPI ranks, which worked on some other tiles with large memory footprints:

/global/cfs/cdirs/desi/spectro/redux/jura/run/scripts/healpix/special/other/272/zpix-special-other-27256-lowrank.slurm
/global/cfs/cdirs/desi/spectro/redux/jura/run/scripts/healpix/special/other/272/zpix-special-other-27258-lowrank.slurm

but that didn't work either.

Running with a single MPI rank on a CPU node also failed. What eventually worked was running the explicit python command desi_group_spectra in serial (no MPI) on a CPU node, then running redrock and the afterburners with normal MPI.
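That workaround suggests the pipeline could switch off MPI automatically for extreme healpix jobs rather than only lowering the rank count. A hypothetical sketch of that control flow, assuming a file-count threshold and a communicator object; the threshold value and function name are illustrative, not desispec's actual API:

```python
# Hypothetical sketch: decide whether this process should run the
# memory-heavy grouping step in serial instead of under MPI.
# The threshold is illustrative; tiles 27256/27258 had >2000 files.
MAX_INPUT_FILES_FOR_MPI = 2000

def should_run_grouping(input_files, comm=None):
    """Return True if this process should perform the grouping step.

    `comm` is an MPI communicator (None means serial execution).
    For oversized jobs, only rank 0 does the work, avoiding the
    per-rank memory duplication that killed the Jura jobs.
    """
    too_big = len(input_files) > MAX_INPUT_FILES_FOR_MPI
    if comm is None or too_big:
        # Serial path: the lone process, or rank 0 only.
        return comm is None or comm.rank == 0
    # Small enough: every rank participates in the MPI grouping.
    return True
```

A later step (redrock, afterburners) could still use the full communicator; only the grouping step drops to serial.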

What made this case harder to debug was that the error messages were not propagated to the logs. Two things should be done to mitigate this:

1. Identify why the error isn't logged, and fix that.
2. Solve the underlying issue that caused the scripts to fail, likely by reducing the number of ranks or eliminating MPI entirely for these extreme cases.
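For the first point, one common reason a crash leaves no trace is that the uncaught exception goes to stderr (or the process dies on a signal) without passing through the logger. A minimal, generic Python sketch of routing uncaught exceptions through logging, not desispec's actual logging setup; note that nothing in-process can log a SIGKILL from the kernel OOM killer:

```python
import faulthandler
import logging
import sys
import traceback

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("zpix")

def log_uncaught(exc_type, exc_value, exc_tb):
    """Write any uncaught exception to the log before the interpreter exits."""
    log.critical("Uncaught exception:\n%s",
                 "".join(traceback.format_exception(exc_type, exc_value, exc_tb)))

# Route uncaught exceptions through the logger instead of bare stderr,
# so batch-job logs capture the traceback.
sys.excepthook = log_uncaught

# Dump Python tracebacks on hard faults (segfault, SIGABRT). This cannot
# help if the OOM killer sends SIGKILL, which is uncatchable.
faulthandler.enable(file=sys.stderr)
```

If the jobs are dying to SIGKILL, only external evidence (e.g. `sacct` memory-high-water marks or node dmesg) will show it, which would explain the empty logs.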