desihub / desispec

DESI spectral pipeline
BSD 3-Clause "New" or "Revised" License

Largest healpix jobs in Jura crash without logging an error message #2277

Open akremin opened 3 weeks ago

akremin commented 3 weeks ago

When tiles 27256 and 27258 were run in Jura, they failed because they had over 2000 and over 4000 input files, respectively. The original scripts:

/global/cfs/cdirs/desi/spectro/redux/jura/run/scripts/healpix/special/other/272/zpix-special-other-27256.slurm
/global/cfs/cdirs/desi/spectro/redux/jura/run/scripts/healpix/special/other/272/zpix-special-other-27258.slurm

failed, as did others in Jura with large memory footprints.

We tried reducing the number of MPI ranks, which worked on some other tiles with large memory footprints:

/global/cfs/cdirs/desi/spectro/redux/jura/run/scripts/healpix/special/other/272/zpix-special-other-27256-lowrank.slurm
/global/cfs/cdirs/desi/spectro/redux/jura/run/scripts/healpix/special/other/272/zpix-special-other-27258-lowrank.slurm

but that didn't work either.

Running with a single MPI rank on a CPU node also failed. What eventually worked was running the explicit python command desi_group_spectra in serial (no MPI) on a CPU node, then running redrock and the afterburners with normal MPI.
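That workaround suggests the pipeline could switch off MPI automatically for extreme healpix jobs rather than only lowering the rank count. A hypothetical sketch of that control flow, assuming a file-count threshold and a communicator object; the threshold value and function name are illustrative, not desispec's actual API:

```python
# Hypothetical sketch: decide whether this process should run the
# memory-heavy grouping step in serial instead of under MPI.
# The threshold is illustrative; tiles 27256/27258 had >2000 files.
MAX_INPUT_FILES_FOR_MPI = 2000

def should_run_grouping(input_files, comm=None):
    """Return True if this process should perform the grouping step.

    `comm` is an MPI communicator (None means serial execution).
    For oversized jobs, only rank 0 does the work, avoiding the
    per-rank memory duplication that killed the Jura jobs.
    """
    too_big = len(input_files) > MAX_INPUT_FILES_FOR_MPI
    if comm is None or too_big:
        # Serial path: the lone process, or rank 0 only.
        return comm is None or comm.rank == 0
    # Small enough: every rank participates in the MPI grouping.
    return True
```

A later step (redrock, afterburners) could still use the full communicator; only the grouping step drops to serial.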

What made this case harder to debug was that the error messages were not propagated to the logs. Two things should be done to mitigate this:

1. Identify why the error isn't logged, and fix that.
2. Solve the underlying issue that caused the scripts to fail, likely by reducing the number of ranks or eliminating MPI entirely for these extreme cases.
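For the first point, one common reason a crash leaves no trace is that the uncaught exception goes to stderr (or the process dies on a signal) without passing through the logger. A minimal, generic Python sketch of routing uncaught exceptions through logging, not desispec's actual logging setup; note that nothing in-process can log a SIGKILL from the kernel OOM killer:

```python
import faulthandler
import logging
import sys
import traceback

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("zpix")

def log_uncaught(exc_type, exc_value, exc_tb):
    """Write any uncaught exception to the log before the interpreter exits."""
    log.critical("Uncaught exception:\n%s",
                 "".join(traceback.format_exception(exc_type, exc_value, exc_tb)))

# Route uncaught exceptions through the logger instead of bare stderr,
# so batch-job logs capture the traceback.
sys.excepthook = log_uncaught

# Dump Python tracebacks on hard faults (segfault, SIGABRT). This cannot
# help if the OOM killer sends SIGKILL, which is uncatchable.
faulthandler.enable(file=sys.stderr)
```

If the jobs are dying to SIGKILL, only external evidence (e.g. `sacct` memory-high-water marks or node dmesg) will show it, which would explain the empty logs.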