desihub / desispec

DESI spectral pipeline
BSD 3-Clause "New" or "Revised" License

workflow.schedule.Schedule can hang instead of exit if MPI exceptions are raised #2209

Open sbailey opened 3 months ago

sbailey commented 3 months ago

workflow.schedule.Schedule used by desi_proc -> specex hangs if the MPI communicator is larger than needed. In the case of specex processing one camera, it needs a communicator of size 21 = 20 bundles + 1 coordinator rank.

Currently this command fails on some bundles, reports that, and exits properly:

srun -n 21 -c 8 desi_proc --mpi -n 20230428 -e 177975 --cameras b8

But if you give it too many ranks, it reports the error and then the ranks that didn't participate in the PSF fitting get stuck at some barrier that the failed ranks never reach:

srun -n 22 -c 8 desi_proc --mpi -n 20230428 -e 177975 --cameras b8
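For the record, here is a minimal toy model of the failure mode described above. It is not taken from the desispec code and does not use MPI: threads stand in for ranks, `threading.Barrier` stands in for an MPI barrier, and a timeout stands in for the hang. The function name `simulate`, the choice of rank 0 as coordinator, and the assumption that all participating ranks take an early error exit are hypothetical.

```python
import threading


def simulate(nranks, nbundles, failing_bundles, timeout=0.2):
    """Toy model: rank 0 coordinates, ranks 1..nbundles work, higher ranks idle.

    Returns a dict mapping rank -> "ok", "error exit", or "stuck".
    """
    barrier = threading.Barrier(nranks)
    status = {}
    any_failure = len(failing_bundles) > 0

    def rank_main(rank):
        participating = rank <= nbundles
        if participating and any_failure:
            # Assumed error path: participating ranks report the failure
            # and return without ever reaching the final barrier.
            status[rank] = "error exit"
            return
        try:
            # Idle ranks know nothing about the failure and wait here.
            barrier.wait(timeout=timeout)
            status[rank] = "ok"
        except threading.BrokenBarrierError:
            # In real MPI there is no timeout; this rank would hang forever.
            status[rank] = "stuck"

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(nranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status


exact = simulate(nranks=21, nbundles=20, failing_bundles={3})
extra = simulate(nranks=22, nbundles=20, failing_bundles={3})
print("stuck ranks with 21:", [r for r, s in exact.items() if s == "stuck"])
print("stuck ranks with 22:", [r for r, s in extra.items() if s == "stuck"])
```

In this model the 21-rank case has no idle ranks, so nobody is left waiting, while in the 22-rank case the one extra rank waits at a barrier that no other rank ever reaches.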

I.e., if you call the scheduler with the "right" number of ranks, it works. It would be better if you could call it with any number of ranks that is enough for the problem, with the extra ranks not causing the scheduler to get stuck even if there is an error in the processing.
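One possible way to get that behavior, sketched here as an assumption rather than a proposed patch, is to decide up front which ranks the scheduler actually needs and keep the rest out of its collectives. The helper `assign_roles` and the convention that rank 0 is the coordinator are hypothetical and not taken from desispec. With mpi4py, the resulting role could be used as the `color` argument to `comm.Split` so that the scheduler only ever sees a communicator of the needed size; whether that fits how `Schedule` is structured is untested.

```python
def assign_roles(comm_size, nworkers):
    """Label each rank as "coordinator", "worker", or "idle".

    Assumes one coordinator (rank 0) plus one rank per worker,
    e.g. 20 bundles + 1 coordinator = 21 ranks for one specex camera.
    """
    needed = nworkers + 1
    if comm_size < needed:
        raise ValueError(f"need at least {needed} ranks, got {comm_size}")

    roles = []
    for rank in range(comm_size):
        if rank == 0:
            roles.append("coordinator")
        elif rank < needed:
            roles.append("worker")
        else:
            # Extra ranks would skip the scheduler entirely, so they
            # cannot be left waiting at one of its barriers.
            roles.append("idle")
    return roles


roles = assign_roles(comm_size=22, nworkers=20)
print(roles.count("worker"), "workers;", roles.count("idle"), "idle")
```

The idle ranks would still need to learn the final exit status through some communication that every rank is guaranteed to reach, so that the job as a whole exits instead of hanging.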

Note: this ticket is about the scheduler itself; the underlying PSF problems are in ticket #2202. After that ticket is addressed, the above example commands might start succeeding. I don't have time to put together a toy reproducer right now, but I'm reporting it for the record.