workflow.schedule.Schedule, used by desi_proc -> specex, hangs if the MPI communicator is larger than needed. When specex processes one camera, it needs a communicator of size 21 = 20 bundles + 1 coordinator rank.
Currently this command fails on some bundles, reports the failures, and exits properly:
But if you give it too many ranks, it still reports the error, and then the ranks that didn't participate in the PSF fitting hang at a barrier that the failed ranks never reach:
i.e. if you call the scheduler with the "right" number of ranks, it works, but it would be better if you could call it with any number of ranks, as long as there were enough for the problem, and the extra ranks would not cause the scheduler to get stuck even when there is an error in the processing.
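The hang pattern above can be reproduced without MPI. This is a minimal sketch using Python threads as stand-in "ranks" (it is an analogy, not the actual Schedule code): if some participants error out before reaching a collective barrier, the remaining participants wait forever unless the barrier can be broken. A timeout is one way a scheduler could detect the missing participants and abort cleanly instead of hanging.

```python
import threading

def run_ranks(n_ranks, n_failing, timeout=1.0):
    """Simulate n_ranks participants where the first n_failing of them
    error out before reaching the barrier. With a timeout on the wait,
    the surviving ranks see a BrokenBarrierError and abort cleanly
    instead of blocking forever (the analogue of the Schedule hang)."""
    barrier = threading.Barrier(n_ranks)
    outcomes = [None] * n_ranks

    def rank(i):
        if i < n_failing:
            # This rank hits an error and never reaches the barrier,
            # like a failed PSF-fitting bundle.
            outcomes[i] = "failed"
            return
        try:
            barrier.wait(timeout=timeout)
            outcomes[i] = "ok"
        except threading.BrokenBarrierError:
            # The barrier timed out because some ranks never arrived.
            outcomes[i] = "aborted"

    threads = [threading.Thread(target=rank, args=(i,)) for i in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes

# One rank fails before the barrier; the other three detect the
# broken barrier after the timeout rather than hanging.
print(run_ranks(4, 1, timeout=0.5))
```

In real MPI code the analogous fix is less direct, since a plain MPI barrier has no timeout; the usual options are to have only the needed 21 ranks enter collectives (e.g. by splitting the communicator) or to propagate the failure to all ranks before any barrier is reached.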
Note: this ticket is about the scheduler itself; the underlying PSF problems are tracked in ticket #2202. Once that ticket is addressed, the example commands above might start succeeding. I don't have time to put together a toy reproducer right now, but I'm reporting this for the record.