desihub / desispec

DESI spectral pipeline
BSD 3-Clause "New" or "Revised" License

stdouterr_redirected recovery from mid-stream crash #2269

Closed sbailey closed 2 months ago

sbailey commented 4 months ago

desispec.parallel.stdouterr_redirected uses desispec.io.util.backup_filename to make a backup instead of clobbering the original, e.g. redrock-blah.log -> redrock-blah.log.1.

However, if the code crashes due to an MPI abort in the middle of a redirect, it leaves behind per-rank files like redrock-blah.log_0, redrock-blah.log_1, redrock-blah.log_2, etc., because it never gets a chance to merge the per-rank files into the final logfile in the try/finally block (L465). If the code is then re-run, those per-rank files get overwritten, losing the log info from the crash.

I think what we want is for stdouterr_redirected to auto-detect that per-rank files exist, merge those into a single log, and back that up before proceeding with writing new per-rank files.
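A rough sketch of what that recovery step could look like, not the actual desispec implementation: `recover_rank_logs` and `next_backup_name` are hypothetical names, `next_backup_name` is only a stand-in for desispec.io.util.backup_filename (whose exact behavior is not reproduced here), and the `=== rank N ===` header is an arbitrary choice for illustration.

```python
import glob
import os
import re
import shutil


def next_backup_name(filename):
    """Hypothetical stand-in for a backup-name helper: first unused filename.N."""
    n = 1
    while os.path.exists(f"{filename}.{n}"):
        n += 1
    return f"{filename}.{n}"


def recover_rank_logs(logfile):
    """Merge leftover logfile_<rank> files into a backup log.

    Returns the backup path, or None if no per-rank files were found.
    """
    # Only accept suffixes that are pure integers, e.g. blah.log_0, blah.log_12
    pattern = re.compile(re.escape(logfile) + r"_(\d+)$")
    rankfiles = []
    for path in glob.glob(glob.escape(logfile) + "_*"):
        match = pattern.match(path)
        if match:
            rankfiles.append((int(match.group(1)), path))

    if not rankfiles:
        return None

    # Write straight to a new backup name so an existing logfile is untouched
    backup = next_backup_name(logfile)
    with open(backup, "wb") as out:
        for rank, path in sorted(rankfiles):  # numeric rank order
            out.write(f"=== rank {rank} ===\n".encode())
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)

    # Remove the per-rank files only after the merged copy is written
    for _, path in rankfiles:
        os.remove(path)

    return backup
```

In an MPI run, presumably only one rank would call this, with the other ranks waiting at a barrier before opening their new per-rank files; that coordination is left out of the sketch.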