`desispec.parallel.stdouterr_redirected` uses `desispec.io.util.backup_filename` to make a backup instead of clobbering the original, e.g. `redrock-blah.log` -> `redrock-blah.log.1`.
However, if the code crashes due to an MPI abort in the middle of a redirect, it leaves behind a bunch of per-rank files like `redrock-blah.log_0`, `redrock-blah.log_1`, `redrock-blah.log_2`, etc., because it never gets a chance to merge the per-rank files into the final logfile in the try/finally block (L465). Then if the code is re-run, those files get overwritten, losing the log info from the crash.
I think what we want is for `stdouterr_redirected` to auto-detect that per-rank files exist, merge those into a single log, and back that up before proceeding with writing new per-rank files.
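
A rough sketch of what that recovery step could look like, using only the standard library. This is not the current desispec code: `find_rank_logs`, `next_backup_name`, and `recover_stale_rank_logs` are hypothetical names, `next_backup_name` is only a stand-in for whatever `desispec.io.util.backup_filename` really does, and the per-rank header line is an arbitrary format picked for illustration.

```python
import glob
import os
import re


def find_rank_logs(logfile):
    """Return [(rank, path), ...] for leftover <logfile>_<rank> files, sorted by rank."""
    pattern = re.compile(re.escape(logfile) + r"_(\d+)$")
    found = []
    for path in glob.glob(glob.escape(logfile) + "_*"):
        match = pattern.match(path)
        if match:  # skip files whose suffix is not a plain integer
            found.append((int(match.group(1)), path))
    return sorted(found)


def next_backup_name(filename):
    """Hypothetical stand-in for backup_filename: first unused <filename>.N."""
    n = 1
    while os.path.exists(f"{filename}.{n}"):
        n += 1
    return f"{filename}.{n}"


def recover_stale_rank_logs(logfile):
    """Merge leftover per-rank logs into a new backup file.

    Returns the backup path, or None if no per-rank files were found.
    """
    rank_logs = find_rank_logs(logfile)
    if not rank_logs:
        return None

    backup = next_backup_name(logfile)
    with open(backup, "w") as out:
        for rank, path in rank_logs:
            out.write(f"==== rank {rank} ====\n")  # illustrative header only
            with open(path) as f:
                text = f.read()
            out.write(text)
            if text and not text.endswith("\n"):
                out.write("\n")

    # Only delete the per-rank files once the merged copy is fully written.
    for _, path in rank_logs:
        os.remove(path)
    return backup
```

Writing the merged output straight to a backup name, rather than to `redrock-blah.log` itself, avoids clobbering a complete log from an earlier successful run. In an MPI job this would presumably need to run on a single rank, followed by a barrier, before any rank opens its new per-rank file.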