Open marcelo-alvarez opened 1 year ago
This will work in the sense that things processed a long time ago are ideally dealt with. However, if we decided to reprocess a tile in daily from 2 years ago, those will correctly depend on calibration jobid's from 2 years ago. And if those calibrations failed for some reason, it is useful to know that.
This timelimit is basically the same thing as saying "Anything older than XX days/weeks must have been figured out by now." That is basically always true, but is fragile and depends on humans rather than a robust job scheduler.
@akremin I agree with your point and also don't consider it a desirable feature for jobids to get recycled in this way.
A subtlety: currently we remove jobids that queue_info_from_qids
reports as COMPLETED. If we add an sacct --starttime
option, then queue_info_from_qids
won't find old dependencies and won't include them in the returned table and they wouldn't be removed, which would then trigger the original bug of slurm not allowing us to set dependencies on already completed jobs. OK, so we could up change the logic to "remove completed job or ones that aren't found by sacct ...". BUT — in the case that triggered NERSC INC0203024, it appears that sacct also wasn't finding a stdstar job that had been submitted immediately before (it seemed to need a second or so to catch up and have it appear in an sacct query). In that case we did need to include it for the current poststdstar job that we were trying to submit. Ugh.
It's possible that in the INC0203024 case it did know about the latest job, but was returning it in a different order than 1 second later. Maybe. Currently this is a major headache on Cori where job submissions regularly trip on this, but we haven't seen it on Perlmutter yet...
The fix implemented in #2037 uses
sacct
to remove the job dependency flag for jobs that are complete. However, if a given jobid was used previously for an unrelated job (i.e. a "recycled jobid"), then just runningsacct -j jobid ...
may return misleading information based on the status of the old, unrelated, job that had the same jobid.One approach, suggested by @dmargala on NERSC INC0203024, could be to limit the time window searched by setting
---starttime
to exclude unrelated jobs with the a given jobid. Under the assumption that jobids are only recycled after several years, setting--starttime
to a year before runtime should work @sbailey.