desihub / desispec

DESI spectral pipeline
BSD 3-Clause "New" or "Revised" License

Avoid irrelevant information from recycled jobids in sacct #2049

Open marcelo-alvarez opened 1 year ago

marcelo-alvarez commented 1 year ago

The fix implemented in #2037 uses sacct to remove the job dependency flag for jobs that are complete. However, if a given jobid was used previously for an unrelated job (i.e. a "recycled jobid"), then just running `sacct -j jobid ...` may return misleading information based on the status of the old, unrelated job that had the same jobid.

One approach, suggested by @dmargala on NERSC INC0203024, could be to limit the time window searched by setting `--starttime` to exclude unrelated jobs with a given jobid. Under the assumption that jobids are only recycled after several years, setting `--starttime` to a year before runtime should work @sbailey.
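
For illustration, a minimal sketch of what such a windowed sacct query could look like. This is not the desispec implementation; the wrapper name, field list, and one-year lookback are assumptions.

```python
# Illustrative sketch: restrict an sacct query to roughly the last year so
# that recycled jobids from old, unrelated jobs are excluded.
import subprocess
from datetime import datetime, timedelta

def sacct_recent(jobids, lookback_days=365):
    """Query sacct for the given jobids, ignoring anything older than lookback_days."""
    starttime = (datetime.now() - timedelta(days=lookback_days)).strftime("%Y-%m-%d")
    cmd = [
        "sacct",
        "-X",                      # one line per job allocation (no job steps)
        "--parsable2",             # pipe-delimited output
        "--format=JobID,JobName,State,Submit",
        f"--starttime={starttime}",
        "-j", ",".join(str(j) for j in jobids),
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    lines = out.strip().splitlines()
    if not lines:
        return []
    keys = lines[0].split("|")
    return [dict(zip(keys, row.split("|"))) for row in lines[1:]]
```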

akremin commented 1 year ago

This will work in the sense that things processed a long time ago have ideally been dealt with by now. However, if we decide to reprocess a tile in daily from 2 years ago, it will correctly depend on calibration jobids from 2 years ago, and if those calibrations failed for some reason, it is useful to know that.

This time limit is basically the same thing as saying "Anything older than XX days/weeks must have been figured out by now." That is basically always true, but it is fragile and depends on humans rather than a robust job scheduler.

marcelo-alvarez commented 1 year ago

@akremin I agree with your point and also don't consider it a desirable feature for jobids to get recycled in this way.

sbailey commented 1 year ago

A subtlety: currently we remove jobids that queue_info_from_qids reports as COMPLETED. If we add an sacct `--starttime` option, then queue_info_from_qids won't find old dependencies, won't include them in the returned table, and therefore won't remove them, which would re-trigger the original bug of Slurm not allowing us to set dependencies on already completed jobs. OK, so we could change the logic to "remove completed jobs or ones that aren't found by sacct".

BUT: in the case that triggered NERSC INC0203024, it appears that sacct also wasn't finding a stdstar job that had been submitted immediately before (it seemed to need a second or so to catch up before the job appeared in an sacct query). In that case we did need to include it as a dependency for the poststdstar job we were trying to submit. Ugh.
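
To make the competing requirements concrete, here is an illustrative sketch of the "remove completed jobs or ones that aren't found by sacct" rule. The helper and the `queue_info` mapping (standing in for whatever queue_info_from_qids returns, keyed by jobid) are assumptions, not the actual desispec logic.

```python
# Hypothetical sketch of the proposed dependency-pruning rule.
def filter_dependencies(dep_jobids, queue_info):
    """Keep only the dependencies that Slurm will still accept.

    Drop jobids that sacct reports as COMPLETED, and also jobids that sacct
    does not report at all (e.g. old jobs excluded by --starttime); keeping
    either would make Slurm reject the new submission.
    """
    keep = []
    for jobid in dep_jobids:
        info = queue_info.get(jobid)
        if info is None:
            # Not found by sacct: assume it finished long ago and drop it.
            continue
        if info["State"].startswith("COMPLETED"):
            continue
        keep.append(jobid)
    return keep
```

This is exactly where the INC0203024 case bites: a job submitted a second earlier that sacct has not indexed yet would also be dropped, even though Slurm does need it as a dependency.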

It's possible that in the INC0203024 case sacct did know about the latest job but was returning it in a different order than it did a second later. Maybe. Currently this is a major headache on Cori, where job submissions regularly trip on this, but we haven't seen it on Perlmutter yet...
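
One possible mitigation for the catch-up lag, sketched only as an idea rather than anything adopted in this thread: re-query sacct a few times before concluding that a jobid really is absent. `sacct_recent` is the illustrative helper from the earlier sketch; the retry count and pause are arbitrary.

```python
# Hypothetical retry loop for the case where a just-submitted job has not yet
# appeared in sacct output.
import time

def wait_for_jobids(jobids, retries=3, pause=2.0):
    """Return sacct rows for jobids, retrying briefly if some are not yet visible."""
    rows = []
    for _ in range(retries):
        rows = sacct_recent(jobids)
        seen = {row["JobID"].split(".")[0] for row in rows}
        if all(str(j) in seen for j in jobids):
            return rows
        time.sleep(pause)
    # Give up and return whatever sacct knows about; the caller must decide
    # how to treat the jobids that are still missing.
    return rows
```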