dmwm / CRABServer


StatusTracking should handle tail jobs in automatic splitting #8794

Closed belforte closed 1 week ago

belforte commented 1 week ago

If the processing step fails but the task completes via the tail jobs, the ST script does not notice: it keeps seeing a failed job in the status summary and tries to resubmit. So the test is stuck forever in "testResubmitted" even if the task is OK.

belforte commented 1 week ago

This was triggered by https://cmsweb-testbed.cern.ch/crabserver/ui/task/241113_203248%3Acrabint1_crab_20241113_213248 : even after I made it complete successfully, the CI pipeline test was still failing.

belforte commented 1 week ago

At first sight, one problem is here: https://github.com/dmwm/CRABServer/blob/6a3754dd8e4478e1dd10a8e9bd1c74918881248d/cicd/gitlab/st/statusTracking.py#L39-L42 . That code also removes tail jobs, which is not what the comment says.
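The linked lines are not reproduced in this thread, so the following is only a sketch of the distinction at stake, not the script's actual code. It assumes, based on the job IDs in this task's status output, that probe IDs start with `0-`, while tail IDs also contain a dash but start with a non-zero stage number (e.g. `1-1`). The helper names `is_probe` and `drop_probes` are hypothetical.

```python
def is_probe(job_id):
    """Assumption: probe jobs have IDs of the form '0-N'."""
    return job_id.startswith("0-")


def drop_probes(job_list):
    """Remove only probe jobs; processing and tail jobs are kept."""
    return [[status, job_id] for status, job_id in job_list
            if not is_probe(job_id)]


job_list = [["finished", "0-5"], ["failed", "0-4"],
            ["failed", "1"], ["finished", "1-1"]]
print(drop_probes(job_list))  # [['failed', '1'], ['finished', '1-1']]
```

By contrast, a test such as `"-" in job_id` would drop `1-1` along with the probes.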

belforte commented 1 week ago

But the real problem is that this script's test is based on job counting. In this case there are 5 probes (one failed), 1 processing job (failed), and 1 tail job (OK). Namely:

(Pdb) status_command_output['jobsPerStatus']
{'finished': 5, 'failed': 2}
(Pdb) status_command_output['jobList']
[['finished', '0-5'], ['finished', '0-3'], ['failed', '0-4'], ['finished', '0-1'], ['finished', '0-2'], ['failed', '1'], ['finished', '1-1']]
(Pdb) 

We need some smarter logic to tell that "yes, one job failed, but the tail stage took care of it". Of course we can't gloss over failed tails the way we do for probes. And if we ever run a larger task with automatic splitting, we will also face the problem that there can be multiple tail stages and the number of jobs is not defined in advance.
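One possible shape for such logic, as a rough sketch only: classify each job by its ID format and judge the task per stage instead of by raw counts. The ID convention (`0-N` probe, `N` processing, `M-N` with `M >= 1` tail) is inferred from the output above, and the decision rule is an assumption, not something agreed in this thread. The function names `classify` and `task_succeeded` are hypothetical.

```python
def classify(job_id):
    """Return 'probe', 'processing' or 'tail' from the (assumed) ID format."""
    if "-" not in job_id:
        return "processing"
    stage = job_id.split("-", 1)[0]
    return "probe" if stage == "0" else "tail"


def task_succeeded(job_list):
    """Decide success per stage rather than by counting failed jobs.

    Assumed rule:
      - probe results are ignored;
      - any tail job that is not 'finished' means no success (yet);
      - a failed processing job is tolerated only if tail jobs exist
        and all of them finished.
    """
    statuses = {"probe": [], "processing": [], "tail": []}
    for status, job_id in job_list:
        statuses[classify(job_id)].append(status)

    processing, tails = statuses["processing"], statuses["tail"]
    if not processing:
        return False  # nothing beyond probes has run
    if any(s != "finished" for s in tails):
        return False  # failed or unfinished tails are never glossed over
    if all(s == "finished" for s in processing):
        return True
    return len(tails) > 0  # processing failures, but the tails completed


job_list = [["finished", "0-5"], ["finished", "0-3"], ["failed", "0-4"],
            ["finished", "0-1"], ["finished", "0-2"], ["failed", "1"],
            ["finished", "1-1"]]
print(task_succeeded(job_list))  # True
```

This does not check that the tail jobs actually cover the input of the failed processing job, and it treats a failed tail as a failure even if a later tail stage recovered it, so it errs on the strict side. It does not depend on the number of tail stages or tail jobs.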

belforte commented 1 week ago

maybe "as simple as"