Probe #4 failed, but that is "normal" and usually does not result in an error. The main processing job was also submitted, in spite of the error message, but it failed and the tail jobs were not submitted.
It looks like the problem is running the tail step without information on the main (processing) one. The file automatic_splitting/processed only has info from the probe jobs:
>>> import pickle
>>> f = open('automatic_splitting/processed', 'rb')
>>> r = pickle.load(f)
>>> r
{'0-1', '0-3', '0-4', '0-5', '0-2'}
>>>
which may have something to do with these odd log lines:
Wed, 13 Nov 2024 21:59:30 CET(+0100):INFO:PreDAG found 1 completed jobs
Wed, 13 Nov 2024 21:59:30 CET(+0100):INFO:PreDAG jobs remaining to process: 1
Wed, 13 Nov 2024 21:59:30 CET(+0100):INFO:PreDAG found 5 completed jobs
Wed, 13 Nov 2024 21:59:30 CET(+0100):INFO:PreDAG jobs remaining to process: 1
The last lines of dag_bootstrap.out:
Entering TaskManagerBootstrap with args: ['/data/srv/glidecondor/condor_local/spool/6837/0/cluster9976837.proc0.subproc0/TaskWorker/TaskManagerBootstrap.py', 'PREDAG', 'tail', '1', '1']
Wed, 13 Nov 2024 21:59:30 CET(+0100):DEBUG:PreDAG Acquiring PreDAG lock
Wed, 13 Nov 2024 21:59:30 CET(+0100):DEBUG:PreDAG PreDAGlock acquired
After some debugging I believe that line 123 in https://github.com/dmwm/CRABServer/blob/3b2e705c8c5458135c57dfe88d1107aea075c26f/src/python/TaskWorker/Actions/PreDAG.py#L112-L127 is wrong; it should instead be
if state == 'failed' and not processFailed:
That would make the error in this issue go away. But since things are usually working, there may be something more to it.
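Just to spell out what that one-line change does, here is a toy check (nothing CRAB-specific; only the two boolean expressions mirror the current and the proposed condition, everything else is made up):

def skipped_current(processFailed, state='failed'):
    # the test as it reads today on PreDAG.py line 123
    return state == 'failed' and processFailed

def skipped_proposed(processFailed, state='failed'):
    # the test as proposed above
    return state == 'failed' and not processFailed

for pf in (True, False):
    print(pf, skipped_current(pf), skipped_proposed(pf))
# True True False    -> today, processFailed=True means "skip failed jobs"
# False False True   -> with the proposal, processFailed=False would mean "skip failed jobs"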
I believe this problem only happens when all processing jobs fail (in our test there is only one such job, so the chances of this increase!) and some probe failed.
These lines tell PreDAG to use the probe jobs for the new splitting estimate, since the processing jobs failed:
https://github.com/dmwm/CRABServer/blob/3b2e705c8c5458135c57dfe88d1107aea075c26f/src/python/TaskWorker/Actions/PreDAG.py#L200-L201
but, for reasons obscure to me, processFailed is set to False, which makes failed jobs NOT skipped (!!):
https://github.com/dmwm/CRABServer/blob/3b2e705c8c5458135c57dfe88d1107aea075c26f/src/python/TaskWorker/Actions/PreDAG.py#L123-L124
while "normally" the default processFailed=True is used:
https://github.com/dmwm/CRABServer/blob/3b2e705c8c5458135c57dfe88d1107aea075c26f/src/python/TaskWorker/Actions/PreDAG.py#L174
The variable naming is certainly confusing! Also confusing is the fact that the completedJobs() method looks for completed jobs in the stage preceding the one indicated as its argument:
https://github.com/dmwm/CRABServer/blob/3b2e705c8c5458135c57dfe88d1107aea075c26f/src/python/TaskWorker/Actions/PreDAG.py#L117-L118
At this point I have no idea why the processFailed argument was introduced. It seems like the original implementer wanted to say: "if all jobs in this stage failed, go back to the previous stage to generate splitting parameters, but also take into account the failed jobs which were skipped before." But failed jobs have no throughput report, so they cannot be used!
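To make the interplay of the two call paths concrete, here is a toy reimplementation (the job ids, states, stage mapping, and function below are invented for illustration; only the skip test is meant to mirror the current condition at L123-L124). It reproduces the situation of this issue: one failed probe and a single, failed, processing job.

# Toy model of completedJobs(): which jobs feed the splitting estimate?
# All names and data here are invented; only the skip test mirrors the
# current condition described above.
JOBS = {
    'probe':      {'0-1': 'finished', '0-2': 'finished', '0-3': 'finished',
                   '0-4': 'failed',   '0-5': 'finished'},
    'processing': {'1': 'failed'},   # our test task has a single processing job
}
# completedJobs(stage=X) looks at the stage *preceding* X
PRECEDING = {'processing': 'probe', 'tail': 'processing'}

def completed_jobs(stage, processFailed=True):
    for jobid, state in JOBS[PRECEDING[stage]].items():
        if state == 'failed' and processFailed:   # current code: True => skip failed jobs
            continue
        yield jobid

# "normal" call for the tail stage (default processFailed=True, as in PreDAG.py#L174):
print(sorted(completed_jobs('tail')))
# []  -> every processing job failed, so the fallback below is taken
# fallback when all processing jobs failed (processFailed=False, as in PreDAG.py#L200-L201):
print(sorted(completed_jobs('processing', processFailed=False)))
# ['0-1', '0-2', '0-3', '0-4', '0-5']  -> the failed probe 0-4 is included,
# even though (per the above) it has no throughput report to base an estimate on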
I made that task's DAG complete successfully by rerunning PreDAG manually after changing https://github.com/dmwm/CRABServer/blob/3b2e705c8c5458135c57dfe88d1107aea075c26f/src/python/TaskWorker/Actions/PreDAG.py#L201 to
estimates = set(self.completedJobs(stage='processing', processFailed=True))
which basically forces submission of a tail job with the same configuration as the processing one (OK in this case, since the failure was an accidental 8028).
But I am still worried that making this change in the code for everybody may trigger problems in different situations which I cannot imagine or test now.
Maybe there are situations where processing jobs fail but still produce a report, e.g. if they hit the time limit? https://github.com/dmwm/CRABServer/blob/3b2e705c8c5458135c57dfe88d1107aea075c26f/scripts/TweakPSet.py#L209-L212
Or would they count as successful in that case?
The processFailed argument was introduced in https://github.com/dmwm/CRABServer/commit/90c08419c59254464a10699415aba61419cb244e with no comments and no associated issue.
I am still unsure what to do.
Some (but not all) probe jobs failing while all processing jobs fail is, all in all, a very rare case. Also, fixing this does not help the CI pipeline to complete, because the current code still sees one failed job (the processing one) and keeps trying to resubmit. We will have to fix the pipeline first to properly deal with automatic splitting: https://github.com/dmwm/CRABServer/issues/8794
I have prepared a PR with that fix, but I need to think more about possible side effects.
I found this while looking at a stuck automatic-splitting task in the CI pipeline: https://cmsweb-testbed.cern.ch/crabserver/ui/task/241113_203248%3Acrabint1_crab_20241113_213248