Closed novicecpp closed 1 month ago
Puzzling. It should be extremely rare for AdjustSites.py to run within a few seconds of submission. I have seen this only rarely in user reports, usually in situations where the upload to REST fails and needs to be retried. A small delay like the one you suggest would not fix that.
Let me review the original motivation for this in #6151 and #6145
At first sight the only really safe way is (as usual) the one Brian suggested in #6151, but it complicates TW a lot.
OTOH, we have already discussed introducing new task statuses: WAITING between NEW and HOLDING (which means "grabbed by TW"), and COMPLETED/FAILED to signal that DAGMAN stopped (#8394).
So maybe adding ACTIVE (or call it RUNNING) is not too bad.
I propose adding a sleep of 30 seconds to 1 minute to avoid the "next condor schedd cycle happens instantly as soon as the DAG is submitted" case, and re-evaluating the above proposal as a better alternative to the current check in AdjustSites.
There is a race condition where AdjustSites.py#L393 is executed before TW can set the task status to SUBMITTED.
For example, take the task 240514_130645:crabint1_crab_rucio_transfers_20240514_150644. The task status message reported a bootstrap failure:
Looking at adjust_out.txt (atvocms059:/data/srv/glidecondor/condor_local/spool/4987/0/cluster9594987.proc0.subproc0/adjust_out.txt), the logs reported:
The task log from TW (crab-dev-tw03:/data/container/TaskWorker/logs/tasks/crabint1/240514_130645:crabint1_crab_rucio_transfers_20240514_150644.log):
Looking at the timestamps, there is a 3-second gap between the DAG being submitted and the task status update being sent to REST, and the timestamp of the AdjustSites.py error falls within this gap.
I never hit this when submitting through Jenkins or manually, but it happened occasionally when running the GitLab CI pipeline. It was very frustrating.
Should we add a retry in AdjustSites, or simply delay the bootstrap by 1 minute?
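To make the retry option concrete, here is a minimal sketch of what a bounded retry around the status check could look like. It is not the actual AdjustSites.py code: `wait_for_status`, its parameters, and the `get_status` callable are all hypothetical names, and the real check would need to query the task status from REST in whatever way AdjustSites already does.

```python
import time


def wait_for_status(get_status, expected="SUBMITTED", retries=5, delay=10.0,
                    sleep=time.sleep):
    """Poll get_status() until it returns `expected` or the attempts run out.

    get_status is a hypothetical zero-argument callable returning the current
    task status string. Returns True if the expected status was seen,
    False otherwise.
    """
    for attempt in range(retries):
        if get_status() == expected:
            return True
        # Do not sleep after the last attempt: the caller should fail right away.
        if attempt < retries - 1:
            sleep(delay)
    return False
```

With these illustrative defaults the bootstrap would wait at most 40 seconds (four sleeps of 10 seconds) before giving up, which covers the 3-second gap seen above without delaying tasks whose status is already SUBMITTED on the first check. A fixed 1-minute sleep would be simpler, but it slows every task down and, as noted in the comments, still fails when the REST update needs longer than the delay.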