dmwm / CRABServer


AdjustSites.py race condition #8411

Closed by novicecpp 1 month ago

novicecpp commented 1 month ago

There is a race condition where AdjustSites.py#L393 is executed before TW can set the task status to SUBMITTED.

For example, with task 240514_130645:crabint1_crab_rucio_transfers_20240514_150644, the task status message reported a bootstrap failure:

The task failed to bootstrap on the Grid scheduler. Crab resubmit will not work. Please send an e-mail to cmstalk+computing-tools@dovecotmta.cern.ch.
Hold reason: 'This dagman does not match task information in TASKS DB'

Looking at adjust_out.txt at vocms059:/data/srv/glidecondor/condor_local/spool/4987/0/cluster9594987.proc0.subproc0/adjust_out.txt, the log reported:

2024-05-14 13:06:59.196789: Task status on DB: QUEUED, clusterID on DB: None, schedd name on DB: crab3@vocms059.cern.ch; 
clusterID on condor ads: 9594987, schedd name on condor ads: crab3@vocms059.cern.ch 
2024-05-14 13:06:59.196845: Exiting AdjustSites because this dagman does not match task information in TASKS DB

The task log from TW (crab-dev-tw03:/data/container/TaskWorker/logs/tasks/crabint1/240514_130645:crabint1_crab_rucio_transfers_20240514_150644.log):

2024-05-14 13:06:56,473:DEBUG:DagmanSubmitter:Trying to submit task 240514_130645:crabint1_crab_rucio_transfers_20240514_150644 to schedd crab3@vocms059.cern.ch for the 0 time.
2024-05-14 13:06:56,473:DEBUG:DagmanSubmitter:Getting schedd object
2024-05-14 13:06:56,479:DEBUG:DagmanSubmitter:Got schedd object
2024-05-14 13:06:56,511:INFO:DagmanSubmitter:User crabint1 mapped to local groups {'/cms', '/cms/production'}.
2024-05-14 13:06:56,511:DEBUG:DagmanSubmitter:Finally submitting to the schedd
2024-05-14 13:06:57,266:DEBUG:DagmanSubmitter:Condor cluster ID returned from submit is: 9594987
2024-05-14 13:06:57,267:DEBUG:DagmanSubmitter:Submission finished
2024-05-14 13:06:57,267:DEBUG:DagmanSubmitter:Pushing information centrally {'workflow': '240514_130645:crabint1_crab_rucio_transfers_20240514_150644', 'status': 'SUBMITTED', 'subresource': 'success', 'clusterid': 9594987}
2024-05-14 13:07:00,002:INFO:DagmanSubmitter:Summary of schedd failures/successes for slave process 1 (PID 22):
        crab3@vocms059.cern.ch:
                Number of successes: 1
                Number of failures:  0
Last successful submission: ClusterId 9594987 submitted to schedd crab3@vocms059.cern.ch

Looking at the timestamps, there is a 3-second gap between the DAG submission and the task status update to the REST, and the AdjustSites.py error's timestamp falls inside this gap.
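To make the failure mode concrete, here is a minimal sketch of the kind of consistency check that races with TW's upload; the function, field, and ClassAd names below are hypothetical, not the actual AdjustSites.py code:

```python
# Hedged sketch of the racing check (hypothetical names, not the actual
# AdjustSites.py code). The dagman compares its own condor ClassAds
# against the task row in the TASKS DB; TW only writes SUBMITTED and the
# cluster ID *after* condor_submit returns, so a dagman whose first
# schedd cycle runs inside that window still sees QUEUED / None.
def dag_matches_tasks_db(task_row, condor_ads):
    return (task_row['status'] == 'SUBMITTED'
            and task_row['clusterid'] == condor_ads['ClusterId']
            and task_row['schedd'] == condor_ads['ScheddName'])
```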

I have never seen this when submitting with Jenkins or manually, but it happens occasionally when running the GitLab CI pipeline. It is very frustrating.

Should we add a retry in AdjustSites, or simply delay the bootstrap by 1 minute?
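For illustration, a hedged sketch of what the retry option could look like; get_task_from_db is a hypothetical helper, not existing CRABServer code:

```python
import time

# Hedged sketch of the "retry in AdjustSites" option. The DB lookup is
# a hypothetical stand-in; dag_matches_tasks_db is the sketch above.
def check_task_with_retry(workflow, condor_ads, attempts=5, delay=30):
    for _ in range(attempts):
        task_row = get_task_from_db(workflow)  # hypothetical helper
        if dag_matches_tasks_db(task_row, condor_ads):
            return True
        # Give TW time to push 'SUBMITTED' + cluster ID to the REST/DB;
        # the gap observed in this report was only ~3 seconds.
        time.sleep(delay)
    return False
```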

belforte commented 1 month ago

Puzzling. It should be extremely rare for AdjustSites.py to run within a few seconds of submission. I have rarely seen this in user reports, and usually in situations where the upload to the REST fails and needs to be retried. A small delay like you suggest would not fix that.

belforte commented 1 month ago

Let me review the original motivation for this in #6151 and #6145

At first sight the only really safe way is (as usual) the one Brian suggested in #6151, but it complicates TW a lot.

OTOH, we have already discussed introducing new task statuses: WAITING between NEW and HOLDING (which means "grabbed by TW"), and COMPLETED/FAILED to signal that DAGMAN stopped (#8394).

So maybe adding ACTIVE (or call it RUNNING) is not too bad.
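For reference, the lifecycle under discussion could be sketched as an enum; this is a sketch only, combining the existing statuses visible in this thread with the proposed ones, not an agreed design:

```python
from enum import Enum

# Sketch of the task lifecycle under discussion. WAITING, ACTIVE,
# COMPLETED and FAILED are proposals from this thread and #8394;
# the rest are existing statuses visible in the logs above.
class TaskStatus(Enum):
    NEW = 'NEW'
    WAITING = 'WAITING'      # proposed: between NEW and HOLDING
    HOLDING = 'HOLDING'      # grabbed by TW
    QUEUED = 'QUEUED'
    SUBMITTED = 'SUBMITTED'  # TW recorded the DAG's cluster ID in the DB
    ACTIVE = 'ACTIVE'        # proposed (or RUNNING): dagman bootstrapped
    COMPLETED = 'COMPLETED'  # proposed: DAGMAN stopped cleanly
    FAILED = 'FAILED'        # proposed: DAGMAN stopped with errors
```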

I propose adding a 30 sec / 1 min sleep to avoid the "next condor schedd cycle happens instantly as soon as the dag is submitted" case, and re-evaluating the above proposal as a better alternative to the current check in AdjustSites.
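A sketch of where such a stopgap could go; the placement at the start of the bootstrap check and the helper names are hypothetical, and the exact duration is open:

```python
import time

# Hedged sketch of the proposed stopgap (hypothetical placement, not
# committed code): delay the first TASKS DB consistency check so the
# initial schedd cycle cannot outrun TW's status upload. The gap
# observed in this report was only ~3 s, so 30-60 s is generous.
BOOTSTRAP_DELAY = 60  # seconds; 30 sec / 1 min per the proposal above

def delayed_bootstrap_check(workflow, condor_ads):
    time.sleep(BOOTSTRAP_DELAY)
    task_row = get_task_from_db(workflow)  # hypothetical helper
    return dag_matches_tasks_db(task_row, condor_ads)  # sketch above
```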