Findings from looking at code in https://github.com/dmwm/CRABServer/blob/master/scripts/task_process/RUCIO_Transfers.py
- why is the `lfn2pfn` mapping code here instead of using Rucio's `lfns2pfns` ? (a sketch of the Rucio call is at the end of this comment)
- `filetransfersdb` table when inserting new files there
- `task_process/transfers.txt` is opened and read 4 times

bugs found:
- `/` is there already in `publishname`
- also `g.logs_dataset` is defined already
- `task_process/transfers/last_transfer.txt` is updated at the end of `register_replicas` but before info is stored in CRAB's DB `transfersdb` table. Are we protected against a script crash at this point ?
- `{lfn: id}` is a map from LFN to IDs (lfn2id), not vice versa

steps to implement the needed functionality:
after discussion with @dciangot, some code reformatting is needed to implement internal bookkeeping as now sketched in the wiki; that overrides all of the above concepts/questions.
work in progress on this is in https://github.com/belforte/CRABServer/tree/fill-block-info-from-RT-fix-7524
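For reference, a minimal sketch of what using Rucio's own `lfns2pfns` could look like instead of the hand-rolled mapping. The RSE name, scope and LFN below are made-up examples, and the keyword arguments may differ slightly between Rucio client versions:

```python
from rucio.client import Client

rucio = Client()  # picks up the usual rucio.cfg / X509 proxy environment

# Rucio expects LFNs as "scope:name" strings and returns a {lfn: pfn} map.
# Both the RSE and the LFN here are invented for illustration.
rse = "T2_DE_DESY_Temp"
lfns = ["user.jdoe:/store/temp/user/jdoe/output_1.root"]

pfn_map = rucio.lfns2pfns(rse, lfns, operation="write")
for lfn, pfn in pfn_map.items():
    print(lfn, "->", pfn)
```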
Items Stefano and Wa discussed on May 10:
- `T2_DE_DESY`: Rucio expects `<prefix>/store/temp` but `lfn2pfns` returns `<prefix>/temp` (at least it is a valid path).
- `T2_UK_SGrid_Bristol`: register replicas reports no protocol available for `T2_UK_SGrid_Bristol_Temp`, where we get protocol info from `T2_UK_SGrid_Bristol` (a quick check is sketched below).
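To see what is going on for both sites, one can ask Rucio directly which protocols (scheme and prefix) it would use for the temp RSEs. A rough sketch, using the RSE names mentioned above:

```python
from rucio.client import Client

rucio = Client()

for rse in ("T2_DE_DESY_Temp", "T2_UK_SGrid_Bristol_Temp"):
    try:
        # the protocol list tells us which scheme/prefix Rucio will use for this RSE
        for proto in rucio.get_protocols(rse):
            print(rse, proto.get("scheme"), proto.get("prefix"))
    except Exception as exc:  # e.g. RSEProtocolNotSupported
        print(f"{rse}: no usable protocol ({exc})")
```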
Next to-do list for `RUCIO_Transfers.py`:
- `RUCIO_Transfers.py` script when a file checksum is incorrect
- `add_replicas` to try again instead of failing them (it should not fail at this step); see the retry sketch after this list
- `T2_DE_DESY` and `T2_UK_SGrid_Bristol`

Some tasks are moved to https://github.com/dmwm/CRABServer/issues/7632#issue-1707380092
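On the `add_replicas` item, one possible shape for the retry. The file dictionary fields and the retry/backoff values are illustrative, not what the script currently does:

```python
import time
from rucio.client import Client
from rucio.common.exception import RucioException

rucio = Client()

def add_replicas_with_retry(rse, files, attempts=3, pause=30):
    """Try to register replicas a few times before giving up.

    `files` is the usual list of dicts with scope/name/bytes/adler32 (and pfn).
    """
    for attempt in range(1, attempts + 1):
        try:
            rucio.add_replicas(rse=rse, files=files)
            return True
        except RucioException as exc:
            print(f"add_replicas attempt {attempt}/{attempts} failed: {exc}")
            if attempt < attempts:
                time.sleep(pause)
    # leave the files for the next task_process iteration instead of marking them failed
    return False
```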
I cannot find the place in the above lists, but IIRC the action on me was to explain how and when `RUCIO_Transfers.py` stops running. This is controlled in these lines:
https://github.com/dmwm/CRABServer/blob/18af83f510e5ff8579ad30a5073cc97af9c060a9/scripts/task_process/task_proc_wrapper.sh#L36-L51
combined with the `while True` loop at the bottom of the script.
In other words: the task_process will execute `RUCIO_Transfers.py` every 5 minutes until all the PostJobs have completed. If we want to keep waiting and testing, the PostJob has to wait.
How long we wait is a classAd in the job: https://github.com/dmwm/CRABServer/blob/18af83f510e5ff8579ad30a5073cc97af9c060a9/src/python/TaskWorker/Actions/PostJob.py#L2543-L2545
The value of ASOTimeout is set for the task by https://github.com/dmwm/CRABServer/blob/18af83f510e5ff8579ad30a5073cc97af9c060a9/src/python/TaskWorker/Actions/DagmanCreator.py#L471-L475 using values from the TaskWorker configuration.
We currently have a 7-day timeout in PostJob when using Rucio, as per https://gitlab.cern.ch/ai/it-puppet-hostgroup-vocmsglidein/-/blob/master/code/templates/crabtaskworker/taskworker/TaskWorkerConfig.py.erb#L84-87
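To make the flow concrete, the control flow of the wrapper is roughly the following, written here as Python pseudocode rather than the actual bash, with a stubbed `dag_has_completed()` standing in for the real checks in the lines linked above:

```python
import subprocess
import time

def dag_has_completed():
    """Placeholder: the real wrapper inspects the DAG / condor status (see the linked lines)."""
    return True  # stub so this sketch terminates after one pass

while True:
    # each pass runs the transfer script once
    subprocess.run(["python3", "task_process/RUCIO_Transfers.py"], check=False)
    if dag_has_completed():
        break  # all PostJobs are done, nothing left to do
    time.sleep(300)  # otherwise wait 5 minutes and try again
```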
One more thing: on `crab resubmit` the DAG is restarted and a new `task_proc_wrapper.sh` is started (as a condor job in the local universe, `jobuniverse==12`) and it again executes until the DAG completes (or the `task_proc_wrapper.sh` script crashes).

Thanks Stefano. I will write this somewhere in our docs.
One question: how do you want to proceed on my stale PR #7587? Do you want to review it by yourself, or review it together in a zoom chat (one Actions class at a time, 3 Actions classes in total)?
I prefer to have someone cross-check my code before merging it (it will be merged to the feature branch `rucio_transfers` in the `dmwm/CRABServer` repo, but in the end it will get merged to master). Then I can close this issue and move on to https://github.com/dmwm/CRABServer/issues/7632 .
now that the updateRucioInfo API is available in REST https://github.com/dmwm/CRABServer/blob/2cbd932bd64724dcbcca915f21665af1d9b5a08f/src/python/CRABInterface/RESTFileTransfers.py#L185-L198
usage example in https://github.com/belforte/utils/blob/master/notebooks/TestTransfersdb.ipynb

And from https://github.com/dmwm/CRABServer/wiki/ASO-via-Rucio the plan is:
- set `block_complete` to `YES` (a rough call sketch follows)
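For completeness, a very rough sketch of what a call to the new subresource from `RUCIO_Transfers.py` could look like. The endpoint follows the usual CRAB REST `filetransfers` pattern, but the parameter names and values below are guesses; treat the notebook linked above as the real usage example:

```python
import os
import requests

# CRAB REST is authenticated with the task's X509 proxy (assumption for this sketch)
proxy = os.environ.get("X509_USER_PROXY", "/path/to/proxy")

# hypothetical parameter names: the real ones are defined in RESTFileTransfers.py (see link above)
payload = {
    "subresource": "updateRucioInfo",
    "asoworker": "rucio",
    "list_of_ids": ["<transfersdb-doc-id>"],
    "list_of_block_complete": ["YES"],
}

resp = requests.post(
    "https://cmsweb.cern.ch/crabserver/prod/filetransfers",
    data=payload,
    cert=(proxy, proxy),
    verify="/etc/grid-security/certificates",
)
resp.raise_for_status()
```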