Open pdeperio opened 7 years ago
Still happening now, including a third stage, `AddChecksum`. The first try deletes the file and DB entry:
$ HOSTNAME=midway-login1 cax --config /project/lgrandi/xenon1t/cax/cax_AddChecksum_only.json --once --run 14576
14576 midway-login1
root : INFO Using custom config file: /project/lgrandi/xenon1t/cax/cax_AddChecksum_only.json
root : INFO Executing RetryStalledTransfer.
root : INFO Executing RetryBadChecksumTransfer.
RetryBadChecksumTransfer: ERROR Bad checksum 14576, midway-login1, processed
RetryBadChecksumTransfer: ERROR Bad checksum v6.8.0
RetryBadChecksumTransfer: INFO Deleting /project2/lgrandi/xenon1t/processed/pax_v6.8.0/171118_0702.root
RetryBadChecksumTransfer: INFO Removed from run database: /project2/lgrandi/xenon1t/processed/pax_v6.8.0/171118_0702.root
root : INFO Executing CopyPull.
CopyPull : INFO rsync download dataset 171118_0702.root took 0 seconds
root : INFO Executing AddChecksum.
root : INFO Executing SetPermission.
root : INFO Executing ProcessBatchQueueHax.
Second try to get `CopyPull`:
$ HOSTNAME=midway-login1 cax --config /project/lgrandi/xenon1t/cax/cax_AddChecksum_only.json --once --run 14576
14576 midway-login1
root : INFO Using custom config file: /project/lgrandi/xenon1t/cax/cax_AddChecksum_only.json
root : INFO Executing RetryStalledTransfer.
root : INFO Executing RetryBadChecksumTransfer.
root : INFO Executing CopyPull.
CopyPull : INFO downloading run 14576 to: midway-login1
CopyPull : INFO {'creation_time': datetime.datetime(2017, 11, 19, 3, 50, 14, 367000), 'status': 'transferred', 'type': 'processed', 'checksum': 'b678b2f66ac193e6feb58d0a11b4b9bb9bd294ac1ac98dbf4a5a22f05e48f483ff6ab0d098fed64009c6672fab4d74ca77186df3ad746df846a099ca1fe9d0be', 'host': 'login', 'location': '/xenon/xenon1t_processed/pax_v6.8.0/171118_0702.root', 'creation_place': 'OSG', 'pax_version': 'v6.8.0'}
CopyPull : INFO Starting rsync
root : INFO download: login.xenon.ci-connect.net/xenon/xenon1t_processed/pax_v6.8.0/171118_0702.root to /project2/lgrandi/xenon1t/processed/pax_v6.8.0/171118_0702.root
CopyPull : INFO time rsync -r --stats pdeperio@login.xenon.ci-connect.net:/xenon/xenon1t_processed/pax_v6.8.0/171118_0702.root /project2/lgrandi/xenon1t/processed/pax_v6.8.0
root : INFO End of download
CopyPull : INFO rsync download dataset 171118_0702.root took 153 seconds
root : INFO Executing AddChecksum.
root : INFO Executing SetPermission.
And third try to get `AddChecksum`:
$ HOSTNAME=midway-login1 cax --config /project/lgrandi/xenon1t/cax/cax_AddChecksum_only.json --once --run 14576
14576 midway-login1
root : INFO Using custom config file: /project/lgrandi/xenon1t/cax/cax_AddChecksum_only.json
root : INFO Executing RetryStalledTransfer.
root : INFO Executing RetryBadChecksumTransfer.
root : INFO Executing CopyPull.
CopyPull : INFO rsync download dataset 171118_0702.root took 0 seconds
root : INFO Executing AddChecksum.
AddChecksum : INFO Adding a checksum to run 14576 processed
root : INFO Executing SetPermission.
This really slows things down, requiring 3 full cycles of `massive-cax` to actually get through the whole task list. It seems as if the DB entry is not being re-queried for each task, but @tunnell I thought you said it is? Could you help check please?
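To illustrate the suspicion, here is a minimal sketch of how a stale run document could produce exactly this three-pass pattern. It is not cax's actual code: `FakeRunDB`, the three task functions, and `run_once` are hypothetical stand-ins, and the assumption that each task decides what to do from a document fetched once per pass is only a guess at the mechanism.

```python
import copy

HOST = "midway-login1"
RUN = 14576


class FakeRunDB:
    """In-memory stand-in for the run database (hypothetical, not cax's API)."""

    def __init__(self, doc):
        self._docs = {doc["number"]: copy.deepcopy(doc)}

    def find_one(self, number):
        # Return a copy so callers hold a snapshot, like a fetched DB document.
        return copy.deepcopy(self._docs[number])

    def update(self, number, data):
        self._docs[number]["data"] = copy.deepcopy(data)


def retry_bad_checksum(db, doc):
    # Drop the local entry if it is flagged as bad.
    kept = [d for d in doc["data"]
            if not (d["host"] == HOST and d["status"] == "error")]
    if len(kept) != len(doc["data"]):
        db.update(doc["number"], kept)


def copy_pull(db, doc):
    # Download only if the snapshot shows no local copy.
    if not any(d["host"] == HOST for d in doc["data"]):
        new = {"host": HOST, "status": "transferred", "checksum": None}
        db.update(doc["number"], doc["data"] + [new])


def add_checksum(db, doc):
    # Checksum only a transferred local copy that has none yet.
    changed = False
    for d in doc["data"]:
        if d["host"] == HOST and d["status"] == "transferred" and d["checksum"] is None:
            d["checksum"] = "dummy-checksum"
            changed = True
    if changed:
        db.update(doc["number"], doc["data"])


TASKS = [retry_bad_checksum, copy_pull, add_checksum]


def run_once(db, number, requery):
    doc = db.find_one(number)          # fetched once per pass
    for task in TASKS:
        if requery:
            doc = db.find_one(number)  # fresh snapshot before each task
        task(db, doc)


def passes_needed(requery):
    db = FakeRunDB({"number": RUN, "data": [
        {"host": "login", "status": "transferred", "checksum": "abc"},
        {"host": HOST, "status": "error", "checksum": "bad"},
    ]})
    for n in range(1, 10):
        run_once(db, RUN, requery)
        local = [d for d in db.find_one(RUN)["data"] if d["host"] == HOST]
        if local and local[0]["status"] == "transferred" and local[0]["checksum"]:
            return n
    return None


if __name__ == "__main__":
    print("stale document: ", passes_needed(requery=False), "passes")
    print("re-queried each task:", passes_needed(requery=True), "passes")
```

With the stale snapshot the sketch needs three passes (delete, then download, then checksum), matching the logs above; re-querying before each task completes in one. If cax already re-queries per task, the cause lies elsewhere and this sketch does not apply.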
Running with `"task_list": ["RetryStalledTransfer", "CopyPull"]` clears the error successfully but does not actually continue to re-download it. Then running the same command again works.
This slows down transfer recovery by one periodic cycle of `massive-cax`.