XENON1T / cax

Simple data management tool
ISC License

cax skips actual rsync after errored DB entry is cleared #109

Open pdeperio opened 7 years ago

pdeperio commented 7 years ago

Running with "task_list": ["RetryStalledTransfer", "CopyPull"] clears the errored entry successfully, but CopyPull does not then re-download the file in the same pass:

(pax_v6.6.5) [mklinton@midway2-login1 cax_pdp]$  HOSTNAME=midway-login1 cax --once --config cax_CopyPull.json  --log DEBUG  --run 6736
6736 midway-login1
root        : INFO     Using custom config file: cax_CopyPull.json
root        : INFO     Executing RetryStalledTransfer.
RetryStalledTransfer: ERROR    Transfer or process errored, retry.
RetryStalledTransfer: INFO     Deleting /project/lgrandi/xenon1t/processed/pax_v6.6.5/170202_2248.root
RetryStalledTransfer: ERROR    did not exist, notify run database.
RetryStalledTransfer: INFO     Removed from run database: /project/lgrandi/xenon1t/processed/pax_v6.6.5/170202_2248.root
root        : INFO     Executing CopyPull.
CopyPull    : INFO     rsync download dataset 170202_2248.root took 0 seconds

Running the same command a second time works:

(pax_v6.6.5) [mklinton@midway2-login1 cax_pdp]$  HOSTNAME=midway-login1 cax --once --config cax_CopyPull.json  --log DEBUG  --run 6736
6736 midway-login1
root        : INFO     Using custom config file: cax_CopyPull.json
root        : INFO     Executing RetryStalledTransfer.
root        : INFO     Executing CopyPull.
CopyPull    : INFO     downloading run 6736 to: midway-login1
CopyPull    : INFO     {'location': '/xenon/xenon1t_processed/pax_v6.6.5/170202_2248.root', 'creation_place': 'OSG', 'status': 'transferred', 'pax_version': 'v6.6.5', 'checksum': 'e1794299ba08041ebb150bc16cd75179468f9fbebaccc72c3889407f7d49c0cb6d8f8c5bbbef8f5d24ad90a0d2cd16d505df66e06fce2631d90dd4668e259b92', 'creation_time': [datetime.datetime(2017, 5, 26, 21, 5, 35, 407000)], 'host': 'login', 'type': 'processed'}
CopyPull    : INFO     Starting rsync
root        : INFO     download: login.ci-connect.uchicago.edu/xenon/xenon1t_processed/pax_v6.6.5/170202_2248.root to /project/lgrandi/xenon1t/processed/pax_v6.6.5/170202_2248.root
CopyPull    : INFO     time rsync -r --stats pdeperio@login.ci-connect.uchicago.edu:/xenon/xenon1t_processed/pax_v6.6.5/170202_2248.root /project/lgrandi/xenon1t/processed/pax_v6.6.5
CopyPull    : INFO         
Number of files: 1
Number of files transferred: 1
Total file size: 1854776707 bytes

This slows down transfer recovery by one periodic cycle of massive-cax.

pdeperio commented 6 years ago

This is still happening, and now involves a third stage, AddChecksum. The first try deletes the file and the DB entry:

$ HOSTNAME=midway-login1  cax --config /project/lgrandi/xenon1t/cax/cax_AddChecksum_only.json --once --run 14576
14576 midway-login1
root        : INFO     Using custom config file: /project/lgrandi/xenon1t/cax/cax_AddChecksum_only.json
root        : INFO     Executing RetryStalledTransfer.
root        : INFO     Executing RetryBadChecksumTransfer.
RetryBadChecksumTransfer: ERROR    Bad checksum 14576, midway-login1, processed
RetryBadChecksumTransfer: ERROR    Bad checksum v6.8.0
RetryBadChecksumTransfer: INFO     Deleting /project2/lgrandi/xenon1t/processed/pax_v6.8.0/171118_0702.root
RetryBadChecksumTransfer: INFO     Removed from run database: /project2/lgrandi/xenon1t/processed/pax_v6.8.0/171118_0702.root
root        : INFO     Executing CopyPull.
CopyPull    : INFO     rsync download dataset 171118_0702.root took 0 seconds
root        : INFO     Executing AddChecksum.
root        : INFO     Executing SetPermission.
root        : INFO     Executing ProcessBatchQueueHax.

A second try is needed to get CopyPull to run:

$ HOSTNAME=midway-login1  cax --config /project/lgrandi/xenon1t/cax/cax_AddChecksum_only.json --once --run 14576
14576 midway-login1
root        : INFO     Using custom config file: /project/lgrandi/xenon1t/cax/cax_AddChecksum_only.json
root        : INFO     Executing RetryStalledTransfer.
root        : INFO     Executing RetryBadChecksumTransfer.
root        : INFO     Executing CopyPull.
CopyPull    : INFO     downloading run 14576 to: midway-login1
CopyPull    : INFO     {'creation_time': datetime.datetime(2017, 11, 19, 3, 50, 14, 367000), 'status': 'transferred', 'type': 'processed', 'checksum': 'b678b2f66ac193e6feb58d0a11b4b9bb9bd294ac1ac98dbf4a5a22f05e48f483ff6ab0d098fed64009c6672fab4d74ca77186df3ad746df846a099ca1fe9d0be', 'host': 'login', 'location': '/xenon/xenon1t_processed/pax_v6.8.0/171118_0702.root', 'creation_place': 'OSG', 'pax_version': 'v6.8.0'}
CopyPull    : INFO     Starting rsync
root        : INFO     download: login.xenon.ci-connect.net/xenon/xenon1t_processed/pax_v6.8.0/171118_0702.root to /project2/lgrandi/xenon1t/processed/pax_v6.8.0/171118_0702.root
CopyPull    : INFO     time rsync -r --stats pdeperio@login.xenon.ci-connect.net:/xenon/xenon1t_processed/pax_v6.8.0/171118_0702.root /project2/lgrandi/xenon1t/processed/pax_v6.8.0
root        : INFO     End of download

CopyPull    : INFO     rsync download dataset 171118_0702.root took 153 seconds
root        : INFO     Executing AddChecksum.
root        : INFO     Executing SetPermission.

And a third try is needed to get AddChecksum to run:

$ HOSTNAME=midway-login1  cax --config /project/lgrandi/xenon1t/cax/cax_AddChecksum_only.json --once --run 14576
14576 midway-login1
root        : INFO     Using custom config file: /project/lgrandi/xenon1t/cax/cax_AddChecksum_only.json
root        : INFO     Executing RetryStalledTransfer.
root        : INFO     Executing RetryBadChecksumTransfer.
root        : INFO     Executing CopyPull.
CopyPull    : INFO     rsync download dataset 171118_0702.root took 0 seconds
root        : INFO     Executing AddChecksum.
AddChecksum : INFO     Adding a checksum to run 14576 processed
root        : INFO     Executing SetPermission.

This really slows things down: it takes 3 full cycles of massive-cax to get through the whole task list. It seems as if the DB entry is not being re-queried for each task, but @tunnell, I thought you said it is? Could you please help check?
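To make the hypothesis concrete, here is a toy model of the suspected behaviour. It is only a sketch and not cax's actual code: `FakeRunDB`, `retry_errored`, `copy_pull`, `add_checksum`, `run_pass` and the entry layout are all hypothetical stand-ins. Each task decides what to do from the snapshot it is handed and writes its changes to the database. If the snapshot is fetched once per pass, every later task sees stale data, and a chain of three dependent tasks needs three passes.

```python
import copy

LOCAL = "midway-login1"  # host name taken from the logs above


class FakeRunDB:
    """In-memory stand-in for the run database (hypothetical, not cax's API)."""

    def __init__(self, entries):
        self.entries = entries

    def fetch(self):
        # Return a snapshot, so later DB changes are not visible through it.
        return copy.deepcopy(self.entries)


def local_entry(entries):
    return next((e for e in entries if e["host"] == LOCAL), None)


def retry_errored(snapshot, db):
    """Drop the local entry if the snapshot says it errored."""
    entry = local_entry(snapshot)
    if entry and entry["status"] == "error":
        db.entries = [e for e in db.entries if e["host"] != LOCAL]
        return True
    return False


def copy_pull(snapshot, db):
    """'Download' only if the snapshot shows no local entry at all."""
    if local_entry(snapshot) is None:
        db.entries.append({"host": LOCAL, "status": "transferred", "checksum": None})
        return True
    return False


def add_checksum(snapshot, db):
    """Checksum only if the snapshot shows a transferred entry without one."""
    entry = local_entry(snapshot)
    if entry and entry["status"] == "transferred" and entry["checksum"] is None:
        local_entry(db.entries)["checksum"] = "dummy-checksum"
        return True
    return False


def run_pass(db, tasks, requery):
    """One pass over the task list; returns the names of tasks that did work."""
    snapshot = db.fetch()
    did_work = []
    for task in tasks:
        if requery:
            snapshot = db.fetch()  # fresh view of the DB for every task
        if task(snapshot, db):
            did_work.append(task.__name__)
    return did_work


def passes_until_done(requery):
    db = FakeRunDB([
        {"host": "login", "status": "transferred", "checksum": "abc"},
        {"host": LOCAL, "status": "error", "checksum": None},
    ])
    tasks = [retry_errored, copy_pull, add_checksum]
    history = []
    while True:
        did_work = run_pass(db, tasks, requery)
        if not did_work:
            return history
        history.append(did_work)


if __name__ == "__main__":
    print(passes_until_done(requery=False))
    # [['retry_errored'], ['copy_pull'], ['add_checksum']]
    print(passes_until_done(requery=True))
    # [['retry_errored', 'copy_pull', 'add_checksum']]
```

With the stale snapshot the model reproduces the three-pass pattern in the logs above; with a re-query before each task the whole chain completes in one pass. Whether cax actually behaves like the stale case still needs to be checked against its task loop.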