Things are not straightforward.
Task 3 misses 4 files, 2 each from jobs 22 and 28. Yet they appear to be at GRIF all right:
belforte@lxplus811/~> gfal-ls -l davs://eos.grif.fr:11000/eos/grif/cms/grif/store/temp/user/cmsbot.a89d85b5d3b5e016ca3701eebcb42631cde823ba/cmsbot/ruciotransfers-1715786588/GenericTTbar/ruciotransfers-1715786588/240515_152310/0000
-rwxrwxrwx 0 0 0 633473 May 15 17:36 output_22.root
-rwxrwxrwx 0 0 0 631698 May 15 17:36 output_28.root
-rwxrwxrwx 0 0 0 633492 May 15 17:36 secondoutput_22.root
-rwxrwxrwx 0 0 0 631723 May 15 17:36 secondoutput_28.root
belforte@lxplus811/~>
Maybe it is just a matter of waiting?
Task 1 misses 6 files, from jobs 9, 21, 23, 33, 37, 38, and those are there as well!! Even though I could swear they were missing yesterday. Maybe some disk at T2_FR_GRIF went offline and came back :-( . Maybe they have not mastered EOS operations yet.
belforte@lxplus811/~> gfal-ls -l davs://eos.grif.fr:11000/eos/grif/cms/grif/store/temp/user/cmsbot.a89d85b5d3b5e016ca3701eebcb42631cde823ba/cmsbot/ruciotransfers-1715786583/GenericTTbar/ruciotransfers-1715786583/240515_152305/0000
-rwxrwxrwx 0 0 0 631713 May 15 17:35 output_21.root
-rwxrwxrwx 0 0 0 631666 May 15 17:36 output_23.root
-rwxrwxrwx 0 0 0 631808 May 15 17:36 output_33.root
-rwxrwxrwx 0 0 0 633229 May 15 17:36 output_37.root
-rwxrwxrwx 0 0 0 631446 May 15 17:35 output_38.root
-rwxrwxrwx 0 0 0 628728 May 15 17:36 output_9.root
belforte@lxplus811/~>
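Rather than eyeballing gfal-ls output for each task, the check can be scripted. A minimal sketch using the gfal2 Python bindings (directory URL and job ids taken from the task-1 listing above; assumes a valid grid proxy and the gfal2-python package, as available on lxplus):

import gfal2  # gfal2-python bindings

# Temp-area directory and expected jobs, from the task-1 example above
BASE = ("davs://eos.grif.fr:11000/eos/grif/cms/grif/store/temp/user/"
        "cmsbot.a89d85b5d3b5e016ca3701eebcb42631cde823ba/cmsbot/"
        "ruciotransfers-1715786583/GenericTTbar/ruciotransfers-1715786583/"
        "240515_152305/0000")
EXPECTED_JOBS = [9, 21, 23, 33, 37, 38]

ctx = gfal2.creat_context()       # 'creat_context' is the actual API spelling
present = set(ctx.listdir(BASE))  # names currently in the temp area

for job in EXPECTED_JOBS:
    name = f"output_{job}.root"
    if name in present:
        print(f"{name}: present, {ctx.stat(BASE + '/' + name).st_size} bytes")
    else:
        print(f"{name}: MISSING")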
Task 2 is trickier since it misses files from both T2_FR_GRIF and T2_IT_Legnaro, and sure enough, all 18 missing files are sitting at those sites waiting to be copied:
belforte@lxplus811/~> gfal-ls -l davs://t2-xrdcms.lnl.infn.it:2880/pnfs/lnl.infn.it/data/cms/store/temp/user/cmsbot.a89d85b5d3b5e016ca3701eebcb42631cde823ba/rucio/crab_test/ruciotransfers-1715786586/GenericTTbar/ruciotransfers-1715786586/240515_152307/0000/
-rwxrwxrwx 0 0 0 631962 May 15 17:34 output_10.root
-rwxrwxrwx 0 0 0 629664 May 15 17:34 output_13.root
-rwxrwxrwx 0 0 0 631091 May 15 17:34 output_15.root
-rwxrwxrwx 0 0 0 631183 May 15 17:35 output_26.root
-rwxrwxrwx 0 0 0 632556 May 15 17:36 output_31.root
-rwxrwxrwx 0 0 0 634056 May 15 17:34 output_35.root
-rwxrwxrwx 0 0 0 631446 May 15 17:34 output_38.root
-rwxrwxrwx 0 0 0 633409 May 15 17:36 output_6.root
belforte@lxplus811/~> gfal-ls -l davs://eos.grif.fr:11000/eos/grif/cms/grif/store/temp/user/cmsbot.a89d85b5d3b5e016ca3701eebcb42631cde823ba/rucio/crab_test/ruciotransfers-1715786586/GenericTTbar/ruciotransfers-1715786586/240515_152307/0000
-rwxrwxrwx 0 0 0 631057 May 15 17:34 output_12.root
-rwxrwxrwx 0 0 0 637049 May 15 17:34 output_14.root
-rwxrwxrwx 0 0 0 632327 May 15 17:34 output_18.root
-rwxrwxrwx 0 0 0 634760 May 15 17:34 output_2.root
-rwxrwxrwx 0 0 0 631713 May 15 17:35 output_21.root
-rwxrwxrwx 0 0 0 630670 May 15 17:35 output_27.root
-rwxrwxrwx 0 0 0 635035 May 15 17:34 output_29.root
-rwxrwxrwx 0 0 0 633229 May 15 17:35 output_37.root
-rwxrwxrwx 0 0 0 630332 May 15 17:34 output_4.root
-rwxrwxrwx 0 0 0 628728 May 15 17:34 output_9.root
belforte@lxplus811/~>
Maybe Rucio will try again later on and fix them all?
The error reported for task 2 is a more ominous RequestErrMsg.TRANSFER_FAILED:TRANSFER ERROR: Copy failed (3rd pull, 3rd push). Last attempt: copy HTTP 500 : Unexpected server error: 500
but it could simply mean that the storage server had a temporary issue. Rucio does not say which file the error refers to, nor the FTS job-id :-(
@novicecpp I assign this to you since this morning you asked for the task names, but I do not think there's anything to be done. Maybe we can try to bump the Rucio rules into "try now", but it may require admin privileges.
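For the record, here is how the "try now" bump could look with the Rucio Python client. A sketch only: the rule id is hypothetical, and whether boost_rule is permitted depends on the account's privileges:

from rucio.client import Client

client = Client()
rule_id = "0123456789abcdef0123456789abcdef"  # hypothetical: id of the stuck rule

# First see which locks (i.e. which files) are stuck
rule = client.get_replication_rule(rule_id)
print(rule["state"], "stuck locks:", rule["locks_stuck_cnt"])
for lock in client.list_replica_locks(rule_id):
    if lock["state"] == "STUCK":
        print(lock["name"], "->", lock["rse"])

# "Try now": boost_rule asks the judge to re-evaluate the rule promptly
# instead of waiting for the normal stuck-rule cycle
client.update_replication_rule(rule_id, {"boost_rule": True})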
Tasks 1 and 3 are now done. Task 2 is still stuck with a 500 error. For example:
davs://t2-xrdcms.lnl.infn.it:2880/pnfs/lnl.infn.it/data/cms/store/temp/user/cmsbot.a89d85b5d3b5e016ca3701eebcb42631cde823ba/rucio/crab_test/ruciotransfers-1715786586/GenericTTbar/ruciotransfers-1715786586/240515_152307/0000/output_38.root
The file is accessible and has the correct checksum. But I do not know how/where to get the FTS error logs.
Thanks for checking. I know how, but it is a pain: one needs to get it out of OpenSearch. Rucio should provide this :-( They have been talking about a new UI for years; I was waiting to see what it looks like before asking for a new feature :-( DataOps is not interested in (or capable of) tracking single failures, so I am not sure investing in some tool which digs through OpenSearch/HDFS automatically is worth it :-( Yet some general way of making an OpenSearch query from local Python, instead of spending 10 minutes in the WebUI, could be nice, and maybe they have set it up already. If you or Dario are interested, you can ping Nikodemas.
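On the "OpenSearch query from local Python" point, something along these lines should be possible with the opensearch-py client. Host, credentials, index pattern, and field name below are all assumptions, to be replaced with the actual MONIT ones (Nikodemas would know):

from opensearchpy import OpenSearch

# Hypothetical endpoint and credentials: adapt to the real MONIT OpenSearch setup
client = OpenSearch(
    hosts=["https://monit-opensearch.cern.ch:443"],
    http_auth=("username", "password"),
)

# Hypothetical index pattern and field name for FTS transfer records
query = {"size": 5, "query": {"match": {"data.dst_url": "output_38.root"}}}
resp = client.search(index="monit_prod_fts_raw*", body=query)

for hit in resp["hits"]["hits"]:
    print(hit["_source"])  # includes the transfer error message, if any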
Here's the FTS log for one of the files which is still failing: https://fts-cms-007.cern.ch:8449/var/log/fts3/transfers/2024-05-20/eos.grif.fr__t2-xrdcms.lnl.infn.it/2024-05-20-1439__eos.grif.fr__t2-xrdcms.lnl.infn.it__4231741179__d08b8312-16b6-11ef-9057-fa163e7dd35e
It shows the same error as #8417:
INFO Mon, 20 May 2024 16:39:54 +0200; Davix: Negative result for operation: HTTP 404 : File not found . After 1 retry
INFO Mon, 20 May 2024 16:39:54 +0200; [1716215994457] DEST http_plugin CLEANUP 0
INFO Mon, 20 May 2024 16:39:54 +0200; Gfal2: Event triggered: DESTINATION http_plugin CLEANUP 0
INFO Mon, 20 May 2024 16:39:54 +0200; [1716215994457] BOTH http_plugin TRANSFER:EXIT ERROR: Copy failed (3rd pull, 3rd push). Last attempt: Transfer failure: Peer certificate cannot be authenticated with given CA certificates
INFO Mon, 20 May 2024 16:39:54 +0200; Gfal2: Event triggered: BOTH http_plugin TRANSFER:EXIT ERROR: Copy failed (3rd pull, 3rd push). Last attempt: Transfer failure: Peer certificate cannot be authenticated with given CA certificates
ERR Mon, 20 May 2024 16:39:54 +0200; Recoverable error: [5] TRANSFER ERROR: Copy failed (3rd pull, 3rd push). Last attempt: Transfer failure: Peer certificate cannot be authenticated with given CA certificates
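The "Peer certificate cannot be authenticated with given CA certificates" failure can be reproduced independently of FTS. A minimal sketch that attempts a TLS handshake against the Legnaro webdav door, validating the server chain against the usual grid CA directory (adjust capath if your node stores the IGTF CAs elsewhere):

import socket
import ssl

HOST, PORT = "t2-xrdcms.lnl.infn.it", 2880
# Standard grid CA location on lxplus-like nodes; an assumption for other hosts
ctx = ssl.create_default_context(capath="/etc/grid-security/certificates")

try:
    with socket.create_connection((HOST, PORT), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
            print("handshake OK:", tls.getpeercert()["subject"])
except ssl.SSLCertVerificationError as err:
    # Same class of failure as the FTS log above
    print("certificate verification failed:", err)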
Problems also show up in the FTS dashboard https://monit-grafana.cern.ch/d/mtQFDScGk/cms-fts-metrics?orgId=11&refresh=5m&var-vo=cms&var-group_by=activity&var-src_rse=T2_IT_Legnaro&var-src_rse=T2_IT_Legnaro_Temp&var-src_rse=T2_FR_GRIF&var-dst_rse=T2_IT_Legnaro&var-fts_server=All&var-activity=All&var-protocol=All&var-auth_method=All&var-bin=1h
but nobody will follow up.
Since we need transfers to T2_IT_Legnaro to work for our tests, I have created a GGUS ticket.
T2_IT_Legnaro fixed a misconfiguration and those transfers are now done. In conclusion:
Stuck transfers from T2_IT_Legnaro_Temp to T2_IT_Legnaro required local admin action.
I think that we can stick with "wait a week and then kill". Maybe give the user a way to extend the 7 days in case there's ongoing work with site admins? Or maybe extend the timeout, and if the user is fed up with waiting they can kill? We need to make sure that crab kill stops everything!
After a brief discussion in the CRAB meeting:
Will move to an ad hoc issue: #8429
Action items listed above are tracked in #8429.
Closing.
As discussed in today's meeting: I saw for the first time a situation where files disappeared from /store/temp before Rucio could copy them. This happened running Jenkins StatusTracking on the test2 instance.