dmwm / CRABServer

Rucio ASO stuck due to files disappeared in /store/temp #8416

Closed belforte closed 1 month ago

belforte commented 1 month ago

as discussed in today's meeting: I saw for the first time a situation where files disappeared from /store/temp before Rucio could copy them. This happened while running Jenkins StatusTracking on the test2 instance.

| # | Task | Rule | Grafana |
|---|------|------|---------|
| 1 | 240515_152305:cmsbot_crab_rucio_transfers_20240515_172303 | 904469b342f24ee593164ba6b390f7f2 | jobs |
| 2 | 240515_152307:cmsbot_crab_rucio_transfers_group_20240515_172306 | a83780c4d6ae4de5bfab45d89869c75b | jobs |
| 3 | 240515_152310:cmsbot_crab_rucio_transfers_manyedm_nopublication_20240515_172308 | 0b84179ad9664456b908509a0a1e2638 | jobs |

belforte commented 1 month ago

things are not straightforward

task 3 misses 4 files, 2 from each of jobs 22 and 28. Yet they appear to be at GRIF all right

belforte@lxplus811/~> gfal-ls -l  davs://eos.grif.fr:11000/eos/grif/cms/grif/store/temp/user/cmsbot.a89d85b5d3b5e016ca3701eebcb42631cde823ba/cmsbot/ruciotransfers-1715786588/GenericTTbar/ruciotransfers-1715786588/240515_152310/0000
-rwxrwxrwx   0 0     0        633473 May 15 17:36 output_22.root    
-rwxrwxrwx   0 0     0        631698 May 15 17:36 output_28.root    
-rwxrwxrwx   0 0     0        633492 May 15 17:36 secondoutput_22.root  
-rwxrwxrwx   0 0     0        631723 May 15 17:36 secondoutput_28.root  
belforte@lxplus811/~> 

maybe a matter of waiting ?

task1 misses 6 files, from jobs 9, 21, 23, 33, 37 and 38, and those are there as well !! Even though I could swear that they were missing yesterday. Maybe some disk at T2_FR_GRIF went offline and came back :-( . Maybe they have not mastered EOS operations yet.

belforte@lxplus811/~> gfal-ls -l davs://eos.grif.fr:11000/eos/grif/cms/grif/store/temp/user/cmsbot.a89d85b5d3b5e016ca3701eebcb42631cde823ba/cmsbot/ruciotransfers-1715786583/GenericTTbar/ruciotransfers-1715786583/240515_152305/0000
-rwxrwxrwx   0 0     0        631713 May 15 17:35 output_21.root    
-rwxrwxrwx   0 0     0        631666 May 15 17:36 output_23.root    
-rwxrwxrwx   0 0     0        631808 May 15 17:36 output_33.root    
-rwxrwxrwx   0 0     0        633229 May 15 17:36 output_37.root    
-rwxrwxrwx   0 0     0        631446 May 15 17:35 output_38.root    
-rwxrwxrwx   0 0     0        628728 May 15 17:36 output_9.root 
belforte@lxplus811/~> 

task2 is trickier since it misses files from both T2_FR_GRIF and T2_IT_Legnaro. Sure enough, 18 files are missing and 18 files are there at those sites (8 at Legnaro, 10 at GRIF), waiting to be copied


belforte@lxplus811/~> gfal-ls -l  davs://t2-xrdcms.lnl.infn.it:2880/pnfs/lnl.infn.it/data/cms/store/temp/user/cmsbot.a89d85b5d3b5e016ca3701eebcb42631cde823ba/rucio/crab_test/ruciotransfers-1715786586/GenericTTbar/ruciotransfers-1715786586/240515_152307/0000/              
-rwxrwxrwx   0 0     0        631962 May 15 17:34 output_10.root    
-rwxrwxrwx   0 0     0        629664 May 15 17:34 output_13.root    
-rwxrwxrwx   0 0     0        631091 May 15 17:34 output_15.root    
-rwxrwxrwx   0 0     0        631183 May 15 17:35 output_26.root    
-rwxrwxrwx   0 0     0        632556 May 15 17:36 output_31.root    
-rwxrwxrwx   0 0     0        634056 May 15 17:34 output_35.root    
-rwxrwxrwx   0 0     0        631446 May 15 17:34 output_38.root    
-rwxrwxrwx   0 0     0        633409 May 15 17:36 output_6.root 
belforte@lxplus811/~> gfal-ls -l davs://eos.grif.fr:11000/eos/grif/cms/grif/store/temp/user/cmsbot.a89d85b5d3b5e016ca3701eebcb42631cde823ba/rucio/crab_test/ruciotransfers-1715786586/GenericTTbar/ruciotransfers-1715786586/240515_152307/0000
-rwxrwxrwx   0 0     0        631057 May 15 17:34 output_12.root    
-rwxrwxrwx   0 0     0        637049 May 15 17:34 output_14.root    
-rwxrwxrwx   0 0     0        632327 May 15 17:34 output_18.root    
-rwxrwxrwx   0 0     0        634760 May 15 17:34 output_2.root 
-rwxrwxrwx   0 0     0        631713 May 15 17:35 output_21.root    
-rwxrwxrwx   0 0     0        630670 May 15 17:35 output_27.root    
-rwxrwxrwx   0 0     0        635035 May 15 17:34 output_29.root    
-rwxrwxrwx   0 0     0        633229 May 15 17:35 output_37.root    
-rwxrwxrwx   0 0     0        630332 May 15 17:34 output_4.root 
-rwxrwxrwx   0 0     0        628728 May 15 17:34 output_9.root 
belforte@lxplus811/~> 
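
Comparing these listings by eye against the list of missing jobs is error-prone. A minimal sketch of doing it in Python, assuming the `gfal-ls -l` output has been captured as text and has the column layout shown above; the helper names `parse_listing` and `jobs_in_listing` are made up for illustration and are not part of CRAB or gfal2:

```python
import re

# One `gfal-ls -l` row as printed above: mode, links, uid, gid, size, date, name.
LISTING_ROW = re.compile(
    r"^(?P<mode>[-dl][rwx-]{9})\s+\d+\s+\S+\s+\S+\s+(?P<size>\d+)\s+"
    r"(?P<date>\w{3}\s+\d{1,2}\s+[\d:]+)\s+(?P<name>\S+)\s*$"
)
# The job number is the trailing integer in names such as output_22.root.
JOB_ID = re.compile(r"_(\d+)\.root$")


def parse_listing(text):
    """Return {file name: size in bytes} for every file row in the listing."""
    files = {}
    for line in text.splitlines():
        match = LISTING_ROW.match(line)
        if match:  # shell prompts and blank lines do not match and are skipped
            files[match.group("name")] = int(match.group("size"))
    return files


def jobs_in_listing(text):
    """Return the sorted job numbers that still have files in /store/temp."""
    jobs = set()
    for name in parse_listing(text):
        found = JOB_ID.search(name)
        if found:
            jobs.add(int(found.group(1)))
    return sorted(jobs)


# Listing for task 3, copied from the first gfal-ls output in this issue.
sample = """\
-rwxrwxrwx   0 0     0        633473 May 15 17:36 output_22.root
-rwxrwxrwx   0 0     0        631698 May 15 17:36 output_28.root
-rwxrwxrwx   0 0     0        633492 May 15 17:36 secondoutput_22.root
-rwxrwxrwx   0 0     0        631723 May 15 17:36 secondoutput_28.root
belforte@lxplus811/~>
"""
print(jobs_in_listing(sample))
```

The resulting job list can then be compared with the jobs that Rucio reports as not yet transferred.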

maybe Rucio will try again later on and fix them all? The error reported for task2 is more ominous: `RequestErrMsg.TRANSFER_FAILED:TRANSFER ERROR: Copy failed (3rd pull, 3rd push). Last attempt: copy HTTP 500 : Unexpected server error: 500`, but it could simply mean that the storage server had a temporary issue. Rucio tells neither which file the error refers to nor the FTS job-id :-(

belforte commented 1 month ago

@novicecpp I am assigning this to you since you asked for the task names this morning, but I do not think there's anything to be done. Maybe we can try to bump the Rucio rules into "try now", but that may require admin privileges.

novicecpp commented 1 month ago

1 and 3 are now done. 2 is still stuck with a 500 error. For example:

davs://t2-xrdcms.lnl.infn.it:2880/pnfs/lnl.infn.it/data/cms/store/temp/user/cmsbot.a89d85b5d3b5e016ca3701eebcb42631cde823ba/rucio/crab_test/ruciotransfers-1715786586/GenericTTbar/ruciotransfers-1715786586/240515_152307/0000/output_38.root

The file is accessible and has the correct checksum. But I do not know how or where to get the FTS error logs.

belforte commented 1 month ago

thanks for checking. I know how, but it is a pain: one needs to get it out of OpenSearch. Rucio should provide this :-( They have been talking about a new UI for years, and I was waiting to see what it looks like before asking for a new feature :-( DataOps is not interested in, or capable of, tracking single failures, so I am not sure that investing in a tool which digs into OpenSearch/HDFS automatically is worth it :-( Yet some general way of making an OS query from local python, instead of spending 10 min in the WebUI, would be nice, and maybe they have set it up already. If you or Dario are interested, you can ping Nikodemas.
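
For reference, a rough sketch of what the "OS query from local python" could start from: a function that only builds the OpenSearch query-DSL body for failed transfers between two hosts. Everything specific here is an assumption: the field names (`src_hostname`, `dst_hostname`, `state`, `timestamp`) and the `"FAILED"` value are placeholders, and the real index name, field names, endpoint and authentication have to come from whoever runs the monitoring.

```python
import json


def build_failed_transfer_query(src_host, dst_host, since="now-7d", size=20,
                                src_field="src_hostname",
                                dst_field="dst_hostname",
                                state_field="state",
                                time_field="timestamp"):
    """Build a query-DSL body selecting recent failed transfers between two hosts.

    The *_field defaults and the "FAILED" state value are hypothetical
    placeholders, not the actual schema of the CMS FTS/Rucio indices.
    """
    return {
        "size": size,  # maximum number of documents to return
        "sort": [{time_field: {"order": "desc"}}],  # newest first
        "query": {
            "bool": {
                # filter clauses are exact matches and do not affect scoring
                "filter": [
                    {"term": {src_field: src_host}},
                    {"term": {dst_field: dst_host}},
                    {"term": {state_field: "FAILED"}},
                    {"range": {time_field: {"gte": since}}},
                ]
            }
        },
    }


body = build_failed_transfer_query("eos.grif.fr", "t2-xrdcms.lnl.infn.it")
print(json.dumps(body, indent=2))
```

The printed JSON is what would be sent to the `_search` endpoint of the relevant index; how to reach that endpoint from lxplus is exactly the part that still needs to be asked.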

belforte commented 1 month ago

here's FTS log for one of the files which are still failing https://fts-cms-007.cern.ch:8449/var/log/fts3/transfers/2024-05-20/eos.grif.fr__t2-xrdcms.lnl.infn.it/2024-05-20-1439__eos.grif.fr__t2-xrdcms.lnl.infn.it__4231741179__d08b8312-16b6-11ef-9057-fa163e7dd35e

shows same error as #8417


INFO    Mon, 20 May 2024 16:39:54 +0200; Davix: Negative result for operation: HTTP 404 : File not found . After 1 retry
INFO    Mon, 20 May 2024 16:39:54 +0200; [1716215994457] DEST http_plugin   CLEANUP 0
INFO    Mon, 20 May 2024 16:39:54 +0200; Gfal2: Event triggered: DESTINATION http_plugin CLEANUP 0
INFO    Mon, 20 May 2024 16:39:54 +0200; [1716215994457] BOTH http_plugin   TRANSFER:EXIT   ERROR: Copy failed (3rd pull, 3rd push). Last attempt: Transfer failure: Peer certificate cannot be authenticated with given CA certificates
INFO    Mon, 20 May 2024 16:39:54 +0200; Gfal2: Event triggered: BOTH http_plugin TRANSFER:EXIT ERROR: Copy failed (3rd pull, 3rd push). Last attempt: Transfer failure: Peer certificate cannot be authenticated with given CA certificates
ERR     Mon, 20 May 2024 16:39:54 +0200; Recoverable error: [5] TRANSFER ERROR: Copy failed (3rd pull, 3rd push). Last attempt: Transfer failure: Peer certificate cannot be authenticated with given CA certificates
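
Once the FTS log text is in hand, pulling out the real failure reason can be scripted. A small sketch, assuming the `LEVEL  timestamp; message` layout seen in the excerpt above; `parse_log` and `last_attempt_errors` are illustrative names, not FTS or gfal2 functions:

```python
import re
from datetime import datetime

# Layout taken from the excerpt above: LEVEL, timestamp, ';', then the message.
LOG_LINE = re.compile(r"^(?P<level>[A-Z]+)\s+(?P<stamp>[^;]+);\s*(?P<message>.*)$")
# %a and %b expect English day and month names (default C locale).
STAMP_FORMAT = "%a, %d %b %Y %H:%M:%S %z"


def parse_log(text):
    """Return a list of (level, datetime, message) tuples, one per log line."""
    entries = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip anything that is not a log line
        when = datetime.strptime(match.group("stamp").strip(), STAMP_FORMAT)
        entries.append((match.group("level"), when, match.group("message")))
    return entries


def last_attempt_errors(text):
    """Return the distinct reasons following 'Last attempt: ', in order of appearance."""
    reasons = []
    for _level, _when, message in parse_log(text):
        _, marker, reason = message.partition("Last attempt: ")
        if marker and reason not in reasons:
            reasons.append(reason)
    return reasons


# Two lines copied from the FTS log excerpt above.
sample_log = (
    "INFO    Mon, 20 May 2024 16:39:54 +0200; Davix: Negative result for "
    "operation: HTTP 404 : File not found . After 1 retry\n"
    "ERR     Mon, 20 May 2024 16:39:54 +0200; Recoverable error: [5] TRANSFER "
    "ERROR: Copy failed (3rd pull, 3rd push). Last attempt: Transfer failure: "
    "Peer certificate cannot be authenticated with given CA certificates\n"
)
print(last_attempt_errors(sample_log))
```

For this transfer the extracted reason is the CA certificate failure, which is more specific than the HTTP 500 that Rucio reported for the same task.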

belforte commented 1 month ago

problems also show up in FTS dashboard https://monit-grafana.cern.ch/d/mtQFDScGk/cms-fts-metrics?orgId=11&refresh=5m&var-vo=cms&var-group_by=activity&var-src_rse=T2_IT_Legnaro&var-src_rse=T2_IT_Legnaro_Temp&var-src_rse=T2_FR_GRIF&var-dst_rse=T2_IT_Legnaro&var-fts_server=All&var-activity=All&var-protocol=All&var-auth_method=All&var-bin=1h

but nobody will follow up

belforte commented 1 month ago

since we need transfers to T2_IT_Legnaro to work for our tests, I have created a GGUS ticket

belforte commented 1 month ago

T2_IT_Legnaro fixed a misconfiguration and those transfers are now done. In conclusion:

belforte commented 1 month ago

after a brief discussion in the CRAB meeting:

Will move to an ad hoc issue: #8429

belforte commented 1 month ago

action items listed above are tracked in #8429

closing