haozturk opened 9 months ago
I've added the suspicious patterns to the finisher config in the integration:
{"message": "Suspicious patterns: ['.*No such file or directory.*', '.*no such file or directory.*', '.*CHECKSUM MISMATCH Source and destination checksums do not match.*', '.*SOURCE CHECKSUM MISMATCH.*', '.*Unable to read file - wrong file checksum.*', '.*checksum verification failed.*', '.*direct_access.*', '.*Copy failed with mode 3rd pull', 'with error: Transfer failed: failure: Server returned nothing.*', '.*HTTP 404 : File not found.*']", "error": {"type": null, "message": null, "stack_trace": null}, "log": {"logger": "root", "level": "DEBUG"}, "process": {"pid": 9}, "@timestamp": "2024-01-19T12:26:05.379Z"}
Now working on producing relevant transfer errors to see whether the finisher will be able to mark them as suspicious.
I produced some stuck rules with checksum problems in the integration. The finisher is able to spot them, but it is not able to mark them as suspicious:
{"log": {"logger": "root", "level": "DEBUG"}, "@timestamp": "2024-01-23T14:06:23.430Z", "process": {"pid": 9}, "message": "Checking suspicious file for request: 23c97c6470fd44d88d1dfa4abce0ffa9, transfer error: RequestErrMsg.TRANSFER_FAILED:TRANSFER [5] SOURCE CHECKSUM MISMATCH Source and user-defined ADLER32 do not match (6bc75436 != 18134cde)", "error": {"type": null, "message": null, "stack_trace": null}}
{"log": {"logger": "root", "level": "DEBUG"}, "@timestamp": "2024-01-23T14:06:23.433Z", "process": {"pid": 9}, "message": "Found suspicious urls: ['davs://eoscms.cern.ch:443/eos/cms/store/backfill/1/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv5d-v11/00000/99141D34-7C33-3747-B330-E16940FA6DAB.root']", "error": {"type": null, "message": null, "stack_trace": null}}
The problem seems to be in the `get_pfn_to_rse` function [1], which cannot obtain the RSE from the PFNs. Looking deeper.
Okay, I confirm that the reason why conveyor-finisher cannot mark a replica suspicious in the bad_replicas table is this ATLAS-specific code [1]. In particular, the DID name is computed incorrectly. For instance, the `replicas` variable gets the following value

```
[{'scope': store, 'name': '9E814597-78C2-CE44-BDAB-C1F333E4BAF8.root', 'rse_id': '0f87087236244c4ca443472498e0aef2', 'state': <BadFilesStatus.SUSPICIOUS: 'S'>}]
```

for the following PFN

```
davs://eoscms.cern.ch:443/eos/cms/store/temp/sitetest/store/mc/RunIISummer20UL17MiniAODv2/GluGluSpin0ToGammaGamma_W-0p014_M-650_TuneCP2_13TeV_pythia8/MINIAODSIM/106X_mc2017_realistic_v9BACKFILL-v3/00000/9E814597-78C2-CE44-BDAB-C1F333E4BAF8.root
```

where the name of the replica is incorrect (a reconstruction of the failure mode follows below).
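To make the failure mode concrete, here is a hypothetical reconstruction of what the parsing appears to do, inferred only from the observed `replicas` value above (the function and the prefix value are illustrative assumptions, not the actual replica.py code):

```python
# Illustration only: apparent behaviour is to take the first path component
# after the RSE prefix as the scope and the basename as the name.
def atlas_style_parse(path: str, prefix: str) -> tuple[str, str]:
    parts = path.removeprefix(prefix).strip('/').split('/')
    return parts[0], parts[-1]

path = ('/eos/cms/store/temp/sitetest/store/mc/RunIISummer20UL17MiniAODv2/'
        'GluGluSpin0ToGammaGamma_W-0p014_M-650_TuneCP2_13TeV_pythia8/MINIAODSIM/'
        '106X_mc2017_realistic_v9BACKFILL-v3/00000/'
        '9E814597-78C2-CE44-BDAB-C1F333E4BAF8.root')

# Assumed RSE prefix for this example.
scope, name = atlas_style_parse(path, '/eos/cms/store/temp/sitetest')
print(scope, name)  # store 9E814597-78C2-CE44-BDAB-C1F333E4BAF8.root
# For CMS, the correct result would be scope "cms" and the full /store/... LFN.
```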
@ericvaandering @voetberg you mentioned that this is already known. Should we handle this as a 3rd step in [2] or is there already a ticket open for this in Rucio?
[1] https://github.com/rucio/rucio/blob/e819e3c001d8613076728ffd8f424fa3f2bdc812/lib/rucio/core/replica.py#L273
[2] https://github.com/rucio/rucio/issues/6396
@haozturk Yep, this is already known and there's an open PR with a solution: issue https://github.com/rucio/rucio/issues/6363 and PR https://github.com/rucio/rucio/pull/6444. It's based on VOs instead of config settings but should have similar results.
Ah, thanks a lot @voetberg. I'll keep this issue in "Waiting" then. I can re-test it once the PR is merged and installed in the integration cluster.
Hi @voetberg I patched the integration cluster with your PR 6444 for testing, and I see that we still can't mark replicas suspicious. This is the relevant snippet from the finisher logs:
{"@timestamp": "2024-05-23T14:45:50.229Z", "log": {"logger": "root", "level": "DEBUG"}, "message": "Checking suspicious file for request: 22cc26303cb9423f8d67f273e263e476, transfer error: RequestErrMsg.TRANSFER_FAILED:TRANSFER [93] Protocol not supported or path/url invalid: gsiftp://gftp.t2.ucsd.edu:2811/hadoop/cms/phedex/store/test/rucio/int/store/data/Run2016F/MuonEG/MINIAOD/HIPM_UL2016_MiniAODv2-v2/280000", "error": {"type": null, "message": null, "stack_trace": null}, "process": {"pid": 9}}
{"@timestamp": "2024-05-23T14:45:50.233Z", "log": {"logger": "root", "level": "DEBUG"}, "message": "Found suspicious urls: ['srm://cmsdcatape.fnal.gov:8443/srm/managerv2?SFN=/11/store/data/Run2016F/MuonEG/MINIAOD/HIPM_UL2016_MiniAODv2-v2/280000/B3FF92F9-855D-5144-BA28-3877560A93B2.root']", "error": {"type": null, "message": null, "stack_trace": null}, "process": {"pid": 9}}
{"@timestamp": "2024-05-23T14:45:50.253Z", "log": {"logger": "root", "level": "DEBUG"}, "message": "gfal.Default: parsing 1 pfns", "error": {"type": null, "message": null, "stack_trace": null}, "process": {"pid": 9}}
and this shows that we couldn't mark them suspicious:
```
$ rucio-int list-suspicious-replicas
RSE Expression:    Scope:    Created at:    Nattempts:    File Name:
-----------------  --------  -------------  ------------  ------------
```
(`rucio-int` is an alias: `alias rucio-int="rucio -H http://cms-rucio-int.cern.ch --auth-host https://cms-rucio-auth-int.cern.ch"`)
Can you please help me understand what the issue is?
PR 6444 only covers correctly parsing the scope and name in "declare bad replicas". Did you also patch with https://github.com/rucio/rucio/pull/6764?
PR 6764 shouldn't be necessary for marking replicas suspicious. The suspicious-replica-recoverer daemon doesn't mark replicas suspicious; it processes the replicas that are already marked suspicious and declares them bad when necessary. I'm testing the parts that mark replicas suspicious, which happens in two places in Rucio: kronos and the conveyor-finisher. In the context of this issue, I'm testing the latter, and simply this line [1] doesn't work.
[1] https://github.com/rucio/rucio/blob/master/lib/rucio/daemons/conveyor/finisher.py#L373
Ah okay, I misunderstood you there - I was looking for the obvious solution.
Then I'm not immediately sure; I'm as stumped as you are. Can you verify that the test RSE is deterministic? Otherwise I can jump in and start adding logging statements everywhere to see where this gets stuck.
Thanks Maggie, yes it's deterministic:
```
$ rucio-admin-int rse info T1_US_FNAL_Disk | grep determ
deterministic: True
```
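The same check via the Python client should look like this, assuming a client configured against the integration instance:

```python
from rucio.client.rseclient import RSEClient

# get_rse returns the RSE properties dict, including the deterministic flag.
print(RSEClient().get_rse('T1_US_FNAL_Disk')['deterministic'])  # True
```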
Adding more logging would indeed help; the current logs don't help much. I couldn't figure out an easy way to add logging lines. Probably we need to add new log lines, make a temporary patch, deploy it, and watch what's going on.
Hi @voetberg I've found what's wrong: the `scope` and `name` extraction from the PFN looks problematic. For this PFN

```
srm://cmsdcatape.fnal.gov/11/store/data/Run2016F/MuonEG/MINIAOD/HIPM_UL2016_MiniAODv2-v2/280000/B3FF92F9-855D-5144-BA28-3877560A93B2.root
```

it yields

```
name: B3FF92F9-855D-5144-BA28-3877560A93B2.root
scope: 11
```

whereas it should've been

```
scope = "cms"
name = "/store/data/Run2016F/MuonEG/MINIAOD/HIPM_UL2016_MiniAODv2-v2/280000/B3FF92F9-855D-5144-BA28-3877560A93B2.root"
```

which is computed here: https://github.com/rucio/rucio/blob/cf9c2c7afad37fc5b7bdd782e281918b4230b2f1/lib/rucio/core/replica.py#L275

As far as I'm aware, the `scope` cannot be extracted from the PFN in CMS. Once you know the LFN and the `rse_id`, you can get it from the replicas table. If there are multiple replicas with the same name and `rse_id` but different scopes, we can default to `cms`, which is the scope we use for all centrally produced data (a small sketch of this fallback follows below). Maybe @ericvaandering has a better idea?

Does it sound reasonable? Do you want me to fix this upstream?
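A minimal sketch of the fallback I have in mind, with hypothetical inputs (the non-cms scope below is made up for illustration):

```python
def resolve_scope(candidate_scopes: list[str]) -> str:
    """Pick the scope among replicas sharing the same name and rse_id."""
    if len(set(candidate_scopes)) == 1:
        return candidate_scopes[0]
    # Ambiguous: default to "cms", the scope used for all centrally
    # produced data.
    return 'cms'

print(resolve_scope(['cms']))               # cms
print(resolve_scope(['cms', 'user.jdoe']))  # cms (ambiguous -> default)
```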
@haozturk The way the code is set up, you may be able to write a plugin that just returns "cms" as the scope for any PFN. Right now it's written so that it uses the VO to determine which code to use to extract the scope from the PFN. I think it would be as simple as patching in:
```python
class CMSScopeTranslation(RSEDeterministicScopeTranslation):
    def __init__(self, vo: str = 'def'):
        super().__init__()
        self.register("cms", CMSScopeTranslation._cms)  # <- name here corresponds to the vo

    @staticmethod
    def _cms(pfn: str) -> str:
        return "cms"
```
Thanks @voetberg, indeed this would be the most convenient. However, I'm a bit worried about the exceptions where the scope isn't `cms`. If @ericvaandering has no other suggestion, I can apply it.

Lastly, we need to change the way we extract the name/LFN, but that should be easy (a sketch follows).
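For the name/LFN part, something like the following should do, assuming the RSE prefix is known (the prefix value here is a guess for this example):

```python
def cms_lfn_from_pfn(path: str, prefix: str) -> str:
    """Strip the RSE prefix from a PFN path to recover the /store/... LFN."""
    return '/' + path.removeprefix(prefix).lstrip('/')

path = ('/11/store/data/Run2016F/MuonEG/MINIAOD/HIPM_UL2016_MiniAODv2-v2/'
        '280000/B3FF92F9-855D-5144-BA28-3877560A93B2.root')
print(cms_lfn_from_pfn(path, '/11'))
# /store/data/Run2016F/MuonEG/MINIAOD/HIPM_UL2016_MiniAODv2-v2/280000/B3FF92F9-855D-5144-BA28-3877560A93B2.root
```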
Igor did something similar here: https://github.com/rucio/rucio/pull/6350. Not sure if it's entirely the same though.
Context: https://indico.cern.ch/event/1356295/
Needed for: https://github.com/dmwm/CMSRucio/issues/403
I suggest putting the following patterns into the config in integration and watching how it goes:

Even if we do it in production, it should do nothing but mark some replicas suspicious depending on transfer errors, which has no effect as long as the `suspicious-replica-recoverer` daemon is not running. But let's not do anything in production for now, especially before the Christmas break. @ericvaandering @dynamic-entropy any thoughts?