dmwm / CMSRucio


[DEV] Suspicious replica recoverer: Configure conveyor-finisher such that it starts marking replicas suspicious #692

Open haozturk opened 9 months ago

haozturk commented 9 months ago

Context: https://indico.cern.ch/event/1356295/ Needed for https://github.com/dmwm/CMSRucio/issues/403

I suggest putting the following patterns into the config in integration and watching how it goes:

- No such file or directory,
- CHECKSUM MISMATCH Source and destination checksums do not match,
- SOURCE CHECKSUM MISMATCH,
- Unable to read file - wrong file checksum,
- checksum verification failed
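Presumably the finisher would match these patterns as regular expressions against the transfer error string. A minimal sketch of that matching (the patterns are taken from this thread; the function name is made up for illustration):

```python
import re

# Candidate suspicious-error patterns, as proposed above.
SUSPICIOUS_PATTERNS = [
    r".*No such file or directory.*",
    r".*CHECKSUM MISMATCH Source and destination checksums do not match.*",
    r".*SOURCE CHECKSUM MISMATCH.*",
    r".*Unable to read file - wrong file checksum.*",
    r".*checksum verification failed.*",
]

def is_suspicious(transfer_error: str) -> bool:
    """Return True if the transfer error matches any configured pattern."""
    return any(re.match(p, transfer_error) for p in SUSPICIOUS_PATTERNS)

# A real checksum-mismatch error (from a later comment in this thread) matches:
print(is_suspicious(
    "RequestErrMsg.TRANSFER_FAILED:TRANSFER [5] SOURCE CHECKSUM MISMATCH "
    "Source and user-defined ADLER32 do not match (6bc75436 != 18134cde)"
))  # True
```

An unrelated error such as a plain connection timeout would not match any of the patterns and the replica would be left alone.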

Even if we do it in production, it would do nothing but mark some replicas suspicious based on transfer errors, which has no effect as long as the suspicious-replica-recoverer daemon is not running. But let's not change anything in production for now, especially before the xmas break. @ericvaandering @dynamic-entropy any thoughts?

haozturk commented 8 months ago

I've added the suspicious patterns to the finisher config in the integration:

{"message": "Suspicious patterns: ['.*No such file or directory.*', '.*no such file or directory.*', '.*CHECKSUM MISMATCH Source and destination checksums do not match.*', '.*SOURCE CHECKSUM MISMATCH.*', '.*Unable to read file - wrong file checksum.*', '.*checksum verification failed.*', '.*direct_access.*', '.*Copy failed with mode 3rd pull', 'with error: Transfer failed: failure: Server returned nothing.*', '.*HTTP 404 : File not found.*']", "error": {"type": null, "message": null, "stack_trace": null}, "log": {"logger": "root", "level": "DEBUG"}, "process": {"pid": 9}, "@timestamp": "2024-01-19T12:26:05.379Z"}

Now working on producing relevant transfer errors to see whether the finisher will be able to mark them as suspicious.

haozturk commented 8 months ago

I produced some stuck rules with checksum problems in the integration. The finisher is able to spot them, but it is not able to mark them as suspicious:

{"log": {"logger": "root", "level": "DEBUG"}, "@timestamp": "2024-01-23T14:06:23.430Z", "process": {"pid": 9}, "message": "Checking suspicious file for request: 23c97c6470fd44d88d1dfa4abce0ffa9, transfer error: RequestErrMsg.TRANSFER_FAILED:TRANSFER [5] SOURCE CHECKSUM MISMATCH Source and user-defined ADLER32 do not match (6bc75436 != 18134cde)", "error": {"type": null, "message": null, "stack_trace": null}}
{"log": {"logger": "root", "level": "DEBUG"}, "@timestamp": "2024-01-23T14:06:23.433Z", "process": {"pid": 9}, "message": "Found suspicious urls: ['davs://eoscms.cern.ch:443/eos/cms/store/backfill/1/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv5d-v11/00000/99141D34-7C33-3747-B330-E16940FA6DAB.root']", "error": {"type": null, "message": null, "stack_trace": null}}

The problem seems to be in the get_pfn_to_rse function [1], which cannot obtain the RSE from the PFNs. Looking deeper.

[1] https://github.com/rucio/rucio/blob/e819e3c001d8613076728ffd8f424fa3f2bdc812/lib/rucio/core/replica.py#L445
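To see why generic PFN parsing goes wrong for CMS: a heuristic that takes the last path component as the file name and an early path component as the scope produces nonsense for CMS PFNs, whose LFN is the full /store/... path under the scope cms. The helper below is purely illustrative (it is not the actual Rucio code):

```python
from urllib.parse import urlparse

def naive_scope_name(pfn: str) -> tuple[str, str]:
    """Illustrative only: take the first path component as the scope
    and the last one as the name, as a generic deterministic layout might."""
    parts = urlparse(pfn).path.strip("/").split("/")
    return parts[0], parts[-1]

# The suspicious PFN from the log above:
pfn = ("davs://eoscms.cern.ch:443/eos/cms/store/backfill/1/DMWM_Test/"
       "RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv5d-v11/00000/"
       "99141D34-7C33-3747-B330-E16940FA6DAB.root")
print(naive_scope_name(pfn))
# Neither component is right for CMS: the scope should be "cms" and the
# name should be the full /store/... LFN, not just the basename.
```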

haozturk commented 8 months ago

Okay, I confirm that the reason the conveyor-finisher cannot mark a replica suspicious in the bad_replicas table is this ATLAS-specific code [1].

In particular, the DID name is computed incorrectly. For instance, the replicas variable gets the following value:

[{'scope': store, 'name': '9E814597-78C2-CE44-BDAB-C1F333E4BAF8.root', 'rse_id': '0f87087236244c4ca443472498e0aef2', 'state': <BadFilesStatus.SUSPICIOUS: 'S'>}]

for the following PFN:

davs://eoscms.cern.ch:443/eos/cms/store/temp/sitetest/store/mc/RunIISummer20UL17MiniAODv2/GluGluSpin0ToGammaGamma_W-0p014_M-650_TuneCP2_13TeV_pythia8/MINIAODSIM/106X_mc2017_realistic_v9BACKFILL-v3/00000/9E814597-78C2-CE44-BDAB-C1F333E4BAF8.root

where the name of the replica is incorrect.

@ericvaandering @voetberg you mentioned that this is already known. Should we handle this as a 3rd step in [2] or is there already a ticket open for this in Rucio?

[1] https://github.com/rucio/rucio/blob/e819e3c001d8613076728ffd8f424fa3f2bdc812/lib/rucio/core/replica.py#L273 [2] https://github.com/rucio/rucio/issues/6396

voetberg commented 8 months ago

@haozturk Yep, this is already known and there's an open PR with a solution: issue https://github.com/rucio/rucio/issues/6363 and PR https://github.com/rucio/rucio/pull/6444. It's based on VOs instead of config settings, but it should have similar results.

haozturk commented 8 months ago

Ah, thanks a lot @voetberg. I'll keep this issue in Waiting then. I can re-test it once the PR is merged and installed in the integration cluster.

haozturk commented 4 months ago

Hi @voetberg, I patched the integration cluster with your PR 6444 for testing, and I see that we still can't mark replicas suspicious. This is the relevant snippet from the finisher logs:

{"@timestamp": "2024-05-23T14:45:50.229Z", "log": {"logger": "root", "level": "DEBUG"}, "message": "Checking suspicious file for request: 22cc26303cb9423f8d67f273e263e476, transfer error: RequestErrMsg.TRANSFER_FAILED:TRANSFER [93] Protocol not supported or path/url invalid: gsiftp://gftp.t2.ucsd.edu:2811/hadoop/cms/phedex/store/test/rucio/int/store/data/Run2016F/MuonEG/MINIAOD/HIPM_UL2016_MiniAODv2-v2/280000", "error": {"type": null, "message": null, "stack_trace": null}, "process": {"pid": 9}}
{"@timestamp": "2024-05-23T14:45:50.233Z", "log": {"logger": "root", "level": "DEBUG"}, "message": "Found suspicious urls: ['srm://cmsdcatape.fnal.gov:8443/srm/managerv2?SFN=/11/store/data/Run2016F/MuonEG/MINIAOD/HIPM_UL2016_MiniAODv2-v2/280000/B3FF92F9-855D-5144-BA28-3877560A93B2.root']", "error": {"type": null, "message": null, "stack_trace": null}, "process": {"pid": 9}}
{"@timestamp": "2024-05-23T14:45:50.253Z", "log": {"logger": "root", "level": "DEBUG"}, "message": "gfal.Default: parsing 1 pfns", "error": {"type": null, "message": null, "stack_trace": null}, "process": {"pid": 9}}

and this shows that we couldn't mark them suspicious:

$ rucio-int list-suspicious-replicas
RSE Expression:    Scope:    Created at:    Nattempts:    File Name:
-----------------  --------  -------------  ------------  ------------

rucio-int is an alias: alias rucio-int="rucio -H http://cms-rucio-int.cern.ch --auth-host https://cms-rucio-auth-int.cern.ch"

Can you please help me understand what the issue is?

voetberg commented 4 months ago

6444 only covers correctly parsing the scope and name in "declare bad replicas". Did you also patch with https://github.com/rucio/rucio/pull/6764?

haozturk commented 4 months ago

6764 shouldn't be necessary to mark replicas suspicious. The suspicious-replica-recoverer daemon doesn't mark replicas suspicious; it processes the replicas that are already marked suspicious and declares them bad when necessary. I'm testing the parts that mark replicas suspicious, which happens in two places in Rucio: kronos and the conveyor-finisher. In the context of this issue, I'm testing the latter, and simply this line [1] doesn't work.

[1] https://github.com/rucio/rucio/blob/master/lib/rucio/daemons/conveyor/finisher.py#L373

voetberg commented 4 months ago

Ah okay, I misunderstood you there; I was looking for the obvious solution.

Then I'm not immediately sure; I'm as stumped as you are. Can you verify that the test RSE is deterministic? Otherwise I can jump in and start adding logging statements everywhere to see where this gets stuck.

haozturk commented 4 months ago

Thanks Maggie, yes it's deterministic:

$ rucio-admin-int rse info T1_US_FNAL_Disk | grep determ
  deterministic: True

Adding more logging would indeed help; the current logs don't say much. I couldn't figure out an easy way to add logging lines, though. We probably need to add new log lines, make a temporary patch, deploy it, and watch what's going on.

haozturk commented 3 months ago

Hi @voetberg, I've found what's wrong: the scope and name extraction from the PFN is problematic. For this PFN

srm://cmsdcatape.fnal.gov/11/store/data/Run2016F/MuonEG/MINIAOD/HIPM_UL2016_MiniAODv2-v2/280000/B3FF92F9-855D-5144-BA28-3877560A93B2.root

it yields:

name: B3FF92F9-855D-5144-BA28-3877560A93B2.root
scope: 11

whereas it should have been

scope = "cms"
name = "/store/data/Run2016F/MuonEG/MINIAOD/HIPM_UL2016_MiniAODv2-v2/280000/B3FF92F9-855D-5144-BA28-3877560A93B2.root"

which is computed here: https://github.com/rucio/rucio/blob/cf9c2c7afad37fc5b7bdd782e281918b4230b2f1/lib/rucio/core/replica.py#L275

As far as I'm aware, the scope cannot be extracted from a PFN in CMS. Once you know the LFN and the rse_id, you can get it from the replicas table. If there are multiple replicas with the same name and rse_id but different scopes, we can default to cms, which is the scope we use for all centrally produced data. Maybe @ericvaandering has a better idea?

Does this sound reasonable? Do you want me to fix this upstream?
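The proposed fallback (look the scope up in the replicas table by LFN and rse_id, and default to cms when the result is ambiguous or missing) could be sketched roughly like this. Everything here is hypothetical: scope_for is a made-up name, and the in-memory REPLICAS dict stands in for a real query against Rucio's replicas table.

```python
DEFAULT_SCOPE = "cms"  # scope used for all centrally produced CMS data

# Hypothetical stand-in for the replicas table:
# maps (lfn, rse_id) to the set of scopes holding a replica there.
REPLICAS = {
    ("/store/data/Run2016F/file.root", "rse-fnal"): {"cms"},
    ("/store/user/bob/file.root", "rse-fnal"): {"cms", "user.bob"},
    ("/store/user/alice/file.root", "rse-fnal"): {"user.alice"},
}

def scope_for(lfn: str, rse_id: str) -> str:
    """Return the scope from the replicas table when it is unambiguous;
    otherwise fall back to the default cms scope."""
    scopes = REPLICAS.get((lfn, rse_id), set())
    if len(scopes) == 1:
        return next(iter(scopes))
    return DEFAULT_SCOPE

print(scope_for("/store/user/alice/file.root", "rse-fnal"))  # user.alice
print(scope_for("/store/user/bob/file.root", "rse-fnal"))    # ambiguous -> cms
```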

voetberg commented 3 months ago

@haozturk The way the code is set up, you may be able to write a plugin that just returns "cms" as the scope for any PFN. Right now it uses the VO to determine which code to use for the scope-from-PFN translation. I think it would be as simple as patching in:

class CMSScopeTranslation(RSEDeterministicScopeTranslation):

    def __init__(self, vo: str = 'def'):
        super().__init__()
        self.register("cms", CMSScopeTranslation._cms)  # <- name here corresponds to the VO

    @staticmethod
    def _cms(pfn: str) -> str:
        return "cms"
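The same registry idea can be shown self-contained, without Rucio: a minimal stand-in base class maps a VO name to a scope-from-PFN function, and the CMS entry simply ignores the PFN and returns "cms". This is an illustration of the pattern, not Rucio's actual class.

```python
from typing import Callable

class ScopeTranslationRegistry:
    """Minimal stand-in for a VO-keyed scope-translation registry:
    maps a VO name to a function deriving the scope from a PFN."""

    def __init__(self) -> None:
        self._algorithms: dict[str, Callable[[str], str]] = {}

    def register(self, vo: str, algorithm: Callable[[str], str]) -> None:
        self._algorithms[vo] = algorithm

    def scope_from_pfn(self, vo: str, pfn: str) -> str:
        return self._algorithms[vo](pfn)

registry = ScopeTranslationRegistry()
# For CMS, every PFN maps to the single "cms" scope.
registry.register("cms", lambda pfn: "cms")
print(registry.scope_from_pfn(
    "cms", "davs://eoscms.cern.ch:443/eos/cms/store/x.root"))  # cms
```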
haozturk commented 3 months ago

Thanks @voetberg, indeed this would be the most convenient. However, I'm a bit worried about the exceptions where the scope isn't cms. If @ericvaandering has no other suggestion, I can apply it.

Lastly, we also need to change the way we extract the name/LFN, but that should be easy.

dynamic-entropy commented 3 months ago

Igor did something similar here: https://github.com/rucio/rucio/pull/6350. Not sure it's entirely the same, though.