dmwm / CMSRucio

Feature: Deploy suspicious replica recoverer daemon #806

Open haozturk opened 1 month ago

haozturk commented 1 month ago

Feature Description

With https://github.com/rucio/rucio/issues/6396 fixed, this daemon should be ready to deploy. It won't do anything until we successfully start marking replicas suspicious, but we don't need to wait for the other issues to be fixed before deploying it. Will work on this a

Use Case

https://indico.cern.ch/event/1356295/

Possible Solution

To be figured out by checking how other daemons are deployed

Related Issues

@voetberg fyi

haozturk commented 1 month ago
  1. I deployed the daemon https://github.com/dmwm/rucio-flux/pull/294/files
  2. Added its config file [1] as a secret:
    1. https://github.com/dmwm/rucio-flux/pull/297
    2. https://github.com/dmwm/rucio-flux/pull/298
  3. Added the necessary configs
    [haozturk@lxplus9107 ~]$ rucio-admin-int config set --section replicarecoverer  --option rule_rse_expression --value "cms_type=int"
    Set configuration: replicarecoverer.rule_rse_expression=cms_type=int
    [haozturk@lxplus9107 ~]$ rucio-admin-int config set --section replicarecoverer  --option use_file_metadata --value False
    Set configuration: replicarecoverer.use_file_metadata=False
    [haozturk@lxplus9107 ~]$ rucio-admin-int config set --section replicarecoverer  --option did_name_expression --value "RAW"
    Set configuration: replicarecoverer.did_name_expression=RAW
  4. I declared a replica suspicious 5 times manually, in a very hacky way, because the automatic suspicious declaration doesn't work at the moment [2]:
    $ rucio-int list-suspicious-replicas
    RSE Expression:        Scope:    Created at:            Nattempts:  File Name:
    ---------------------  --------  -------------------  ------------  -------------------------------------------------------------------------------------------------------------
    T1_US_FNAL_Tape_Input  cms       2021-10-28 15:35:01             5  /store/data/Run2016F/MuonEG/MINIAOD/HIPM_UL2016_MiniAODv2-v2/280000/B3FF92F9-855D-5144-BA28-3877560A93B2.root
  5. Now, trying to fix the next issue:
    {"message": "[1/6]: Exception\nProvided RSE expression is considered invalid.\nDetails: RSE Expression resulted in an empty set.\n  File \"/usr/local/lib/python3.9/site-packages/rucio/daemons/common.py\", line 215, in _generator\n    result = run_once_fnc(heartbeat_handler=heartbeat_handler, activity=activity)\n  File \"/usr/local/lib/python3.9/site-packages/rucio/daemons/replicarecoverer/suspicious_replica_recoverer.py\", line 242, in run_once\n    rse_list = sorted([rse for rse in parse_expression('enable_suspicious_file_recovery=true') if rse['vo'] == vo], key=lambda k: k['rse'])\n  File \"/usr/local/lib/python3.9/site-packages/rucio/db/sqla/session.py\", line 453, in new_funct\n    result = function(*args, session=session, **kwargs)\n  File \"/usr/local/lib/python3.9/site-packages/rucio/core/rse_expression_parser.py\", line 95, in parse_expression\n    raise InvalidRSEExpression('RSE Expression resulted in an empty set.')\n", "error": {"type": "InvalidRSEExpression", "message": "Provided RSE expression is considered invalid.\nDetails: RSE Expression resulted in an empty set.", "stack_trace": "  File \"/usr/local/lib/python3.9/site-packages/rucio/daemons/common.py\", line 215, in _generator\n    result = run_once_fnc(heartbeat_handler=heartbeat_handler, activity=activity)\n  File \"/usr/local/lib/python3.9/site-packages/rucio/daemons/replicarecoverer/suspicious_replica_recoverer.py\", line 242, in run_once\n    rse_list = sorted([rse for rse in parse_expression('enable_suspicious_file_recovery=true') if rse['vo'] == vo], key=lambda k: k['rse'])\n  File \"/usr/local/lib/python3.9/site-packages/rucio/db/sqla/session.py\", line 453, in new_funct\n    result = function(*args, session=session, **kwargs)\n  File \"/usr/local/lib/python3.9/site-packages/rucio/core/rse_expression_parser.py\", line 95, in parse_expression\n    raise InvalidRSEExpression('RSE Expression resulted in an empty set.')\n"}, "@timestamp": "2024-06-11T10:06:26.377Z", "log": {"level": 
"CRITICAL", "logger": "root"}, "process": {"pid": 9}}
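
The traceback in step 5 shows the daemon resolving the RSE expression `enable_suspicious_file_recovery=true` and getting an empty set, which presumably means that no RSE carries that attribute yet. The sketch below is a toy model of that lookup, not Rucio's actual parser: the function `resolve_attribute_expression`, the attribute table, and the RSE name `T2_EXAMPLE_Disk` are made up for illustration, and only a single `key=value` term is handled.

```python
class InvalidRSEExpression(Exception):
    """Local stand-in for the Rucio exception of the same name."""


def resolve_attribute_expression(expression, rse_attributes):
    """Return the sorted RSE names whose attributes satisfy one `key=value` term.

    Toy model only: the real parser also handles set operations, RSE names, etc.
    """
    key, _, value = expression.partition("=")
    matches = sorted(
        rse for rse, attrs in rse_attributes.items() if attrs.get(key) == value
    )
    if not matches:
        # Same message as in the log above.
        raise InvalidRSEExpression("RSE Expression resulted in an empty set.")
    return matches


# Hypothetical attribute table; no RSE is tagged for recovery yet.
rse_attributes = {
    "T1_US_FNAL_Tape_Input": {"cms_type": "int"},
    "T2_EXAMPLE_Disk": {"cms_type": "int"},
}
expression = "enable_suspicious_file_recovery=true"

try:
    resolve_attribute_expression(expression, rse_attributes)
    failed = False
except InvalidRSEExpression:
    failed = True  # reproduces the situation in the traceback

# Once at least one RSE is tagged, the expression resolves to a non-empty set.
rse_attributes["T2_EXAMPLE_Disk"]["enable_suspicious_file_recovery"] = "true"
enabled = resolve_attribute_expression(expression, rse_attributes)
```

If that reading is right, the likely fix is to set the attribute on the RSEs that should take part in recovery, for example with `rucio-admin rse set-attribute --rse <RSE> --key enable_suspicious_file_recovery --value true`. Which RSEs to tag is still to be decided.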

[1]

[
    {
        "action": "ignore",
        "datatype": ["RAW"],
        "scope": []
    },
    {
        "action": "declare bad",
        "datatype": [],
        "scope": []
    }
]

[2] https://github.com/dmwm/CMSRucio/issues/692#issuecomment-2133123551
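
To make the intent of the policy file in [1] concrete, here is a minimal sketch of how such a rule list could be evaluated. It assumes that rules are checked in order, that the first match wins, and that an empty `datatype` or `scope` list matches anything. These semantics and the helper `choose_action` are assumptions for illustration, not taken from the daemon's code.

```python
import json

# The policy from [1], copied verbatim.
POLICY_JSON = """
[
    {"action": "ignore", "datatype": ["RAW"], "scope": []},
    {"action": "declare bad", "datatype": [], "scope": []}
]
"""


def choose_action(policies, scope, datatype):
    """Return the action of the first rule matching (scope, datatype), else None.

    Assumption: an empty list in a rule acts as a wildcard.
    """
    for policy in policies:
        scope_ok = not policy["scope"] or scope in policy["scope"]
        datatype_ok = not policy["datatype"] or datatype in policy["datatype"]
        if scope_ok and datatype_ok:
            return policy["action"]
    return None


policies = json.loads(POLICY_JSON)

# Under these assumptions RAW files are left alone ...
raw_action = choose_action(policies, scope="cms", datatype="RAW")
# ... and every other datatype falls through to the catch-all rule.
miniaod_action = choose_action(policies, scope="cms", datatype="MINIAOD")
```

With this reading, the MINIAOD replica listed in step 4 would be declared bad once the daemon processes it, while RAW replicas would be ignored.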