dmwm / CMSRucio


Enhancement: Implement monitoring for suspicious replica recoverer #862

Open haozturk opened 4 days ago

haozturk commented 4 days ago

Enhancement Description

We need monitoring for the actions that the replica recoverer daemon takes.

Use Case

Possible Solution

I think we simply need the suspicious replicas that the daemon processes (file name and RSE, or simply the PFN) and the action it took for each one (ignore, create rule, declare bad, or declare temporarily unavailable). Ideally these should be pushed to Rucio event monitoring.
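To make the proposal concrete, here is a rough sketch of what one such record could look like, written as plain Python with no Rucio imports. Everything in it is an assumption for discussion: the class names `RecovererAction` and `RecovererEvent`, the field names, and the `suspicious-replica-recoverer` event type are hypothetical and do not reflect an existing Rucio message schema. How the record would actually be handed to Rucio's event machinery is left open.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class RecovererAction(str, Enum):
    """The four actions listed in this issue (names are hypothetical)."""
    IGNORE = "ignore"
    CREATE_RULE = "create_rule"
    DECLARE_BAD = "declare_bad"
    DECLARE_TEMPORARY_UNAVAILABLE = "declare_temporary_unavailable"


@dataclass
class RecovererEvent:
    """One processed suspicious replica plus the action taken on it."""
    scope: str
    name: str                        # file name (LFN)
    rse: str
    action: RecovererAction
    pfn: Optional[str] = None        # optional alternative to name + RSE
    created_at: Optional[str] = None # ISO 8601 timestamp; filled in if missing

    def to_json(self) -> str:
        payload = asdict(self)
        payload["action"] = self.action.value
        if payload["created_at"] is None:
            payload["created_at"] = datetime.now(timezone.utc).isoformat()
        # "event_type" is a made-up label, not an existing Rucio event type.
        return json.dumps(
            {"event_type": "suspicious-replica-recoverer", "payload": payload},
            sort_keys=True,
        )


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    event = RecovererEvent(
        scope="cms",
        name="/store/data/example.root",
        rse="T2_XX_Example",
        action=RecovererAction.DECLARE_BAD,
    )
    print(event.to_json())
```

One record per processed replica keeps the payload flat, so a dashboard could group by `action` or `rse` without further parsing.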

Related Issues

No response

haozturk commented 3 days ago

High-level monitoring is already available in the FTS dashboard [1]. Rucio uses the Recovery activity for the transfers that replace bad replicas with healthy ones, and the dashboard already shows that this activity picked up after we enabled the daemon.

[1] https://monit-grafana.cern.ch/d/mtQFDScGk/cms-fts-metrics?from=1730234072240&orgId=11&to=1731493015953&var-activity=Recovery&var-bin=1h&var-dst_rse=All&var-fts_server=All&var-group_by=dst_rse&var-src_rse=All&var-vo=cms&var-protocol=All&viewPanel=11&var-auth_method=All