dmwm / CMSRucio


Enhancement: Capture downtime for site availability status updates #828

Closed dynamic-entropy closed 2 months ago

dynamic-entropy commented 4 months ago

Enhancement Description

I propose that we consider including downtime when evaluating the write availability of a site. (This can be done as a relatively quick fix while we wait for an independent metric for storage availability.)

Use Case

Incorporating down15min as a metric to update a site's read and write availability will prevent dumping doomed transfers on FTS and the site itself.
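A minimal sketch of the idea, assuming the downtime metric can be reduced to a list of per-site time windows. The `Downtime` class, the `availability` function, and the `read`/`write` keys are hypothetical names for illustration only; they are not the actual down15min document schema or the CMSRucio code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Downtime:
    """Hypothetical, simplified downtime record (not the real metric schema)."""
    site: str
    start: int  # epoch seconds, inclusive
    end: int    # epoch seconds, exclusive


def in_downtime(site, now, downtimes):
    """Return True if any downtime window for `site` covers `now`."""
    return any(d.site == site and d.start <= now < d.end for d in downtimes)


def availability(site, now, downtimes):
    """Map the downtime state to illustrative read/write availability flags."""
    down = in_downtime(site, now, downtimes)
    return {"read": not down, "write": not down}
```

With a single window covering the current time, `availability("T1_IT_CNAF", now, downtimes)` would report both flags as `False`, so no new transfers would be queued for the site until the window ends.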

Possible Solution

Question for @stlammel

  1. Why is the status overridden for the site? https://cmssst.web.cern.ch/siteStatus/detail.html?site=T1_IT_CNAF

  2. Should there not be a status change for a site in unscheduled downtime? Unlike a scheduled downtime, an unscheduled one suggests an error rather than a neutral state.

Related Issues

No response

stlammel commented 4 months ago

So, an unscheduled downtime is a way for a site to tell users that it is aware of the issue, to avoid each VO opening a ticket, etc. Otherwise there is no special handling, and any status changes are based on the errors/failures the service causes. In the case of T1_IT_CNAF, ProdStatus and CrabStatus changed to drain/disabled in the middle of last week. I agree that for transfers we want something different, hence the new Status suggestion/plan.

stlammel commented 4 months ago

Tier-1s we usually put into WaitingRoom state only manually, thus the LifeStatus override. It's a historic thing...

dynamic-entropy commented 4 months ago

What do you suggest we do for now in that case? Now that we have seen this with a second site, it would be good to put something quick in place, at least for the time being. Use down15min?

stlammel commented 4 months ago

Well, the downtime metric also includes site downtimes due to compute services being in maintenance. It has entries/docs for the WebDAV service(s). We could put CNAF manually into WaitingRoom state if the current situation causes trouble.
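One way to handle the compute-versus-storage concern is to act only on downtime entries that name a storage service. The following is a rough sketch under stated assumptions: the `site` and `service` field names and the `"CE"` label in the usage note are hypothetical, and the real downtime docs may be structured differently.

```python
# Services treated as storage-related; only WebDAV is mentioned in this thread.
STORAGE_SERVICES = frozenset({"webdav"})


def storage_entries(entries, storage_services=STORAGE_SERVICES):
    """Keep only downtime entries whose (hypothetical) 'service' field is a storage service."""
    return [
        e for e in entries
        if e.get("service", "").lower() in storage_services
    ]


def site_storage_down(site, entries):
    """Return True if `site` has at least one storage-related downtime entry."""
    return any(e.get("site") == site for e in storage_entries(entries))
```

For example, an entry such as `{"site": "T2_BE_IIHE", "service": "CE"}` would be ignored, so a compute-only maintenance would not switch off transfers to that site.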

dynamic-entropy commented 4 months ago

> We could put CNAF manually into the WaitingRoom state if the current situation causes trouble.

We did switch it off in Rucio manually, but only after multiple messages had been exchanged over GGUS and email. It would be a relief to let it happen automatically.

haozturk commented 4 months ago

Hi all, there's a second site affected by this: IIHE is failing 5k+ transfers hourly [1,2]. We need an automatic solution as soon as possible. We cannot operate this manually. What's the way forward here?

As a fast solution, using the downtime metric still looks better than what we're doing now, no?

[1] https://monit-grafana.cern.ch/goto/KeehAz9SR?orgId=11
[2] https://cmssst.web.cern.ch/sitereadiness/report.html#T2_BE_IIHE