dynamic-entropy closed this 2 months ago
So, an unscheduled downtime is a way for a site to tell users that it is aware of the issue, so that each VO does not open its own ticket, etc. Otherwise there is no special handling, and any status change is based on the errors/failures the service causes. In the case of T1_IT_CNAF, ProdStatus and CrabStatus changed to drain/disabled in the middle of last week. I agree that for transfers we want something different, thus the new Status suggestion/plan.
We usually put Tier-1s into the WaitingRoom state only manually, hence the LifeStatus override. It's a historical thing...
What do you suggest we do for now in that case? Now that we have had this with a second site, it would be good to put something quick in place, at least for the time being.
Use down15min?
Well, the downtime metric has site downtimes also due to compute services being in maintenance. It has entries/docs for the WebDAV service(s). We could put CNAF manually into WaitingRoom state if the current situation causes trouble.
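To illustrate the concern that the downtime metric also carries compute-service downtimes, here is a minimal sketch of filtering downtime entries down to storage services only. The record shape, the field names, and the service labels are assumptions made for illustration; the real down15min documents may look different.

```python
from dataclasses import dataclass

# Hypothetical record shape; the real down15min documents may use other fields.
@dataclass
class DowntimeEntry:
    site: str      # e.g. "T1_IT_CNAF"
    service: str   # hypothetical service label, e.g. "WebDAV" or "CE"
    start: int     # downtime start, epoch seconds
    end: int       # downtime end, epoch seconds

# Assumed set of labels that count as storage services.
STORAGE_SERVICES = {"WebDAV"}

def storage_downtimes(entries, site, now):
    """Return the entries for `site` that are active at `now` and affect storage."""
    return [
        e for e in entries
        if e.site == site
        and e.service in STORAGE_SERVICES
        and e.start <= now < e.end
    ]
```

With a filter like this, a compute-only maintenance would not switch off transfers for the site.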
We did switch it off in Rucio manually, but only after multiple messages had been exchanged over GGUS and email. It would be a relief to let it happen automatically.
Hi all, there's a second site affected by this: IIHE is failing 5k+ transfers hourly [1,2]. We need an automatic solution as soon as possible. We cannot operate this manually. What's the way forward here?
As a fast solution, using the downtime metric still looks better than what we're doing now, no?
[1] https://monit-grafana.cern.ch/goto/KeehAz9SR?orgId=11
[2] https://cmssst.web.cern.ch/sitereadiness/report.html#T2_BE_IIHE
Enhancement Description
I propose that we consider including downtime when evaluating the write availability of a site. (This can be done as a relatively quick fix while we wait for an independent metric for storage availability.)
Use Case
Incorporating down15min as a metric to update a site's read and write availability will prevent dumping doomed transfers on FTS and on the site itself.
Possible Solution
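As a rough illustration of the idea, the following minimal sketch derives read/write flags from downtime intervals for one evaluation bin. The 15-minute bin length is only inferred from the metric name, and the function names, the interval format, and the "read"/"write" keys are hypothetical; this is not the actual SST or Rucio interface.

```python
BIN_SECONDS = 15 * 60  # assumed from the metric name "down15min"

def bin_in_downtime(intervals, bin_start):
    """True if any (start, end) downtime interval overlaps the bin starting at bin_start."""
    bin_end = bin_start + BIN_SECONDS
    return any(start < bin_end and end > bin_start for start, end in intervals)

def desired_availability(intervals, bin_start):
    """Return hypothetical read/write flags for the site in the given bin."""
    down = bin_in_downtime(intervals, bin_start)
    # Both flags follow the downtime state; applying them to the storage
    # endpoint is deliberately left out of this sketch.
    return {"read": not down, "write": not down}
```

A real implementation would also need to avoid re-enabling a site that was switched off manually for other reasons.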
Question for @stlammel
Why is the status overridden for the site? https://cmssst.web.cern.ch/siteStatus/detail.html?site=T1_IT_CNAF
Should there not be a status change for a site in unscheduled downtime? As opposed to a scheduled downtime, it does suggest an error rather than a neutral state.
Related Issues
No response