dmwm / CRABServer


TW monitoring - differentiate submitfailed caused by crab system from user errors #7989

Open mapellidario opened 8 months ago

mapellidario commented 8 months ago

We receive an alert when the number of tasks that failed to be submitted in the last 2h is higher than a certain threshold.

However, tasks fail to be submitted both because of user errors (no quota at destination site, no dataset found in dbs, ...) and because of outages, either in the crab system itself (3000 running dagmans, crabserver rest could not contact oracle, ...) or in crab dependencies (rucio, ...). Since the alert counts both kinds of failure, using it to monitor crab system availability produces false positives, which for example made us slow to respond to a recent rucio outage [1].

In order to improve our trust in our alerts, we would like a new way of distinguishing tasks that are not submitted because of a system outage from the ones that are not submitted because of a problem with the user request.

Stefano proposed introducing a new status, submitrefused, which would have a different meaning from submitfailed:

I suggest that


[1] https://codimd.web.cern.ch/G8BrXmKZQPyfoizEUOCmVQ#

belforte commented 8 months ago

No need for a new SQL/API. Code to set task status is already in https://github.com/dmwm/CRABServer/blob/9cb5d51245477e0bfa68fd2fc14931a620f7df15/src/python/TaskWorker/Actions/Recurring/TapeRecallManager.py#L232-L246. As indicated in there, this is a good chance to move it to some common place, together with other methods which are currently duplicated in several places.

As to "code where those failures are triggered", we can surely do this a bit at a time, starting from the most obnoxious things (stageoutcheck failure and nonexistent dataset). The simplest way is probably to add a new Exception to be handled in https://github.com/dmwm/CRABServer/blob/9cb5d51245477e0bfa68fd2fc14931a620f7df15/src/python/TaskWorker/Worker.py#L97-L116
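
A rough sketch of what the "new Exception" approach could look like is below. It is not the actual Worker.py code: `SubmissionRefusedException`, `status_for_failure` and `run_action` are hypothetical names, `TaskWorkerException` is redefined here only as a stand-in so the snippet runs on its own, and the status strings just reuse the names from this issue.

```python
class TaskWorkerException(Exception):
    """Stand-in for the TaskWorker exception base class (assumption)."""


class SubmissionRefusedException(TaskWorkerException):
    """Hypothetical: raised when the user request itself cannot be satisfied,
    e.g. stageout check failure or a dataset that does not exist."""


def status_for_failure(exc):
    """Map an exception to the task status that should be recorded."""
    if isinstance(exc, SubmissionRefusedException):
        # Problem with the user request: should not count towards the alert.
        return "submitrefused"
    # Anything else is treated as a system or dependency failure.
    return "submitfailed"


def run_action(action, task):
    """Run one action on a task and return (status, message).

    Illustrative only: the real handler would also upload the status
    and the warning message to the REST.
    """
    try:
        action(task)
    except Exception as exc:  # broad on purpose, mirrors a top-level handler
        return status_for_failure(exc), str(exc)
    return "submitted", ""
```

With this split, the alert could count only submitfailed tasks, and each place that detects a user error can be moved to the new exception one at a time.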