Open mapellidario opened 8 months ago
No need for a new SQL/API. The code to set the task status is already in https://github.com/dmwm/CRABServer/blob/9cb5d51245477e0bfa68fd2fc14931a620f7df15/src/python/TaskWorker/Actions/Recurring/TapeRecallManager.py#L232-L246 As indicated there, this is a good chance to move it to some common place, together with other methods that are currently duplicated here and there.
As to "code where those failures are triggered": we can surely do this a bit at a time, starting from the most obnoxious cases (stageoutcheck failure and nonexistent dataset). The simplest way is probably to add a new Exception to be handled in https://github.com/dmwm/CRABServer/blob/9cb5d51245477e0bfa68fd2fc14931a620f7df15/src/python/TaskWorker/Worker.py#L97-L116
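A minimal sketch of the "new Exception" idea, not the actual Worker.py code. The names `SubmissionRefusedException`, `status_for_failure` and `process` are hypothetical, `TaskWorkerException` here is only a local stand-in for the existing generic failure, and the status strings are written as they appear in this issue. It only shows how a dedicated exception type could be mapped to a different task status at the point where failures are caught:

```python
class TaskWorkerException(Exception):
    """Local stand-in for the existing generic failure (assumption)."""


class SubmissionRefusedException(TaskWorkerException):
    """Hypothetical: the user request itself prevents submission
    (e.g. stageout check failure, dataset not found)."""


def status_for_failure(exc):
    """Pick the task status for an exception raised while handling a task."""
    if isinstance(exc, SubmissionRefusedException):
        return "submitrefused"  # problem in the user request
    return "submitfailed"       # crab itself or one of its dependencies


def process(work, task):
    """Run `work(task)`; return (status, message), status is None on success."""
    try:
        work(task)
    except Exception as exc:  # broad on purpose: every failure gets a status
        return status_for_failure(exc), str(exc)
    return None, ""
```

With this shape, each place that detects a user error only needs to raise the new exception; anything else keeps ending up in submitfailed, so the alert can keep counting that status alone.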
We receive an alert when the number of tasks that failed to be submitted in the last 2h is higher than a certain threshold.
However, tasks fail to be submitted both because of user errors (no quota at destination site, no dataset found in dbs, ...) and because of outages of the crab system (3000 running dagmans, crabserver rest could not contact oracle, ...) or of crab dependencies (rucio, ...). This means that the alert, which we use to monitor crab system availability, produces false positives, which for example made us slow to respond to a recent rucio outage [1].
In order to improve our trust in our alerts, we would like a new way of distinguishing tasks that are not submitted because of a system outage from the ones that are not submitted because of a problem with the user request.
Stefano proposed to introduce a new status, submitrefused, which would have a different meaning from submitfailed:
- submitfailed: "crab system encountered a generic problem with task submission", which would indicate problems with crab itself or its dependencies
- submitrefused: "crab could not submit a task because of a problem in the user request"

I suggest that submitrefused
[1] https://codimd.web.cern.ch/G8BrXmKZQPyfoizEUOCmVQ#