dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

MSOutput Should not silently tolerate exceptions during rule creation #11941

Open klannon opened 7 months ago

klannon commented 7 months ago

Impact of the bug: MSOutput

Describe the bug: A change in Rucio caused tape output rule creation to fail, and we missed this for 3 weeks, causing 7 PB of tape transfers to pile up.

How to reproduce it: Break Rucio

Expected behavior: If MSOutput fails to create a rule, it should trigger an alarm, at least if it fails for multiple cycles.

Additional context and error message: None
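
To make the expected behavior concrete, here is a minimal sketch of an alarm that fires once rule creation has failed for several cycles in a row. It is not taken from the WMCore code base: `CycleFailureAlarm`, `run_cycle`, `create_rules` and `send_alert` are hypothetical names chosen for illustration, and the alert callback stands in for whatever alerting mechanism the service actually uses.

```python
import logging
from typing import Callable

logger = logging.getLogger("msoutput_sketch")  # hypothetical logger name


class CycleFailureAlarm:
    """Send an alert once rule creation has failed for N consecutive cycles."""

    def __init__(self, threshold: int, send_alert: Callable[[str], None]):
        self.threshold = threshold            # failed cycles tolerated before alerting
        self.send_alert = send_alert          # hypothetical alert hook
        self.consecutive_failures = 0

    def run_cycle(self, create_rules: Callable[[], None]) -> bool:
        """Run one cycle; return True on success, False on failure."""
        try:
            create_rules()
        except Exception as exc:
            # Log the failure instead of silently tolerating it.
            self.consecutive_failures += 1
            logger.exception("Rule creation failed (%d in a row)",
                             self.consecutive_failures)
            if self.consecutive_failures >= self.threshold:
                self.send_alert(
                    f"Rule creation failed {self.consecutive_failures} "
                    f"cycles in a row: {exc}"
                )
            return False
        # A successful cycle resets the counter.
        self.consecutive_failures = 0
        return True
```

In this sketch the alert keeps firing on every failed cycle once the threshold is reached, so a long outage cannot go quiet again; whether that is desirable depends on how the receiving side deduplicates alerts.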

amaltaro commented 3 months ago

And I just stumbled upon this issue after resolving another 3-4 week outage of rule creation in MSOutput, addressed in this ticket: https://github.com/dmwm/WMCore/issues/12044

I am setting this ticket to Q4 so that we can at least implement an alarm and get notified when the whole MSOutputConsumer cycle is skipped.

mapellidario commented 3 weeks ago

After a private discussion with Alan, we decided that I can start working on this issue. So far, we have agreed on a two-pronged approach:

We considered having MSOutput send an alert if it fails to process a workflow N times, but that would require implementing new logic to keep track of past attempts. That is too much effort spent on new code when we can achieve the same result by exploiting the existing monitoring infrastructure.
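
For reference, the per-workflow bookkeeping mentioned above might look roughly like the sketch below. It only illustrates the kind of state that would have to be added; `WorkflowAttemptTracker`, `record` and `send_alert` are hypothetical names, not existing MSOutput code, and a real version would also need to persist the counters, since an in-memory dictionary is lost on every service restart.

```python
from collections import defaultdict
from typing import Callable, DefaultDict


class WorkflowAttemptTracker:
    """Count failed attempts per workflow and alert at the N-th failure."""

    def __init__(self, max_attempts: int, send_alert: Callable[[str], None]):
        self.max_attempts = max_attempts
        self.send_alert = send_alert  # hypothetical alert hook
        self.failures: DefaultDict[str, int] = defaultdict(int)

    def record(self, workflow: str, succeeded: bool) -> None:
        if succeeded:
            # Forget past failures once the workflow is processed.
            self.failures.pop(workflow, None)
            return
        self.failures[workflow] += 1
        # Alert exactly once, when the threshold is first reached.
        if self.failures[workflow] == self.max_attempts:
            self.send_alert(f"{workflow} failed {self.max_attempts} times")
```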