cmu-delphi / covidcast-indicators

Back end for producing indicators and loading them into the COVIDcast API.
https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html
MIT License

Make SirCAL alert when lag is below expected lower threshold #1918

Open melange396 opened 11 months ago

melange396 commented 11 months ago

Sir Complains-A-Lot (aka SirCAL) alerts when lag exceeds particular thresholds for what is "typical" for each indicator. The thresholds were set by hand based on what had been observed at some point in time, with an expectation that the reporting pattern will remain consistent. The alerts help identify problems, but for a number of potential legitimate reasons, the typical lag from data sources might change -- when there is a new longer "expected" lag, alerts will fire somewhat regularly, and then the team can investigate and increase the threshold as appropriate.

In fact, our indicators seem to have typical/expected lag "ranges", so in addition to a typical lag "max", we also see a typical lag "min". For example, nchs-mortality is a weekly signal that currently varies between 12 and 18 days of lag (12 on the day that a new update is released, and up to 18 over the course of the rest of the week). For a number of potential legitimate reasons, a data source might be able to decrease its typical lag, such as improvements in its reporting or processing pipelines. If we alert when the lag we see is below a minimum bound, we can detect and respond to this. If the source in the example above were able to shave a day off its cycle, the new range would be 11 to 17 days of lag. We could then lower the max lag threshold to get a tighter bound, but only if the change is brought to our attention, hence the need for this new alert condition.

TL;DR: we should alert when lag is outside of a range (instead of just when it exceeds an upper bound) so that we can identify changes in reporting patterns and adjust thresholds appropriately.
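As a rough illustration of the range check proposed here, the following is a minimal sketch and not the existing SirCAL code. The names `LagThreshold`, `min_age`, and `check_lag` are hypothetical; only the "max_age" concept and the 12-to-18-day nchs-mortality range come from this issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LagThreshold:
    """Hypothetical per-indicator lag bounds, in days."""
    min_age: int  # assumed new setting: smallest lag we expect to see
    max_age: int  # mirrors the existing "max_age" idea: largest lag we tolerate


def check_lag(source: str, observed_lag: int, bounds: LagThreshold) -> Optional[str]:
    """Return an alert message if the lag is outside [min_age, max_age], else None."""
    if observed_lag > bounds.max_age:
        return (f"{source}: lag of {observed_lag} days exceeds "
                f"max_age of {bounds.max_age}")
    if observed_lag < bounds.min_age:
        return (f"{source}: lag of {observed_lag} days is below "
                f"min_age of {bounds.min_age}; thresholds may need tightening")
    return None


if __name__ == "__main__":
    # Range taken from the nchs-mortality example in this issue.
    bounds = LagThreshold(min_age=12, max_age=18)
    for lag in (11, 12, 18, 19):
        print(lag, check_lag("nchs-mortality", lag, bounds))
```

With these bounds, a lag of 11 days would trigger the new "below minimum" alert, which is the signal that the source has sped up and that both thresholds could be lowered by a day.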

"max_age" detection and alert generation code: https://github.com/cmu-delphi/covidcast-indicators/blob/6c4c5b98be5e7b1c25293335162cd80b3e9e21f1/sir_complainsalot/delphi_sir_complainsalot/check_source.py#L98-L106

Threshold specification(s): https://github.com/cmu-delphi/covidcast-indicators/blob/6c4c5b98be5e7b1c25293335162cd80b3e9e21f1/ansible/templates/sir_complainsalot-params-prod.json.j2#L45-L46 https://github.com/cmu-delphi/covidcast-indicators/blob/6c4c5b98be5e7b1c25293335162cd80b3e9e21f1/sir_complainsalot/params.json.template#L45-L46

rlunde commented 10 months ago

Do we save the observed lag for a signal in a database? It might be interesting to be able to use data analysis on collected data to see a measure of variability (for example).

melange396 commented 10 months ago

Lag for all datapoints of our signals can be pulled from our database or API without too much trouble... But then doing an analysis on that across different dimensions might be worthy of a publication, and thus outside the scope of this issue! We have "max lag" in some of our internal dashboards and the variability is pretty boring; we mostly see sawtooth patterns where the lag increases by 1 every day when there are no updates, and then it jumps back down when a data drop happens. I can point it out for you in Elastic/Kibana some time, but I'm sure you've actually already seen it.
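To make the sawtooth shape concrete, here is a small sketch that simulates it rather than reading real data. The function names and parameters are hypothetical; the defaults (a weekly release with 12 days of lag on release day) are borrowed from the nchs-mortality example earlier in this issue.

```python
from typing import List, Tuple


def simulate_lag(days: int, release_every: int = 7, lag_at_release: int = 12) -> List[int]:
    """Simulate daily lag for a source that updates every `release_every` days.

    Lag starts at `lag_at_release` on a release day and grows by 1 each day
    until the next release resets it, giving a sawtooth pattern.
    """
    return [lag_at_release + (day % release_every) for day in range(days)]


def observed_range(lags: List[int]) -> Tuple[int, int]:
    """Return the (min, max) lag seen, i.e. candidate lower and upper thresholds."""
    return min(lags), max(lags)


if __name__ == "__main__":
    lags = simulate_lag(14)
    print(lags)
    print(observed_range(lags))
```

The minimum and maximum of such a series are the two bounds that a range-based alert would need for each indicator.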