Perform RCA immediately after anomaly detection

AICoE / prometheus-anomaly-detector

A newer more updated version of the prometheus anomaly detector (https://github.com/AICoE/prometheus-anomaly-detector-legacy)

GNU General Public License v3.0

596 stars 150 forks source link

Perform RCA immediately after anomaly detection #165

Closed saswat0 closed 2 years ago

saswat0 commented 3 years ago

Currently the model monitors the behaviour of application/service level metrics and flags them as anomalous. The issue resolution is still left up to the engineers and time taken for debugging might increase the outage duration

Should we add RCA of the anomaly on the fly as soon as it is detected? This will probably help in pinpointing the cause and allow faster repairs

@4n4nd Suggestions?

4n4nd commented 3 years ago

@saswat0 could you please give us more information on how the implementation for this would work?

cc @chauhankaranraj @hemajv

saswat0 commented 3 years ago

An anomaly is usually triggered when either the application, its downstream services (dependent apps) or the runtime environment crashes. On the event of an anomaly, we could check for inconsistency in the metrics of the affected application, all other downstream services or metrics from the deployed cluster/node and figure out which of them was responsible for the outage

Say for example, we have a micro-service A that handles date-time conversion. This being an inherent part for an application to function, is used by services B & C, which call A's endpoint frequently. Now if B has to suddenly handle a huge spike of requests and brings down A in the process, all of A, B & C would be affected. In such cases, figuring out the point of origin becomes difficult. The proposed implementation should aim at mitigating this delay. Once there's an exact match/pattern between two faulty services, we can easily figure out which is affected by which and resolve it

Hope it was clear 😓