NetManAIOps / CIRCA

Causal Inference-based Root Cause Analysis

BSD 3-Clause "New" or "Revised" License


Service level indicator question #6

Closed (nghson closed this issue 1 year ago)

nghson commented 1 year ago

Hi, as I understand it, when an SLI metric is violated, we try to find the root cause of that violation. But in my case, CIRCA gives scores to non-parent nodes of the SLI, which are not responsible for the change in it. Moreover, changing the SLI still gives the same result and scores. Could you please clarify this?

limjcst commented 1 year ago

It sounds like a known "feature".

CIRCA does score every node. The underlying philosophy is that we treat the root causes as a fault's observed projection in the space defined by monitoring metrics. Because the fault is an unobserved confounder of root cause metrics, there may be a monitoring metric that is not an ancestor of the SLI but indicates the fault directly, as if it were the fault's "side effect". A node's score is calculated based on the node itself and its parents. Hence, changes in any other node will not influence that score.
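To make the "node itself and its parents" point concrete, here is a minimal sketch, not CIRCA's actual implementation. It assumes a toy graph load -> cpu -> latency, handles at most one parent per node for brevity, and uses a made-up helper `local_score` that fits the node against its parent on pre-fault data and reports the largest standardized residual after the fault.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x (single predictor)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def local_score(node, parent, normal, abnormal):
    """Score `node` from its own series and its parent's series only.

    `normal` / `abnormal` map metric names to samples taken before /
    after the fault. `parent` may be None for a root node.
    Hypothetical helper for illustration; not part of the circa package.
    """
    if parent is None:
        a, b = statistics.fmean(normal[node]), 0.0
        x_normal = [0.0] * len(normal[node])
        x_abnormal = [0.0] * len(abnormal[node])
    else:
        a, b = fit_line(normal[parent], normal[node])
        x_normal, x_abnormal = normal[parent], abnormal[parent]
    train_resid = [y - (a + b * x) for x, y in zip(x_normal, normal[node])]
    sigma = statistics.pstdev(train_resid) or 1e-9  # guard against zero spread
    test_resid = [y - (a + b * x) for x, y in zip(x_abnormal, abnormal[node])]
    return max(abs(r) / sigma for r in test_resid)


# Synthetic data (an assumption): the fault shifts cpu, latency just follows cpu.
rng = random.Random(0)
n, fault_at = 200, 150
load = [rng.gauss(100, 5) for _ in range(n)]
cpu = [0.5 * v + rng.gauss(0, 1) for v in load]
cpu = [v + 30 if i >= fault_at else v for i, v in enumerate(cpu)]
latency = [2.0 * v + rng.gauss(0, 1) for v in cpu]

series = {"load": load, "cpu": cpu, "latency": latency}
normal = {k: v[:fault_at] for k, v in series.items()}
abnormal = {k: v[fault_at:] for k, v in series.items()}

cpu_score = local_score("cpu", "load", normal, abnormal)
latency_score = local_score("latency", "cpu", normal, abnormal)
print(f"cpu: {cpu_score:.1f}, latency: {latency_score:.1f}")
```

In this toy setup cpu stands out because its parent cannot explain the shift, while latency stays low because cpu explains it. Nothing in `local_score("cpu", ...)` reads latency, so picking a different SLI leaves that score unchanged.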

If you are a researcher, this can be an opportunity for you to extend the existing literature. One necessary step may be answering what kind of root causes operators need. If you are looking for a satisfying RCA tool, a simple solution is to prune the graph and keep only the SLI's ancestors.
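The pruning idea can be sketched in plain Python before the graph is handed to any scorer. This is only an illustration: the graph is assumed to be a dict mapping each node to its direct parents, and `ancestors` and `prune_to_sli_ancestors` are hypothetical helpers, not functions from the circa package.

```python
def ancestors(parents, target):
    """Return every node that has a directed path to `target`.

    `parents` maps each node to the list of its direct parents.
    """
    seen = set()
    stack = list(parents.get(target, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def prune_to_sli_ancestors(parents, sli):
    """Keep only the SLI and its ancestors; every other node is dropped."""
    keep = ancestors(parents, sli) | {sli}
    # A kept node's parents are themselves ancestors, so edges stay intact.
    return {node: list(parents.get(node, [])) for node in keep}


# Hypothetical graph: load -> cpu -> latency, plus a side effect cpu -> fan_speed.
parents = {
    "load": [],
    "cpu": ["load"],
    "latency": ["cpu"],
    "fan_speed": ["cpu"],
}
pruned = prune_to_sli_ancestors(parents, "latency")
print(sorted(pruned))  # fan_speed is no longer in the graph
```

Whether to keep the SLI node itself in the pruned graph is a design choice; this sketch keeps it so the graph still contains the metric being explained.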

nghson commented 1 year ago

Thanks for the answer. In the sample code

    data = CaseData(
        # circa.model.data_loader.MemoryDataLoader is derived from
        # circa.model.data_loader.DataLoader, which manages data with configurations
        data_loader=MemoryDataLoader(mock_data_with_time),
        sli=latency,
        detect_time=240,
        lookup_window=4,
        detect_window=2,
    )

if changing sli does not influence the score, then what is the purpose of the param sli?

limjcst commented 1 year ago

I am afraid that I have abused the word CIRCA.

As an RCA algorithm, CIRCA (specifically, circa.alg.ci.RHTScorer combined with circa.alg.ci.DAScorer) does not rely on sli.
As a package, the param sli is provided for other RCA algorithms. For example,
- MicroHECL will traverse the call graph, starting from the initial anomalous service, where the anomalous metric of the initial anomalous service will be the sli.
- circa.alg.dfs.MicroHECLScorer is our corresponding implementation.
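To show why a traversal-based method needs sli, here is a heavily simplified sketch of walking upstream from a starting metric. It is neither MicroHECL as published nor circa.alg.dfs.MicroHECLScorer: the graph layout, the `anomalous` set, and the helper `trace_upstream` are all assumptions made for illustration.

```python
def trace_upstream(parents, anomalous, sli):
    """Walk from `sli` towards its parents, following anomalous nodes only.

    Returns the nodes where the walk stops, i.e. anomalous nodes that
    have no anomalous parent. Hypothetical helper for illustration.
    """
    candidates, seen, stack = [], {sli}, [sli]
    while stack:
        node = stack.pop()
        hot_parents = [p for p in parents.get(node, []) if p in anomalous]
        if not hot_parents:
            candidates.append(node)  # nothing upstream explains this node
        for p in hot_parents:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return candidates


# Hypothetical graph with two possible SLIs.
parents = {
    "load": [],
    "cpu": ["load"],
    "latency": ["cpu"],
    "db_conn": [],
    "error_rate": ["db_conn"],
}
anomalous = {"latency", "cpu", "error_rate", "db_conn"}

from_latency = trace_upstream(parents, anomalous, "latency")
from_errors = trace_upstream(parents, anomalous, "error_rate")
print(from_latency, from_errors)
```

Unlike the node-local scoring discussed earlier in this thread, the result here depends on the starting point, which is the role sli plays for algorithms of this kind.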

nghson commented 1 year ago

Ok, many thanks :D