NetManAIOps / CIRCA

Causal Inference-based Root Cause Analysis
BSD 3-Clause "New" or "Revised" License

Service level indicator question #6

Closed nghson closed 1 year ago

nghson commented 1 year ago

Hi, as I understand it, when an SLI metric is violated, we try to find the root cause of that violation. But in my case, CIRCA assigns scores to nodes that are not parents of the SLI and are not responsible for the change in it. Moreover, changing the SLI still gives the same result and scores. Could you please clarify this?

limjcst commented 1 year ago

It sounds like a known "feature".

CIRCA does score every node. The underlying philosophy is that we treat the root causes as a fault's observed projection onto the space defined by the monitoring metrics. Because the fault is an unobserved confounder of the root cause metrics, there may be a monitoring metric that is not an ancestor of the SLI but still indicates the fault directly, as if it were the fault's "side effect". A node's score is calculated from the node itself and its parents only. Hence, changes in any other node do not influence that score.
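To make the "node plus its parents" point concrete, here is a minimal sketch. It is not CIRCA's implementation: the `node_score` helper, the single-parent regression, and the metric names and values are all assumptions made up for illustration. Each node is regressed on its parent over a normal window, and the score is the largest residual in the faulty window, scaled by the normal residual spread. Note that the SLI never appears as an argument.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b

def node_score(node, parent, normal, faulty):
    """Hypothetical score: largest absolute residual of `node` in the faulty
    window, in units of the residual spread of the normal window.
    `parent` is a single parent name, or None for a node without parents."""
    if parent is None:
        # Without a parent, the expectation is just the normal-window mean.
        a, b = mean(normal[node]), 0.0
        xs_normal = [0.0] * len(normal[node])
        xs_faulty = [0.0] * len(faulty[node])
    else:
        a, b = fit_line(normal[parent], normal[node])
        xs_normal, xs_faulty = normal[parent], faulty[parent]
    res_normal = [y - (a + b * x) for x, y in zip(xs_normal, normal[node])]
    res_faulty = [y - (a + b * x) for x, y in zip(xs_faulty, faulty[node])]
    sigma = pstdev(res_normal) or 1e-9  # guard against a perfectly exact fit
    return max(abs(r) for r in res_faulty) / sigma

# Made-up data: error_log_rate is not an ancestor of latency.
normal = {
    "traffic": [10, 12, 11, 13, 12, 10],
    "latency": [20.5, 23.5, 22.5, 25.5, 24.5, 19.5],
    "error_log_rate": [1, 2, 1, 2, 1, 2],
}
faulty = {
    "traffic": [11, 12],
    "latency": [40.0, 45.0],
    "error_log_rate": [9, 10],
}
parent_of = {"traffic": None, "latency": "traffic", "error_log_rate": "traffic"}

scores = {n: node_score(n, p, normal, faulty) for n, p in parent_of.items()}
```

In this toy setup both `latency` and `error_log_rate` deviate strongly from what `traffic` predicts, so both get high scores, even though `error_log_rate` is not an ancestor of `latency`. Picking a different SLI would not change any entry of `scores`, because no score depends on it.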

If you are a researcher, this can be an opportunity for you to extend the existing literature. One necessary step may be answering what kind of root causes operators actually need. If you are looking for a satisfactory RCA tool, a simple solution is to prune the graph and keep only the SLI's ancestors.
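The pruning step could be sketched as follows. This is a plain-Python illustration, not part of CIRCA's API: the graph is assumed to be a dict mapping each node to the list of its parents, and `ancestors`, `prune_to_sli`, and the metric names are hypothetical. Keeping the SLI itself in the pruned graph is also an assumption.

```python
from collections import deque

def ancestors(parents, target):
    """Return every node with a directed path to `target`.
    `parents` maps each node to an iterable of its parent nodes."""
    seen = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for p in parents.get(node, ()):
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return seen

def prune_to_sli(parents, sli):
    """Keep only the SLI and its ancestors. Every parent of a kept node is
    itself an ancestor of the SLI, so the parent lists need no filtering."""
    keep = ancestors(parents, sli) | {sli}
    return {n: list(parents.get(n, ())) for n in keep}

# Made-up graph: error_log_rate shares a parent with latency but is not
# an ancestor of it.
graph = {
    "traffic": [],
    "db_time": ["traffic"],
    "latency": ["db_time", "traffic"],
    "error_log_rate": ["traffic"],
}
pruned = prune_to_sli(graph, "latency")
```

With this graph, `pruned` contains `traffic`, `db_time`, and `latency`, and drops `error_log_rate`. Unlike the scores themselves, the pruned graph does depend on which SLI is chosen.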

nghson commented 1 year ago

Thanks for the answer. In the sample code

    data = CaseData(
        # circa.model.data_loader.MemoryDataLoader is derived from
        # circa.model.data_loader.DataLoader, which manages data with configurations
        data_loader=MemoryDataLoader(mock_data_with_time),
        sli=latency,
        detect_time=240,
        lookup_window=4,
        detect_window=2,
    )

if changing sli does not influence the score then what is the purpose of the param sli?

limjcst commented 1 year ago

I am afraid that I have abused the word CIRCA.

nghson commented 1 year ago

Ok, many thanks :D