NetManAIOps / CIRCA

Causal Inference-based Root Cause Analysis

BSD 3-Clause "New" or "Revised" License

Time related Question #2

Closed nsankar closed 1 year ago

nsankar commented 1 year ago

@limjcst @lizeyan Thanks for developing this interesting approach and the package. I have the following questions:

1. In the example code below, taken from CIRCA, six samples are given each for latency, traffic, and saturation. What sampling interval is assumed per sample? Is it 60 seconds (1 minute)?
2. The detect time is given as detect_time=240. Assuming 60 seconds per sample, does this mean that the 4th sample in the example data (latency=9, traffic=105, saturation=6) corresponds to the fault data, i.e. 60 * 4 = 240?
3. What do the lookup_window and detect_window settings signify?

Kindly clarify and explain. Thanks in advance.

```python
mock_data = {
    latency: (10, 12, 11, 9, 100, 90),
    traffic: (100, 110, 90, 105, 200, 150),
    saturation: (5, 4, 5, 6, 90, 85),
}
mock_data_with_time: Dict[str, Dict[str, Sequence[Tuple[float, float]]]] = defaultdict(
    dict
)
for node, values in mock_data.items():
    mock_data_with_time[node.entity][node.metric] = [
        (index * 60, value) for index, value in enumerate(values)
    ]
data = CaseData(
    # circa.model.data_loader.MemoryDataLoader is derived from
    # circa.model.data_loader.DataLoader, which manages data with configurations
    data_loader=MemoryDataLoader(mock_data_with_time),
    sli=latency,
    detect_time=240,
    lookup_window=4,
    detect_window=2,
)
```

limjcst commented 1 year ago

Thanks for your interest.

It is fine to take the sampling frequency as once per minute.

In the toy example, detect_time=240 refers to the 5th sample (e.g., saturation=90), because the timestamps of each time series are generated as [0, 60, 120, 180, 240, 300].
- detect_time represents the time at which a fault is detected. Note that detect_time may not point to the fault data, e.g., when an anomaly detection method fails to notice the fault in time. Ideally, however, detect_time is when the fault actually arises.
- CaseData takes a parameter, interval, with a default value of 1 minute, i.e., the sampling interval. Both lookup_window and detect_window take interval as their unit.
- After an RCA algorithm is triggered, some of the latest data are taken as faulty; the size of this data is defined by detect_window.
- An RCA algorithm may take some data before detect_time for reference; the size of this data is defined by lookup_window.
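
The timestamp arithmetic above can be checked with a small standalone sketch. It uses only the Python standard library and does not call CIRCA; the variable names are illustrative, and the conversion of the windows to seconds assumes only what is stated above, namely that both windows are counted in units of interval.

```python
from datetime import timedelta

# Values from the toy example; timestamps are generated as index * 60.
latency = (10, 12, 11, 9, 100, 90)
timestamps = [index * 60 for index in range(len(latency))]

detect_time = 240
interval = timedelta(minutes=1)  # default sampling interval mentioned above
lookup_window = 4
detect_window = 2

# Position of detect_time in the series (0-based index 4, i.e. the 5th sample).
detect_index = timestamps.index(detect_time)
print(detect_index + 1, latency[detect_index])  # 5 100

# Both windows are counted in units of `interval`.
lookup_seconds = lookup_window * interval.total_seconds()
detect_seconds = detect_window * interval.total_seconds()
print(lookup_seconds, detect_seconds)  # 240.0 120.0
```

With a different interval, only the conversion to seconds changes; the window sizes themselves stay expressed as counts of samples.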

nsankar commented 1 year ago

@limjcst Thank you for clarifying.

When I ran the toy example, I got the following output. How should the results be interpreted? Is it correct to say that the metric with the highest score is the root cause? For instance, the score for Saturation is the highest here, i.e. 383.251, so saturation is the root cause of the problem in question. Is that correct?

Also, technically, what is the difference between the score and the z-score? Are they the same?

```
root@9d6d53736e6c:/app# python3 testRca.py
==> [(Node('DB', 'Saturation'), {'score': 383.2518754031084, 'info': {'z-score': 383.2518754031084, 'Confidence': 1.0}}),
 (Node('DB', 'Latency'), {'score': 355.00000000000006, 'info': {'z-score': 355.00000000000006, 'Confidence': 1.0}}),
 (Node('DB', 'Traffic'), {'score': 12.24744871391589, 'info': {'z-score': 12.24744871391589, 'Confidence': 1.0}})]
```

limjcst commented 1 year ago

The output produced by circa.alg.common.Model.analyze is a ranked list of metrics. An algorithm places a metric first if it considers that metric more likely to be the root cause than any other. The metric with the highest score may not be the actual root cause, as no algorithm is perfect.
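
Since the result is a ranking rather than a verdict, one way to consume it is to keep the top few candidates for inspection. The sketch below is standalone: it copies the scores from the output above into plain tuples instead of CIRCA Node objects, and top_k is a hypothetical helper, not part of CIRCA.

```python
from typing import Dict, List, Tuple

# Scores copied from the output above; (entity, metric) tuples stand in for Node.
ranked: List[Tuple[Tuple[str, str], Dict[str, float]]] = [
    (("DB", "Saturation"), {"score": 383.2518754031084}),
    (("DB", "Latency"), {"score": 355.00000000000006}),
    (("DB", "Traffic"), {"score": 12.24744871391589}),
]


def top_k(results, k):
    """Return the k highest-scoring candidates (hypothetical helper)."""
    ordered = sorted(results, key=lambda item: item[1]["score"], reverse=True)
    return [node for node, _ in ordered[:k]]


candidates = top_k(ranked, k=2)
print(candidates)  # [('DB', 'Saturation'), ('DB', 'Latency')]
```

Reviewing more than one candidate reflects the caveat above that the first-ranked metric is a suggestion, not a guarantee.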

The info field provides more detail than score, for debugging or other unplanned uses. As CIRCA supports stacking multiple Scorers, score is the output of the last Scorer, while info can store the outputs of previous Scorers. For now, there is no documentation of what each Scorer records in info. Please read the code if you are interested.
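
For readers unfamiliar with the term, a z-score measures how far a value lies from a reference mean, in units of the reference standard deviation. The sketch below is a generic illustration, not CIRCA's implementation. It assumes, purely for illustration, that the first three traffic samples serve as the reference and that the population standard deviation is used; under that assumption the value 200 yields roughly 12.247, which matches the Traffic entry in the output above. How CIRCA actually splits the data, and how the other two scores are computed, is not covered here.

```python
from statistics import mean, pstdev


def z_score(value, reference):
    """Distance of value from the reference mean, in reference standard deviations."""
    return (value - mean(reference)) / pstdev(reference)


reference = (100, 110, 90)  # assumed reference samples (illustrative choice)
traffic_z = z_score(200, reference)
print(round(traffic_z, 3))  # 12.247
```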

nsankar commented 1 year ago

@limjcst Appreciate your guidance.