azamikram / rcd

Root Cause Discovery: Root Cause Analysis of Failures in Microservices through Causal Discovery
MIT License

sock-shop data #3

Closed: zmlin1998 closed 4 months ago

zmlin1998 commented 1 year ago

Hi,

I would like to ask about the sock-shop-data folder. Does the name of each subfolder indicate the root cause?

Thank you!

azamikram commented 1 year ago

Yes! The folder names follow the pattern [failed-service]-[failure], where [failed-service] is the name of the service and [failure] is the type of failure (either memory leak or CPU hog).
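As a rough illustration of that naming scheme, a folder name can be split on its last hyphen. This is only a sketch: the helper `parse_folder_name` and the example names used with it are hypothetical, and it assumes the failure suffix itself contains no hyphen. The exact suffixes should be checked against the folders in the repository.

```python
def parse_folder_name(name: str) -> tuple[str, str]:
    """Split '[failed-service]-[failure]' into (failed_service, failure).

    Assumption: the failure part has no hyphen, so the LAST hyphen is the
    separator. This keeps hyphenated service names in one piece.
    """
    service, sep, failure = name.rpartition("-")
    if not sep or not service or not failure:
        raise ValueError(f"unexpected folder name: {name!r}")
    return service, failure
```

Splitting from the right rather than the left matters only if a service name is itself hyphenated; for a name with a single hyphen both approaches agree.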

zmlin1998 commented 1 year ago

In the paper, there is an F-node that indicates the trigger point, and the descendants of the F-node indicate the root cause.

Is the failed-service the root cause or the trigger point? Thanks!

zmlin1998 commented 1 year ago

For example, if the CPU hog is the root cause, what is the F-node? Is the F-node also the CPU hog?

Thanks!

azamikram commented 1 year ago

I'm not sure what you mean by trigger point, but you are right that the immediate descendants of the F-node are the potential root causes.

The F-node is just a dummy node used to find the interventional target (the root cause), so it's not clear what you mean by asking what the F-node is. If the root cause of the failure is a change in the CPU utilization of microservice i, then ideally the F-node should have an outgoing edge to the CPU of microservice i.
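To make the "dummy node" idea concrete, here is a minimal sketch of one way such a node can be represented: a binary column that is 0 for samples collected before the failure and 1 for samples collected after it. The helper `with_f_node`, the two synthetic metrics, and their values are all assumptions made for illustration; this is not taken from the RCD code.

```python
import numpy as np

def with_f_node(normal: np.ndarray, anomalous: np.ndarray) -> np.ndarray:
    """Stack both datasets and append a binary F-node column (0 = normal, 1 = anomalous)."""
    f = np.concatenate([np.zeros(len(normal)), np.ones(len(anomalous))])
    return np.column_stack([np.vstack([normal, anomalous]), f])

rng = np.random.default_rng(0)
# Hypothetical metrics: column 0 = CPU of microservice i, column 1 = an unaffected metric.
normal = rng.normal(loc=[0.3, 0.5], scale=0.05, size=(200, 2))
anomalous = rng.normal(loc=[0.9, 0.5], scale=0.05, size=(200, 2))  # only CPU shifts

data = with_f_node(normal, anomalous)
# Correlation of each metric with the F-node column (the last column).
corr_with_f = [float(np.corrcoef(data[:, i], data[:, -1])[0, 1]) for i in range(2)]
```

In this toy data only the CPU column depends on the F-node column, which is the situation in which one would expect an edge from the F-node to the CPU of microservice i.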

zmlin1998 commented 1 year ago

By trigger point I mean this: if there is an anomaly in a metric, that anomalous metric is the so-called trigger point.

For example,

If we see that the metric "logging system is alive" is anomalous, this metric is a trigger point. Then we want to find what makes the system malfunction. If the problem is "CPU utilization", we can say that "CPU utilization" is the root cause.

Thanks for your explanation. I now understand what the F-node is here.

azamikram commented 1 year ago

It is still not clear what the trigger point is. You said that if there is an anomaly in a metric, then that metric is the trigger point. If we assume that the anomaly is only on the root cause node, then the actual root cause is the trigger point. But from your example, it seems that you already have an anomaly detection system in place that detects changes in metrics. When it detects a change in one of the metrics, it calls that metric the trigger point. It's possible that the detected metric (trigger point) is not the root cause but just a node in the anomaly propagation chain. If this is the case, then the F-node will not point to the trigger point but to the actual root cause. So, following your example, RCD will output "CPU Utilization" as the root cause of the failure.

On a side note, it is not clear what kind of metric "logging system is alive" is. We use a general set of numerical metrics (such as CPU utilization, memory usage, or latency), but I believe an application-specific metric can also be used as long as it provides some insight into the underlying system.
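The difference between a trigger point and the root cause can be seen in a small simulation. The sketch below assumes a two-node chain in which a CPU shift propagates to latency, and it uses a linear partial correlation as a stand-in for a conditional independence test; that is not necessarily the test RCD itself uses, and all names and numbers are made up for illustration.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after linearly regressing z out of both."""
    Z = np.column_stack([np.ones_like(z), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])

rng = np.random.default_rng(0)
n = 5000

def simulate(cpu_mean):
    # Assumed propagation chain: cpu -> latency.
    cpu = rng.normal(cpu_mean, 1.0, n)
    latency = 2.0 * cpu + rng.normal(0.0, 1.0, n)
    return cpu, latency

cpu_n, lat_n = simulate(0.0)  # normal period
cpu_a, lat_a = simulate(3.0)  # failure period: only the CPU mechanism changes

f = np.concatenate([np.zeros(n), np.ones(n)])  # F-node: 0 = normal, 1 = anomalous
cpu = np.concatenate([cpu_n, cpu_a])
lat = np.concatenate([lat_n, lat_a])

lat_marginal = float(np.corrcoef(f, lat)[0, 1])  # latency looks anomalous on its own
lat_given_cpu = partial_corr(f, lat, cpu)        # close to zero: CPU explains it away
cpu_given_lat = partial_corr(f, cpu, lat)        # stays clearly nonzero
```

Here latency plays the role of the trigger point: it changes with the failure, but once CPU is conditioned on, it no longer depends on the F-node. CPU remains dependent on the F-node even after conditioning on latency, which is why it would be reported as the root cause.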

zmlin1998 commented 1 year ago

Yes, there is already an anomaly detection system so that we have a trigger point.

Thank you very much for patiently explaining.