Structural graph question

nghson commented 1 year ago

Hi, can the structural graph construction phase be applied to an arbitrary dataset, e.g., a simple dataset with metrics s1, s2, s3, where s1 -> s3 and s2 -> s3? I looked at how the simulation dataset is generated, and the graphs seem to be generated randomly, except for the constraint that the first node does not have any children. So:

Could you please guide me through using CIRCA for an arbitrary dataset, if possible?
Why must the SLI not have any children?
In your example code of basic usage, sli=latency. In the paper, latency has errors as its child, but the example code does not, so I'm a bit confused.

limjcst commented 1 year ago

Graph for a new dataset

If you already have the graph in mind, just use StaticGraphFactory with MemoryGraph. MemoryGraph takes a networkx.DiGraph as its input. Please ensure that each node of the networkx.DiGraph is an instance of circa.model.graph.Node. For example,

import networkx as nx

from circa.graph.common import StaticGraphFactory
from circa.model.graph import MemoryGraph
from circa.model.graph import Node

entity = "entity"

g = nx.DiGraph([("s1", "s3"), ("s2", "s3")])
g = nx.DiGraph(
    [
        (Node(entity=entity, metric=src), Node(entity=entity, metric=dest))
        for src, dest in g.edges
    ]
)
graph = MemoryGraph(g)
print(graph.parents(Node(entity=entity, metric="s3")))

graph_factory = StaticGraphFactory(graph)
# Assemble an algorithm with graph_factory

Why the SLI must not have any children

A real-world scenario does not need such a requirement. It is not a restriction of CIRCA, either.

It was difficult for me to decide whether CaseData should take one SLI or a set of SLIs as its input. Finally, I chose the former. The main reason is that one SLO violation is enough to trigger RCA. The SLI related to the SLO violation is mandatory as the entry point for RCA, while the other SLIs may not be anomalous. Hence, a graph generated in the simulation study retains only the SLI related to the SLO violation and its ancestors.

The second (but much weaker) reason is simplicity. Such a setting, together with the linear data generation model (i.e., VAR), makes it easy to judge what should happen.
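
The pruning described above (keeping only the SLI and its ancestors, so that the SLI ends up without children) can be illustrated with a minimal sketch. This is not CIRCA's simulation code; the function name keep_sli_and_ancestors and the extra metrics s4 and s5 are hypothetical, and plain Python is used so that the snippet runs without dependencies. With networkx, networkx.ancestors gives the same node set.

```python
from collections import defaultdict, deque


def keep_sli_and_ancestors(edges, sli):
    """Return the edges of the subgraph induced by `sli` and its ancestors.

    `edges` is a list of (cause, effect) pairs. Hypothetical helper for
    illustration only; it is not part of CIRCA.
    """
    parents = defaultdict(set)
    for cause, effect in edges:
        parents[effect].add(cause)

    # Breadth-first search from the SLI along reversed edges
    kept = {sli}
    queue = deque([sli])
    while queue:
        node = queue.popleft()
        for parent in parents[node]:
            if parent not in kept:
                kept.add(parent)
                queue.append(parent)

    # Keep an edge only if both of its ends survive the pruning
    return [(c, e) for c, e in edges if c in kept and e in kept]


# s4 and s5 are made-up metrics that are not ancestors of the SLI s3
edges = [("s1", "s3"), ("s2", "s3"), ("s3", "s4"), ("s5", "s4")]
pruned = keep_sli_and_ancestors(edges, sli="s3")
print(pruned)
```

After pruning, s3 has no outgoing edge, which is why the SLI never has children in the simulated graphs.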

Why latency does not have errors as its child in the example

The structural graph treats meta metrics and monitoring metrics differently. In general, we can take errors as a child of latency, both of which are meta metrics. On the other hand, there may be no monitoring metric that corresponds to errors. As for the example, latency is a monitoring metric.

I would rather take the four meta metrics listed in the paper as data for the structural graph, not components. To construct a structural graph for your own dataset with any kind of metrics, start with tests/alg/sgraph/. This folder contains templates with causal assumptions and mappings between meta metrics and monitoring metrics. For example,

from circa.graph.structural import StructuralGraph
from circa.graph.structural import StructuralGraphFactory

graph = StructuralGraph(filename="tests/alg/sgraph/index.yml")
graph_factory = StructuralGraphFactory(graph)

nghson commented 1 year ago

Thank you for the detailed reply. How should the monitoring metrics be mapped to the meta metrics if they are arbitrary? How were the metrics in the simulation mapped?

limjcst commented 1 year ago

How should the monitoring metrics be mapped to the meta metrics

For an arbitrary dataset, the first step in structural graph construction is to identify the meta metrics. This process is related to the Grounded Theory method. Read "The Creation of Theory: A Recent Application of the Grounded Theory Method" for more details. After that, the mapping is straightforward: it follows from the definition of each meta metric.

As for the online service systems stated in the title of our paper, we use traffic, saturation, latency, and errors (combined with each component, e.g., a database instance) as four kinds of meta metrics. According to their definition, the common monitoring metric average response time should be mapped to latency, while queries per second will be mapped to traffic.
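
As a rough illustration of the mapping just described, the sketch below groups monitoring metrics by meta metric. It only shows the idea; it does not reproduce the YAML format of the templates under tests/alg/sgraph/, and the names MONITORING_TO_META and group_by_meta, the component "db", and the metric "unmapped metric" are all hypothetical.

```python
# The four meta metrics named above
META_METRICS = ("traffic", "saturation", "latency", "errors")

# Hypothetical lookup table, filled in from the definition of each meta metric
MONITORING_TO_META = {
    "average response time": "latency",
    "queries per second": "traffic",
}


def group_by_meta(component, monitoring_metrics):
    """Group the monitoring metrics of one component by meta metric.

    Metrics without a known mapping are skipped.
    """
    grouped = {meta: [] for meta in META_METRICS}
    for name in monitoring_metrics:
        meta = MONITORING_TO_META.get(name)
        if meta is not None:
            grouped[meta].append((component, name))
    return grouped


grouped = group_by_meta(
    "db", ["average response time", "queries per second", "unmapped metric"]
)
print(grouped["latency"])
print(grouped["traffic"])
```

A meta metric such as errors may end up with an empty list, which matches the case where a component has no corresponding monitoring metric.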

These four meta metrics are intuitive and are presented as the four golden signals in the book Site Reliability Engineering. Here is one extra comment for anyone wondering about the origin of our causal assumptions. We tried to conduct randomized controlled trials but failed; the causal assumptions were summarized from that attempt. As such a failed attempt will never be peer-reviewed or published, I have not prepared any details to share.

How were the metrics in the simulation mapped

The structural graph is not used in our simulation study. Hence, there is no metric mapping. As stated in our paper (at the beginning of Section 5.2.2), the graph deduced from the weighted adjacency matrix is used directly in the simulation study.
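
For readers unfamiliar with the idea, here is a minimal sketch of deducing a graph from a weighted adjacency matrix. It is not the simulation code itself. The function name edges_from_weights, the example weights, and the orientation (a non-zero weights[i][j] means metric i is a cause of metric j) are assumptions made for illustration; check the paper and the simulation code for the actual convention.

```python
def edges_from_weights(weights, names):
    """List (cause, effect, weight) for every non-zero matrix entry.

    Assumed convention: weights[i][j] != 0 means names[i] -> names[j].
    """
    return [
        (names[i], names[j], weight)
        for i, row in enumerate(weights)
        for j, weight in enumerate(row)
        if weight != 0
    ]


names = ["s1", "s2", "s3"]
# Made-up weights encoding s1 -> s3 and s2 -> s3
weights = [
    [0.0, 0.0, 0.8],
    [0.0, 0.0, -0.5],
    [0.0, 0.0, 0.0],
]
weighted_edges = edges_from_weights(weights, names)
print(weighted_edges)
```

The resulting (cause, effect) pairs can be wrapped in Node objects and passed to networkx.DiGraph in the same way as in the MemoryGraph example earlier in this thread.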

The structural graph construction is a method, not the target. I suggest using a better graph if you already have one.

nghson commented 1 year ago

Great. Thank you very much for your help :)

NetManAIOps / CIRCA