Causal inference for cloud operations data

Shreyanand commented 3 years ago

Cloud operational data refers to the data collected from deployed cloud applications and cloud infrastructure. Here, we focus on the data retrieved from the Operate First OpenShift cluster. The smallest unit of an OpenShift cluster is a "pod", which performs one function. Several of these pods interact with each other to deliver a service. We retrieve metrics, logs, and events data from each of these pods, which can be used for defining goals and problem statements.

One such goal we want to investigate is automated root cause analysis. In the case of failure of the cloud application, we want to be able to identify the root cause of the problem efficiently and reduce the recovery time. At scale, with many pods and services running, this can be challenging to do manually.

As a first step, we will study a simple example using the JupyterHub application on the Operate First cluster. The initial approach we are going to explore for automated root cause analysis is to discover the dependency graph between the pods that constitute the JupyterHub application and apply causal inference methods to it.
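
To make the idea concrete, here is a minimal sketch of how a dependency graph could narrow down root cause candidates. It is only a heuristic, not a causal inference method: it assumes we already know which pods look anomalous, and it flags the anomalous pods whose own dependencies all look fine. The pod names and the `root_cause_candidates` helper are hypothetical and do not come from the actual JupyterHub deployment.

```python
from typing import Dict, List, Set


def root_cause_candidates(
    depends_on: Dict[str, List[str]], anomalous: Set[str]
) -> Set[str]:
    """Return anomalous pods none of whose dependencies are anomalous."""
    candidates = set()
    for pod in anomalous:
        upstream = depends_on.get(pod, [])
        # A pod whose dependencies are all healthy cannot blame them,
        # so it is a plausible origin of the failure.
        if not any(dep in anomalous for dep in upstream):
            candidates.add(pod)
    return candidates


# Hypothetical dependency graph: pod -> pods it depends on.
depends_on = {
    "notebook": ["hub"],
    "hub": ["hub-db", "proxy"],
    "proxy": [],
    "hub-db": [],
}

# Hypothetical set of pods currently flagged as anomalous.
anomalous = {"notebook", "hub", "hub-db"}

candidates = root_cause_candidates(depends_on, anomalous)
print(candidates)
```

In this toy setup the failure is traced back through `notebook` and `hub` to the one anomalous pod with no anomalous dependencies. A real approach would also need to handle cycles, missing edges, and noisy anomaly labels.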

Issues 1 and 2 cover the specific tasks of understanding and retrieving data for the JupyterHub application.

Milestones

[x] Find the dependency graph and collect data for the JupyterHub example
[ ] Define a concrete research problem statement

Shreyanand commented 3 years ago

cc @MichaelClifford @treeinrandomforest @kgreenewald @onkarbhardwaj Feel free to add questions, suggestions, and links to relevant resources here.

Shreyanand commented 3 years ago

Given the dependency graph doc and the data collection notebook for JupyterHub, we can start exploring what we can do with this data.

There are two problems that we thought could be interesting here:

Structure learning: If we are able to derive a structure from low-level metrics and logs data, we could use it to understand complex applications that have many pods interacting with each other. In the JupyterHub example graph, that would mean determining which pods are connected and the direction of the arrows.
State prediction: Given the topology of the network and the observed logs and metrics, compute the posterior distribution of the underlying state variables. These state variables could be, for example, the overall status of a pod or the status of its memory.
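
As a rough illustration of both problems, here is a small sketch under strong simplifying assumptions. For structure learning, it guesses the direction of an edge between two metric series by checking which one leads the other in lagged correlation (a crude stand-in for proper structure learning or Granger-style tests). For state prediction, it applies Bayes' rule to a single discrete pod state. The function names, the toy series, and all probabilities are made up for illustration and are not taken from the JupyterHub data.

```python
from statistics import mean


def lagged_corr(x, y, lag):
    """Pearson correlation between x[t] and y[t + lag].

    Assumes both slices have non-zero variance.
    """
    a, b = x[: len(x) - lag], y[lag:]
    ma, mb = mean(a), mean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / (va * vb) ** 0.5


def orient_edge(x, y, lag=1):
    """Guess 'x->y' if x leads y more strongly than y leads x, else 'y->x'."""
    x_leads = abs(lagged_corr(x, y, lag))
    y_leads = abs(lagged_corr(y, x, lag))
    return "x->y" if x_leads > y_leads else "y->x"


def posterior(prior, likelihood, observation):
    """Posterior over discrete states given one observation (Bayes' rule)."""
    unnorm = {s: prior[s] * likelihood[s][observation] for s in prior}
    total = sum(unnorm.values())
    return {s: v / total for s, v in unnorm.items()}


# Toy metric series: y is x delayed by one step (hypothetical data).
x = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y = [0] + x[:-1]
direction = orient_edge(x, y)

# Toy pod state model (all numbers are illustrative assumptions).
prior = {"healthy": 0.9, "degraded": 0.1}
likelihood = {
    "healthy": {"high_mem": 0.1, "normal_mem": 0.9},
    "degraded": {"high_mem": 0.8, "normal_mem": 0.2},
}
post = posterior(prior, likelihood, "high_mem")

print(direction, post)
```

Lagged correlation alone cannot separate a direct edge from a shared upstream cause, so this only hints at direction. Likewise, a realistic state model would condition on the states of neighboring pods in the topology rather than treating each pod in isolation.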

We can use this issue to brainstorm ideas and link resources. cc @TreeinRandomForest

kgreenewald commented 3 years ago

Hi @Shreyanand, can you re-invite @onkarbhardwaj and invite @133martie (Lee Martie) to the GitHub repository? We're getting started with this data after getting back from various vacations. Thanks

Shreyanand commented 3 years ago

@kgreenewald @onkarbhardwaj @133martie I see all of you are added to the repository now. Let me know if you need anything else :)

onkarbhardwaj commented 2 years ago

Hi @Shreyanand, when I try to run this notebook, I cannot download data from Prometheus because I do not have credentials. So instead, I tried to access the location where this notebook stores the data (../data/raw/jupyterhub/metrics/), but the data does not seem to exist in the repo.

Could I get access to Prometheus or the data itself? Thanks a lot!

Shreyanand commented 2 years ago

Hi @onkarbhardwaj, due to recent data licensing concerns, we are working on obfuscating identifiers in the data. I'll update this thread as soon as it is done to make sure you have access to it.

kgreenewald commented 2 years ago

Just checking in - has this been resolved? Thanks!

Shreyanand commented 2 years ago

Hi @kgreenewald @onkarbhardwaj, thanks for checking in. Yes, the issue has been resolved. The data and notebooks are now in this repository. You can run this notebook, which lets you read the data in ../data/processed/jupyterhub/*. Let me know if there is any issue with it.

kgreenewald commented 2 years ago

Thanks!

aicoe-aiops / operate-first-jupyterhub-analysis

Causal inference for cloud operations data #14