Shreyanand opened this issue 3 years ago
cc @MichaelClifford @treeinrandomforest @kgreenewald @onkarbhardwaj Feel free to add questions, suggestions, and links to relevant resources here.
Given the dependency graph doc and the data collection notebook for jupyterhub, we can start exploring ideas about what we can do with this data.
There are two problems that we thought could be interesting here:
Structure learning: If we are able to derive a structure from low-level metrics and log data, we could use it to understand complex applications that have a lot of pods interacting with each other. In the Jupyterhub example graph, that would mean determining which pods are connected and the direction of the arrows.
State prediction: Given the topology of the network and the observed logs and metrics, compute the posterior distribution of underlying state variables. These state variables could be something like the overall status of a pod, the status of its memory, etc. A toy sketch of both problems follows below.
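To make the two problems concrete, here is a toy sketch on synthetic data. It is only an illustration of the shape of each problem, not a proposed method: candidate pod-to-pod edges are scored with a simple lagged cross-correlation (a stand-in for a real structure-learning algorithm), and the posterior over a hypothetical binary pod state is computed with Bayes' rule under an assumed Gaussian observation model. The pod names, lag, threshold, and noise model are all made up for illustration.

```python
# Toy sketch of structure learning and state prediction on synthetic pod metrics.
# Pod names, the lag, the threshold, and the Gaussian noise model are assumptions,
# not properties of the actual Jupyterhub data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic metrics: "hub" drives "proxy", which drives "user-notebook".
hub = rng.normal(size=n)
proxy = 0.8 * np.roll(hub, 1) + 0.2 * rng.normal(size=n)
notebook = 0.7 * np.roll(proxy, 1) + 0.3 * rng.normal(size=n)
metrics = pd.DataFrame({"hub": hub, "proxy": proxy, "user-notebook": notebook})

# --- 1. Structure learning via lagged cross-correlation ---
# Score a candidate edge src -> dst by how well src's metric at time t predicts
# dst's metric at time t + lag; keep edges whose absolute correlation clears a threshold.
def lagged_edges(df: pd.DataFrame, lag: int = 1, threshold: float = 0.5):
    edges = []
    for src in df.columns:
        for dst in df.columns:
            if src == dst:
                continue
            corr = df[src].iloc[:-lag].reset_index(drop=True).corr(
                df[dst].iloc[lag:].reset_index(drop=True)
            )
            if abs(corr) >= threshold:
                edges.append((src, dst, round(corr, 2)))
    return edges

print("candidate edges:", lagged_edges(metrics))

# --- 2. State prediction with Bayes' rule ---
# Treat a pod as having a hidden binary state (healthy / degraded) and assume its
# observed metric is Gaussian with a state-dependent mean; the posterior over the
# state given one observation follows directly from Bayes' rule.
def posterior_degraded(x, prior=0.1, mu_ok=0.0, mu_bad=3.0, sigma=1.0):
    def gauss(x, mu):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    p_bad = prior * gauss(x, mu_bad)
    p_ok = (1 - prior) * gauss(x, mu_ok)
    return p_bad / (p_bad + p_ok)

print("P(degraded | metric=2.5) =", round(posterior_degraded(2.5), 3))
```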
We can use this issue to brainstorm ideas and link resources. cc @TreeinRandomForest
Hi @Shreyanand, can you re-invite @onkarbhardwaj and invite @133martie (Lee Martie) to the GitHub repo? We're getting started with this data after getting back from various vacations. Thanks
@kgreenewald @onkarbhardwaj @133martie I see all of you are added to the repository now. Let me know if you need anything else :)
Hi @Shreyanand, when I try to run this notebook, I cannot download data from Prometheus because I don't have credentials. So instead, I tried to access the location in which this notebook stores the data (../data/raw/jupyterhub/metrics/), but the data does not seem to exist in the repo. Could I get access to Prometheus or the data itself? Thanks a lot!
Hi @onkarbhardwaj, due to recent data licensing concerns, we are working on obfuscating identifiers in the data. I'll update this thread as soon as it is done to make sure you have access to it.
Just checking in - has this been resolved? Thanks!
Hi @kgreenewald @onkarbhardwaj, thanks for checking in. Yes, the issue has been resolved. The data and notebooks are now in this repository. You can run this notebook, which lets you read the data in ../data/processed/jupyterhub/*. Let me know if there is any issue with it.
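In case it helps anyone who wants to look at the files outside of the notebook, here is a rough sketch of loading them with pandas. It assumes the processed files are CSVs; the actual layout and format are whatever the notebook writes, so the glob pattern and reader may need to change.

```python
# Rough sketch of reading the processed Jupyterhub data outside the notebook.
# Assumes CSV files under ../data/processed/jupyterhub/; adjust to the real format.
from pathlib import Path
import pandas as pd

data_dir = Path("../data/processed/jupyterhub")
frames = {}
for path in sorted(data_dir.glob("**/*.csv")):
    frames[path.stem] = pd.read_csv(path)
    print(path.name, frames[path.stem].shape)
```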
Thanks!
Cloud operational data refers to the data collected from deployed cloud applications and cloud infrastructure. Here, we focus on the data retrieved from the Operate First OpenShift cluster. The smallest unit of an OpenShift cluster is a “pod” that performs one function. Several of these pods interact with each other to deliver a service. We retrieve metrics, logs, and events data from each of these pods, which can be used for defining goals and problem statements.
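For reference, the metrics side of this data is served by the cluster's Prometheus instance, so a pull looks roughly like the sketch below, which uses the standard Prometheus HTTP range-query API. The endpoint URL, token, metric name, and namespace label are placeholders rather than the actual Operate First values.

```python
# Rough sketch of pulling per-pod metrics over the standard Prometheus HTTP API.
# The endpoint, bearer token, metric name, and namespace below are placeholders.
import time
import requests

PROM_URL = "https://prometheus.example.com"   # placeholder endpoint
TOKEN = "..."                                 # placeholder credential

end = time.time()
start = end - 6 * 3600  # last six hours

resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={
        "query": 'container_memory_working_set_bytes{namespace="jupyterhub"}',
        "start": start,
        "end": end,
        "step": "5m",
    },
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
series = resp.json()["data"]["result"]  # one entry per pod/container time series
print(f"retrieved {len(series)} series")
```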
One such goal we want to investigate is automated root cause analysis. When a cloud application fails, we want to identify the root cause of the problem efficiently and reduce the recovery time. At scale, with many pods and services running, this is challenging to do manually.
For the first step, we will study a simple example using the Jupyterhub application on the Operate First cluster. The initial approach we are going to explore for automated root cause analysis is to discover the dependency graph between the pods that constitute the Jupyterhub application and to apply causal inference methods to it.
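As a placeholder for how the discovered graph would be used, the sketch below treats the pod dependencies as a directed graph and narrows root-cause candidates for a failing pod to its upstream pods. This is plain graph traversal rather than a causal inference method, and the edges are invented for illustration; producing them is exactly the structure-learning problem described above.

```python
# Toy sketch: given a (hand-specified) pod dependency graph, the ancestors of a
# failing pod are the candidate root causes to inspect first. The edges are made
# up for illustration; the real ones are what structure learning should discover.
import networkx as nx

graph = nx.DiGraph()
graph.add_edges_from([
    ("db", "hub"),
    ("hub", "proxy"),
    ("proxy", "user-notebook"),
])

failing_pod = "user-notebook"
candidates = nx.ancestors(graph, failing_pod)
print(f"pods upstream of {failing_pod}: {sorted(candidates)}")
```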
Issues 1 and 2 highlight the specific task of understanding and retrieving data for the Jupyterhub application.
Milestones