canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0

Implement charms state grafana dashboard #877

Closed orfeas-k closed 5 months ago

orfeas-k commented 5 months ago

Context

Implement a Grafana dashboard that shows the charm state for all Kubeflow charms that provide metrics. This has been specced out here, with the difference that we will ship it with the kubeflow-dashboard charm to simplify its deployment.

What needs to get done

Add the generic Grafana dashboard as part of the kubeflow-dashboard charm.

Definition of Done

There is the grafana dashboard.

syncronize-issues-to-jira[bot] commented 5 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5591.

This message was autogenerated

orfeas-k commented 5 months ago

Regarding this dashboard, the spec says it should present the "Uptime in % during the past 5 minutes". That would look like this:

image

However, this visualization has a limitation: it shows the last value it received. So if the charm stops providing the metric completely (e.g. someone ran juju scale-application <app> 0), it will keep showing 100%, which is misleading. Mapping Null, NaN or NoValue to 0 does not fix this. We hit the same limitation with katib-controller, where the controller stopped emitting the current trials metric when it had no trials.
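The limitation can be modelled with a small sketch. This is not Grafana's code: it assumes the panel reduces the series with a "last non-null value" calculation and applies the null-to-0 mapping only to the reduced result. The sample values and function names are hypothetical.

```python
from typing import Optional, Sequence


def last_not_null(samples: Sequence[Optional[float]]) -> Optional[float]:
    """Return the most recent non-null sample, or None if every sample is null."""
    for value in reversed(samples):
        if value is not None:
            return value
    return None


def map_null_to_zero(value: Optional[float]) -> float:
    """Value mapping applied to the reduced value (assumption: not to each sample)."""
    return 0.0 if value is None else value


# Hypothetical uptime samples in %: the charm is scaled to 0 after the third
# scrape, so the later samples are missing rather than 0.
samples = [100.0, 100.0, 100.0, None, None]

displayed = map_null_to_zero(last_not_null(samples))
print(displayed)  # 100.0, although nothing has been reported recently
```

Under these assumptions the mapping never fires, because the reduction has already skipped the missing samples and returned the stale 100.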

In the case of katib-controller, we went with a time series visualization together with mapping noValue to 0. I tried this here and it works the same way. However, here we have more than two lines in the same panel, which means that one charm's line can be hidden under another's. If two charms have an issue and start emitting 0 values (or no values) at exactly the same time, the graph will not show that more than one application is down unless you click on each application one by one.

image

For example, in the visualization above, if another app went down at the same time as seldon-controller-manager (blue line), it wouldn't be visible. To summarize, we cannot see how many lines overlap at each value, and thus how many applications are up or down.
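A minimal sketch of why overlapping lines hide the count: at any timestamp the graph can only show the distinct values being drawn, not how many series sit on each one. The sample data below is hypothetical, and the helper names are made up for illustration.

```python
from typing import Dict, List, Set


def drawn_values(series: Dict[str, List[int]], index: int) -> Set[int]:
    """Distinct y-values visible at one timestamp; overlapping lines collapse."""
    return {values[index] for values in series.values()}


def apps_down(series: Dict[str, List[int]], index: int) -> List[str]:
    """Applications whose value is 0 at the given timestamp."""
    return sorted(app for app, values in series.items() if values[index] == 0)


# Hypothetical "up" samples at two timestamps for three charms.
one_down = {
    "seldon-controller-manager": [1, 0],
    "katib-controller": [1, 1],
    "kubeflow-dashboard": [1, 1],
}
two_down = {
    "seldon-controller-manager": [1, 0],
    "katib-controller": [1, 0],
    "kubeflow-dashboard": [1, 1],
}

# Both scenarios draw the same set of values at the second timestamp.
print(drawn_values(one_down, 1) == drawn_values(two_down, 1))  # True
# Only a per-application view tells them apart.
print(apps_down(two_down, 1))  # ['katib-controller', 'seldon-controller-manager']
```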

Solution

To resolve the above limitation, we went with a "state timeline" visualisation (the name sounds like a good fit, right?). Instead of showing an "Uptime in % during the past 5 minutes", we'll be showing each application's up metric at each point in time, interpreted as Up (1) or Down (0). This way, the user can see the state of each charm over time.

image
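As a rough sketch of what such a panel definition could look like, the snippet below builds a panel as a Python dict and prints it as JSON. It is not the dashboard that was actually shipped. The field names follow Grafana's dashboard JSON model as best understood here and should be checked against a panel exported from Grafana. The panel title, the colors, and the `juju_application` legend label are assumptions.

```python
import json


def state_timeline_panel(title: str, expr: str, legend: str) -> dict:
    """Build a state-timeline panel that renders 1 as Up and 0 as Down."""
    return {
        "type": "state-timeline",
        "title": title,
        "targets": [{"expr": expr, "legendFormat": legend, "refId": "A"}],
        "fieldConfig": {
            "defaults": {
                "mappings": [
                    {
                        "type": "value",
                        "options": {
                            "0": {"text": "Down", "color": "red"},
                            "1": {"text": "Up", "color": "green"},
                        },
                    }
                ]
            },
            "overrides": [],
        },
    }


# Assumption: each charm's series carries a juju_application label.
panel = state_timeline_panel("Charm state", "up", "{{juju_application}}")
print(json.dumps(panel, indent=2))
```

Each application gets its own row in a state timeline, so two charms going down at the same moment show up as two separate rows instead of one overlapped line.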