canonical / bundle-kubeflow

Charmed Kubeflow

UnitsAvailable alerts are firing constantly #564

Closed facundofc closed 7 months ago

facundofc commented 1 year ago

On a recent deployment we're seeing these alerts firing all the time (literally, stuck on "firing").

Looking at the up metric (which these alert rules query), we see that the values alternate between 1 and 0 every 45 seconds (this is a sample from the argo controller, the query being up{juju_application="argo-controller",juju_..."="..."}[10m]):

1 @1679925736.77
0 @1679925781.26
1 @1679925796.77
0 @1679925841.26
1 @1679925856.77
0 @1679925901.26
1 @1679925916.77
0 @1679925961.26
1 @1679925976.77
0 @1679926021.26
1 @1679926036.77
0 @1679926081.26
1 @1679926096.77
0 @1679926141.26
1 @1679926156.77
0 @1679926201.26
1 @1679926216.77
0 @1679926261.26
1 @1679926276.77
0 @1679926321.26
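
For reference, these samples can be pulled straight from Prometheus' HTTP API; the following is a minimal sketch, assuming a locally reachable Prometheus (the URL and label selector need adjusting to your COS deployment):

# Minimal sketch for pulling the raw `up` samples shown above via Prometheus' HTTP API.
# The URL and label selector are assumptions; adjust them to your deployment.
import requests

PROM_URL = "http://localhost:9090"
QUERY = 'up{juju_application="argo-controller"}[10m]'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    for timestamp, value in series["values"]:
        print(f"{value} @{timestamp}")  # flapping shows up as alternating 1/0 samples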

In addition to this flapping behavior, the for: duration for these alerts (at least for argo) is set to 0m, which seems too sensitive for production environments.

i-chvets commented 1 year ago

Fix is merged.

facundofc commented 1 year ago

@i-chvets, the changes pushed by @dparv (changing the for: from 0 to 5m) are not a fix for this issue. I believe this should be reopened, as the metrics are either flapping or stuck at 0 (as in the dex-auth case). That needs to be addressed, or it should be pointed out here where it was addressed.

Thanks!

orfeas-k commented 1 year ago

Thank you @facundofc for letting us know about the issue not having been fixed. In order to better understand the issue, our team will need some more information.

  1. Are there specific steps to follow in order to reproduce the issue?
  2. Do you deploy these charms alone or through the bundle?
  3. Is there a specific test environment or a CI where this has run? That would be of great help too.
  4. What would be the expected behaviour? I understand that we don't want the flapping between 1 and 0, but what would we expect the values to be? Also, should alerts move from the firing stage to a next one?

nishant-dash commented 9 months ago

On a fresh installation of Kubeflow 1.8/stable I am seeing some units constantly firing "unit is down" alerts, in particular these four (the ones with value 0):

up{agent_hostname="grafana-agent-k8s-0", instance="kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_grafana-agent-k8s_grafana-agent-k8s/0", job="juju_kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_grafana-agent-k8s_self-monitoring", juju_application="grafana-agent-k8s", juju_charm="grafana-agent-k8s", juju_model="kubeflow", juju_model_uuid="852b1fac-feb9-4e58-83d6-f89a640a1da6", juju_unit="grafana-agent-k8s/0"}
    1
up{instance="kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_argo-controller_argo-controller/4", job="juju_kubeflow_852b1fac_argo-controller_prometheus_scrape-4", juju_application="argo-controller", juju_charm="argo-controller", juju_model="kubeflow", juju_model_uuid="852b1fac-feb9-4e58-83d6-f89a640a1da6", juju_unit="argo-controller/4"}
    1
up{instance="kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_dex-auth_dex-auth/0", job="juju_kubeflow_852b1fac_dex-auth_prometheus_scrape-0", juju_application="dex-auth", juju_charm="dex-auth", juju_model="kubeflow", juju_model_uuid="852b1fac-feb9-4e58-83d6-f89a640a1da6", juju_unit="dex-auth/0"}
    0
up{instance="kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_jupyter-controller_jupyter-controller/0", job="juju_kubeflow_852b1fac_jupyter-controller_prometheus_scrape-0", juju_application="jupyter-controller", juju_charm="jupyter-controller", juju_model="kubeflow", juju_model_uuid="852b1fac-feb9-4e58-83d6-f89a640a1da6", juju_unit="jupyter-controller/0"}
    1
up{instance="kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_katib-controller_katib-controller/5", job="juju_kubeflow_852b1fac_katib-controller_prometheus_scrape_katib_controller_metrics-5", juju_application="katib-controller", juju_charm="katib-controller", juju_model="kubeflow", juju_model_uuid="852b1fac-feb9-4e58-83d6-f89a640a1da6", juju_unit="katib-controller/5"}
    1
up{instance="kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_kfp-api_kfp-api/0", job="juju_kubeflow_852b1fac_kfp-api_prometheus_scrape-0", juju_application="kfp-api", juju_charm="kfp-api", juju_model="kubeflow", juju_model_uuid="852b1fac-feb9-4e58-83d6-f89a640a1da6", juju_unit="kfp-api/0"}
    1
up{instance="kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_metacontroller-operator_metacontroller-operator/0", job="juju_kubeflow_852b1fac_metacontroller-operator_prometheus_scrape-0", juju_application="metacontroller-operator", juju_charm="metacontroller-operator", juju_model="kubeflow", juju_model_uuid="852b1fac-feb9-4e58-83d6-f89a640a1da6", juju_unit="metacontroller-operator/0"}
    0
up{instance="kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_minio_minio/0", job="juju_kubeflow_852b1fac_minio_prometheus_scrape_minio_metrics-0", juju_application="minio", juju_charm="minio", juju_model="kubeflow", juju_model_uuid="852b1fac-feb9-4e58-83d6-f89a640a1da6", juju_unit="minio/0"}
    0
up{instance="kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_seldon-controller-manager_seldon-controller-manager/0", job="juju_kubeflow_852b1fac_seldon-controller-manager_prometheus_scrape-0_354af41482c6f1a39a32f27c0760232d43b061fc310fafbb11cdf9c96089da64", juju_application="seldon-controller-manager", juju_charm="seldon-core", juju_model="kubeflow", juju_model_uuid="852b1fac-feb9-4e58-83d6-f89a640a1da6", juju_unit="seldon-controller-manager/0"}
    0
up{instance="kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_seldon-controller-manager_seldon-controller-manager/0", job="juju_kubeflow_852b1fac_seldon-controller-manager_prometheus_scrape-0_6a4146b8e841614be9619bdf209692efd31d6c8616f67001bd3badd38602d0ac", juju_application="seldon-controller-manager", juju_charm="seldon-core", juju_model="kubeflow", juju_model_uuid="852b1fac-feb9-4e58-83d6-f89a640a1da6", juju_unit="seldon-controller-manager/0"}
    1
up{instance="kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_training-operator_training-operator/0", job="juju_kubeflow_852b1fac_training-operator_prometheus_scrape-0", juju_application="training-operator", juju_charm="training-operator", juju_model="kubeflow", juju_model_uuid="852b1fac-feb9-4e58-83d6-f89a640a1da6", juju_unit="training-operator/0"}

nishant-dash commented 9 months ago

Graphing those 4 alerts, for minio, seldon-controller-manager, metacontroller-operator and dex-auth, I see they have been 0 since the beginning of the deployment (i.e. they were never 1 to begin with).

nishant-dash commented 9 months ago

@orfeas-k I can answer your questions in the context of KF 1.8/stable

  1. Just deploying cos integration with kubeflow. Nothing extra here
  2. Through the bundle
  3. There is not, but I have this running on a remote VM that I can probably get you access to.
  4. In this context, the metrics behind the 4 alerts mentioned above are stuck at 0 even though the units are available, running and happy. I would expect them to be 1. Given that, the alerts being constantly in the firing state makes sense to me.
simskij commented 7 months ago

What @nishant-dash said. I'll copy in my reply from Matrix:

If the metrics that are being scraped alternate between 1 and 0, that's a sign that something is wrong either with your exporter or with the application itself, as it's actually producing metrics saying that it's unhealthy. So what happens here is that the alert rule - correctly - triggers and lets you know that something fishy is going on in your charm.

syncronize-issues-to-jira[bot] commented 7 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5268.

This message was autogenerated

DnPlas commented 7 months ago

Thanks @simskij and @nishant-dash for the insights.

Steps to reproduce

  1. Deploy the latest supported versions of the offending charms. For example juju deploy minio --channel ckf-1.8/stable
  2. Deploy cos-lite following this guide
  3. Follow this guide to deploy other dependencies and add the required relations.
  4. Go to the Prometheus dashboard > Alerts and list all the firing alerts.

I was able to reproduce the issue only for four of the six applications originally reported in the issue (minio, metacontroller, dex, seldon). After some investigation, I did not observe any issues with the units, and in fact none of the applications' Pods had been restarted even once:

ubuntu@charm-dev-jammy:~/minio-operator$ kubectl get pods -nkubeflow
NAME                                            READY   STATUS    RESTARTS   AGE
modeloperator-84f4db8-gp67q                     1/1     Running   0          62m
minio-operator-0                                1/1     Running   0          61m
argo-controller-0                               2/2     Running   0          62m
grafana-agent-k8s-0                             2/2     Running   0          60m
dex-auth-0                                      2/2     Running   0          51m
istiod-6fcf5445fc-cnfzv                         1/1     Running   0          50m
istio-ingressgateway-workload-cb759595c-5ctwq   1/1     Running   0          50m
istio-pilot-0                                   1/1     Running   0          50m
istio-ingressgateway-0                          1/1     Running   0          50m
training-operator-0                             2/2     Running   0          46m
metacontroller-operator-charm-0                 1/1     Running   0          45m
metacontroller-operator-0                       1/1     Running   0          45m
jupyter-controller-0                            2/2     Running   0          44m
seldon-controller-manager-0                     2/2     Running   0          42m
minio-0                                         1/1     Running   0          6m29s

Because of the above, I started investigating each application's metrics endpoint to check what metrics were being scraped, and found that the four "failing" apps had misconfigurations, meaning the scraped metrics were either not accurate or not reachable at all. The following is a summary of my findings:

Minio

When trying to reach the metrics endpoint via svc:9000/minio/v2/metrics/cluster from within the minio Pod, I got a 403 Forbidden response, which meant the endpoint requires some authorization to be reached. According to the docs and the minio Prometheus setup guide, we are supposed to set MINIO_PROMETHEUS_AUTH_TYPE="public" as an env variable in the container. After setting the variable, I was able to curl the endpoint and fetch metrics. This also stopped the alert from firing constantly.

Fix is in https://github.com/canonical/minio-operator/pull/157
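
For illustration, in a sidecar charm that variable would typically be injected through the workload's Pebble layer; the sketch below is only an approximation (the service name and command are assumptions, not the actual minio-operator code):

# Illustrative Pebble layer sketch; service name and command are assumptions,
# not the actual minio-operator code.
def minio_pebble_layer() -> dict:
    return {
        "summary": "minio layer",
        "services": {
            "minio": {
                "override": "replace",
                "command": "minio server /data",
                "startup": "enabled",
                "environment": {
                    # Lets Prometheus scrape /minio/v2/metrics/cluster without a bearer token.
                    "MINIO_PROMETHEUS_AUTH_TYPE": "public",
                },
            }
        },
    }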

Dex

The metrics endpoint cannot be reached from inside the Pod:

root@dex-auth-0:/# curl -v localhost:5558/metrics
*   Trying 127.0.0.1:5558...
* connect to 127.0.0.1 port 5558 failed: Connection refused
*   Trying ::1:5558...
* connect to ::1 port 5558 failed: Connection refused
* Failed to connect to localhost port 5558 after 0 ms: Connection refused
* Closing connection 0
curl: (7) Failed to connect to localhost port 5558 after 0 ms: Connection refused

This looks similar to the other cases, but the upstream documentation doesn't really show how to correctly reach the metrics endpoint. This app needs more investigation.

Metacontroller

This charm deploys a StatefulSet with lightkube, which in turn creates the necessary resources for the operator to function correctly. In the current version of this charm, there is no Service attached to the Pod that gets created for running the workload, meaning there is no way of reaching its metrics endpoint. This also means that the MetricsEndpointProvider's target is not correct: in the current state of the charm it is pointing at the charm and not the workload, causing the reported issue, as there is no metrics endpoint in the charm Pod.

NOTE: this charm is a bit special, as we are deploying the charm code in a Pod different than the one for the workload.

ubuntu@charm-dev-jammy:~$ kubectl get pods -nkubeflow | grep metacontroller
metacontroller-operator-charm-0                 1/1     Running   0          113m # <---- this is the workload, deployed via a StatefulSet
metacontroller-operator-0                       1/1     Running   0          114m # <---- this is the charm

Fix is in https://github.com/canonical/metacontroller-operator/pull/101
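
For context, the missing piece is essentially a Service selecting the workload Pods so the metrics port becomes reachable in-cluster. The lightkube sketch below is only an illustration (selector, port and names are assumptions, not the code in the PR):

# Illustration only: selector, port and names are assumptions, not the actual fix.
from lightkube import Client
from lightkube.models.core_v1 import ServicePort, ServiceSpec
from lightkube.models.meta_v1 import ObjectMeta
from lightkube.resources.core_v1 import Service

def create_metrics_service(namespace: str = "kubeflow") -> None:
    client = Client(field_manager="metacontroller-operator")
    service = Service(
        metadata=ObjectMeta(name="metacontroller-operator-metrics", namespace=namespace),
        spec=ServiceSpec(
            # Must match the labels and metrics port of the workload StatefulSet.
            selector={"app.kubernetes.io/name": "metacontroller-operator-charm"},
            ports=[ServicePort(port=9999, name="metrics", targetPort=9999)],
        ),
    )
    client.apply(service)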

Seldon

For this one, the metrics path is not correctly configured. The charm declares that the metrics are served at svc:8080/self.config["executor-server-metrics-port-name"], but that configuration value does not exist in the charm's config.yaml file. The correct path is svc:8080/metrics.

Fix is in https://github.com/canonical/seldon-core-operator/pull/236
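
As a rough sketch of the intended shape (a charm skeleton with an assumed port, not the charm's actual code), the provider should be given the literal path:

# Rough sketch with an assumed port; not the actual seldon-core-operator code.
from ops.charm import CharmBase
from charms.prometheus_k8s.v0.prometheus_scrape import MetricsEndpointProvider

class SeldonLikeCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        # A single scrape job with a literal /metrics path, instead of interpolating
        # a config option that does not exist.
        self.metrics_endpoint = MetricsEndpointProvider(
            self,
            jobs=[{
                "metrics_path": "/metrics",
                "static_configs": [{"targets": ["*:8080"]}],
            }],
        )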

DnPlas commented 7 months ago

Part of the fix for this issue will also be to refactor the integration tests for all our charms that are integrated with prometheus, as these types of errors should have been caught by them.
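
As a sketch of the kind of assertion such a shared test could make (the helper name and wiring are assumptions, not actual shared test code):

# Sketch of an assertion a shared integration test could make; names are assumptions.
import requests

def assert_app_is_up(prometheus_url: str, app_name: str) -> None:
    """Fail unless every `up` series for the application currently reports 1."""
    resp = requests.get(
        f"{prometheus_url}/api/v1/query",
        params={"query": f'up{{juju_application="{app_name}"}}'},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    assert results, f"no `up` series found for {app_name}"
    assert all(sample["value"][1] == "1" for sample in results), f"{app_name} target is down"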

DnPlas commented 7 months ago

Here are some updates after a bit more debugging

Dex

There was a misconfiguration in the telemetry settings for the dex workload. We need to set the telemetry value in dex's config file (/etc/dex/config.docker.yaml). Alerts were firing for this charm because, due to the missing configuration, there was nothing listening on the metrics port that we passed to the MetricsEndpointProvider.

Fix in https://github.com/canonical/dex-auth-operator/pull/185
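
The relevant piece is dex's telemetry stanza. The sketch below only illustrates how a charm might render it into the config it writes; the surrounding keys are simplified assumptions, not the actual dex-auth template:

# Simplified sketch; the surrounding keys are assumptions, not the dex-auth template.
import yaml

def render_dex_config(issuer_url: str, metrics_port: int = 5558) -> str:
    config = {
        "issuer": issuer_url,
        "storage": {"type": "kubernetes", "config": {"inCluster": True}},
        # Without this block nothing listens on the metrics port, Prometheus scrapes
        # fail, and the unit-availability alert fires.
        "telemetry": {"http": f"0.0.0.0:{metrics_port}"},
    }
    return yaml.safe_dump(config)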

Seldon

Despite the metrics endpoints being reachable, even from inside the prometheus Pod, alerts are still firing constantly. This requires more investigation. Querying up{juju_application="seldon-controller-manager",juju_model="kubeflow",juju_model_uuid="47f52a8c-c74d-4072-8e25-4865db988512"}[5m] returns two objects showing different values:

up{instance="kubeflow_47f52a8c-c74d-4072-8e25-4865db988512_seldon-controller-manager_seldon-controller-manager/0", job="juju_kubeflow_47f52a8c_seldon-controller-manager_prometheus_scrape-0", juju_application="seldon-controller-manager", juju_charm="seldon-core", juju_model="kubeflow", juju_model_uuid="47f52a8c-c74d-4072-8e25-4865db988512", juju_unit="seldon-controller-manager/0"}  # <--- This has a bunch of 0

up{instance="kubeflow_47f52a8c-c74d-4072-8e25-4865db988512_seldon-controller-manager_seldon-controller-manager/0", job="juju_kubeflow_47f52a8c_seldon-controller-manager_prometheus_scrape_seldon_metrics-0", juju_application="seldon-controller-manager", juju_charm="seldon-core", juju_model="kubeflow", juju_model_uuid="47f52a8c-c74d-4072-8e25-4865db988512", juju_unit="seldon-controller-manager/0"}  # <--- This has a bunch of 1

Training and Jupyter

I have been able to reproduce the issue with these charms, but it's intermittent. The metrics endpoints are reachable and the port and path are correctly set and passed to the MetricsEndpointProvider, but sometimes the error happens and sometimes it doesn't. We need to investigate this case further.

EDIT: The metrics endpoints are not reachable from the prometheus scraper, even though the port and path are correctly set and passed to the MetricsEndpointProvider; alerts fire because prometheus cannot scrape the metrics endpoint.

DnPlas commented 7 months ago

Here are some updates after a bit more debugging

Training and Jupyter

To ensure the metrics endpoint is actually reachable by the prometheus scraper, I have configured the Kubernetes Services of these applications to always expose the metrics port.

Fix for training-operator: https://github.com/canonical/training-operator/pull/151
Fix for jupyter-controller: https://github.com/canonical/notebook-operators/pull/332
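
For reference, the kubernetes_service_patch library makes this a small addition in the charm's constructor; the skeleton below is illustrative only (the port number is an assumption, not the operators' exact code):

# Illustrative skeleton; the port number is an assumption, not the operators' exact code.
from ops.charm import CharmBase
from lightkube.models.core_v1 import ServicePort
from charms.observability_libs.v1.kubernetes_service_patch import KubernetesServicePatch

class ControllerCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        # Patch the application's Kubernetes Service so the metrics port is always exposed.
        self.service_patcher = KubernetesServicePatch(
            self,
            ports=[ServicePort(8080, name="metrics")],
        )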

Seldon

As mentioned before, there is strange behaviour in seldon, as two scrape jobs are set for this application. This shows up in two different places:

  1. Querying the up metric in the prometheus dashboard returns more than one item with different job names. The job that is always 1 should be the only existing job:
up{instance="kubeflow_47f52a8c-c74d-4072-8e25-4865db988512_seldon-controller-manager_seldon-controller-manager/0", job="juju_kubeflow_47f52a8c_seldon-controller-manager_prometheus_scrape-0", juju_application="seldon-controller-manager", juju_charm="seldon-core", juju_model="kubeflow", juju_model_uuid="47f52a8c-c74d-4072-8e25-4865db988512", juju_unit="seldon-controller-manager/0"}  # <--- This has a bunch of 0

up{instance="kubeflow_47f52a8c-c74d-4072-8e25-4865db988512_seldon-controller-manager_seldon-controller-manager/0", job
  2. Checking the relation data bag, I can observe the following jobs:
      scrape_jobs: '[{"metrics_path": "/metrics", "static_configs": [{"targets": ["seldon-controller-manager.seldon-only.svc:8080"]}],
        "scrape_interval": "30s"}, {"metrics_path": "/metrics", "static_configs":
        [{"targets": ["*:80"]}]}]'

    This can explain why the alert is firing constantly: there is a job that is scraping metrics from *:80/metrics which is not a valid metrics endpoint.

After a closer inspection of the charm code, I noticed that we are setting an extra job in L164. This extra job was introduced in https://github.com/canonical/seldon-core-operator/pull/94, but the reason for adding it instead of just leaving the regular metrics endpoint is not clear. To fix the issue, we'll remove this extra line to avoid having a misconfigured job that is causing the UnitsUnavailable alert to constantly fire.

Fix for seldon: https://github.com/canonical/seldon-core-operator/pull/236

DnPlas commented 7 months ago

Argo

I was not able to reproduce the issue for argo-controller. I checked the relation data and the metrics endpoint, and ensured it is reachable from the prometheus scraper. In the prometheus dashboard, I made a query for the unit and it stays at a constant 1. It is possible that recent changes in the charm code helped fix the issue. Should this issue show up again for argo, let's file a separate issue.

DnPlas commented 7 months ago

Summary and conclusions

Future work

We have planned to improve our integration with COS for 24.04. Some of the things we can improve (based on my experience fixing this issue) are:

  1. Documentation

    • This guide is a bit outdated and its commands don't work; we need to update them.
    • The guide is missing other charms that are already integrated with COS, like istio-pilot.
  2. Testing

    • We have a lot of duplication in our test cases for COS integration; we can put re-usable test cases in charmed-kubeflow-chisme to keep consistency and accuracy.
    • Test cases are not accurate; they do not assert the things we actually care about, like the unit being available.
  3. Services

    • We are not always exposing the metrics port in the app Service, which may cause issues when trying to scrape metrics. Using the kubernetes_service_patch library can help alleviate this.

PRs

DnPlas commented 7 months ago

All PRs have been merged, so we can close this issue. Feel free to re-open if this is still an issue.