Closed: facundofc closed this issue 7 months ago.
Fix is merged.
@i-chvets, the changes pushed by @dparv (changing the `for:` from `0` to `5m`) are not a fix for this issue. I believe this should be reopened, as the metrics are flapping or stuck at 0 (as in the dex-auth case). That needs to be addressed, or it should be pointed out here where it was addressed.
Thanks!
Thank you @facundofc for letting us know about the issue not having been fixed. In order to better understand the issue, our team will need some more information.
Do the alerts go from the `firing` stage to a next one?

On a fresh installation of Kubeflow 1.8/stable I am seeing some units constantly firing "unit is down". In particular these 4 (that have value 0):
up{agent_hostname="grafana-agent-k8s-0", instance="kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_grafana-agent-k8s_grafana-agent-k8s/0", job="juju_kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_grafana-agent-k8s_self-monitoring", juju_application="grafana-agent-k8s", juju_charm="grafana-agent-k8s", juju_model="kubeflow", juju_model_uuid="852b1fac-feb9-4e58-83d6-f89a640a1da6", juju_unit="grafana-agent-k8s/0"}
1
up{instance="kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_argo-controller_argo-controller/4", job="juju_kubeflow_852b1fac_argo-controller_prometheus_scrape-4", juju_application="argo-controller", juju_charm="argo-controller", juju_model="kubeflow", juju_model_uuid="852b1fac-feb9-4e58-83d6-f89a640a1da6", juju_unit="argo-controller/4"}
1
up{instance="kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_dex-auth_dex-auth/0", job="juju_kubeflow_852b1fac_dex-auth_prometheus_scrape-0", juju_application="dex-auth", juju_charm="dex-auth", juju_model="kubeflow", juju_model_uuid="852b1fac-feb9-4e58-83d6-f89a640a1da6", juju_unit="dex-auth/0"}
0
up{instance="kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_jupyter-controller_jupyter-controller/0", job="juju_kubeflow_852b1fac_jupyter-controller_prometheus_scrape-0", juju_application="jupyter-controller", juju_charm="jupyter-controller", juju_model="kubeflow", juju_model_uuid="852b1fac-feb9-4e58-83d6-f89a640a1da6", juju_unit="jupyter-controller/0"}
1
up{instance="kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_katib-controller_katib-controller/5", job="juju_kubeflow_852b1fac_katib-controller_prometheus_scrape_katib_controller_metrics-5", juju_application="katib-controller", juju_charm="katib-controller", juju_model="kubeflow", juju_model_uuid="852b1fac-feb9-4e58-83d6-f89a640a1da6", juju_unit="katib-controller/5"}
1
up{instance="kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_kfp-api_kfp-api/0", job="juju_kubeflow_852b1fac_kfp-api_prometheus_scrape-0", juju_application="kfp-api", juju_charm="kfp-api", juju_model="kubeflow", juju_model_uuid="852b1fac-feb9-4e58-83d6-f89a640a1da6", juju_unit="kfp-api/0"}
1
up{instance="kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_metacontroller-operator_metacontroller-operator/0", job="juju_kubeflow_852b1fac_metacontroller-operator_prometheus_scrape-0", juju_application="metacontroller-operator", juju_charm="metacontroller-operator", juju_model="kubeflow", juju_model_uuid="852b1fac-feb9-4e58-83d6-f89a640a1da6", juju_unit="metacontroller-operator/0"}
0
up{instance="kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_minio_minio/0", job="juju_kubeflow_852b1fac_minio_prometheus_scrape_minio_metrics-0", juju_application="minio", juju_charm="minio", juju_model="kubeflow", juju_model_uuid="852b1fac-feb9-4e58-83d6-f89a640a1da6", juju_unit="minio/0"}
0
up{instance="kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_seldon-controller-manager_seldon-controller-manager/0", job="juju_kubeflow_852b1fac_seldon-controller-manager_prometheus_scrape-0_354af41482c6f1a39a32f27c0760232d43b061fc310fafbb11cdf9c96089da64", juju_application="seldon-controller-manager", juju_charm="seldon-core", juju_model="kubeflow", juju_model_uuid="852b1fac-feb9-4e58-83d6-f89a640a1da6", juju_unit="seldon-controller-manager/0"}
0
up{instance="kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_seldon-controller-manager_seldon-controller-manager/0", job="juju_kubeflow_852b1fac_seldon-controller-manager_prometheus_scrape-0_6a4146b8e841614be9619bdf209692efd31d6c8616f67001bd3badd38602d0ac", juju_application="seldon-controller-manager", juju_charm="seldon-core", juju_model="kubeflow", juju_model_uuid="852b1fac-feb9-4e58-83d6-f89a640a1da6", juju_unit="seldon-controller-manager/0"}
1
up{instance="kubeflow_852b1fac-feb9-4e58-83d6-f89a640a1da6_training-operator_training-operator/0", job="juju_kubeflow_852b1fac_training-operator_prometheus_scrape-0", juju_application="training-operator", juju_charm="training-operator", juju_model="kubeflow", juju_model_uuid="852b1fac-feb9-4e58-83d6-f89a640a1da6", juju_unit="training-operator/0"}
Graphing those 4 alerts, for minio, seldon controller manager, metacontroller operator and dex auth, I see they have been 0 since the beginning of the deployment (i.e. they were never 1 to begin with).
@orfeas-k I can answer your questions in the context of KF 1.8/stable
What @nishant-dash said. I'll copy in my reply from Matrix:
if the metrics that are being scraped alternate between 1 and 0, that's a sign that something is wrong either with your exporter, or with the application itself, as it's actually producing metrics saying that it's unhealthy. So, what happens here is that the alert rule - correctly - triggers and lets you know that something fishy is going on in your charm
Thank you for reporting your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5268.
This message was autogenerated
Thanks @simskij and @nishant-dash for the insights.
After deploying the charms (e.g. `juju deploy minio --channel ckf-1.8/stable`) together with `cos-lite`, following this guide, I was able to reproduce the issue only for four (minio, metacontroller, dex, seldon) of the six applications that were originally reported in the issue. After some investigation, I did not observe any issues with the units, and in fact none of the applications' Pods were restarted even once:
ubuntu@charm-dev-jammy:~/minio-operator$ kubectl get pods -nkubeflow
NAME READY STATUS RESTARTS AGE
modeloperator-84f4db8-gp67q 1/1 Running 0 62m
minio-operator-0 1/1 Running 0 61m
argo-controller-0 2/2 Running 0 62m
grafana-agent-k8s-0 2/2 Running 0 60m
dex-auth-0 2/2 Running 0 51m
istiod-6fcf5445fc-cnfzv 1/1 Running 0 50m
istio-ingressgateway-workload-cb759595c-5ctwq 1/1 Running 0 50m
istio-pilot-0 1/1 Running 0 50m
istio-ingressgateway-0 1/1 Running 0 50m
training-operator-0 2/2 Running 0 46m
metacontroller-operator-charm-0 1/1 Running 0 45m
metacontroller-operator-0 1/1 Running 0 45m
jupyter-controller-0 2/2 Running 0 44m
seldon-controller-manager-0 2/2 Running 0 42m
minio-0 1/1 Running 0 6m29s
Because of the above, I started investigating each application's metrics endpoint to check what metrics were being scraped, and found out that the four "failing" apps had misconfigurations, meaning the scraped metrics were either not accurate or not reachable at all. The following is a summary of my findings:
When trying to reach the metrics endpoint via `svc:9000/minio/v2/metrics/cluster` from within the minio Pod, I got a `403 Forbidden` message, which meant the endpoint needed some authorization to be reached.
According to the docs and the minio Prometheus setup guide, we are supposed to set `MINIO_PROMETHEUS_AUTH_TYPE="public"` as an environment variable in the container. After setting the variable, I was able to curl the endpoint and fetch metrics. This also prevented the alert from firing constantly.
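For reference, a minimal sketch of how a charm could inject that variable into the workload's Pebble layer (service name and command are illustrative, and this is not necessarily how the actual fix implements it):

```python
# Illustrative sketch: expose minio's Prometheus metrics without auth by setting
# MINIO_PROMETHEUS_AUTH_TYPE in the workload's Pebble layer.
from ops.pebble import Layer

def minio_pebble_layer() -> Layer:
    return Layer(
        {
            "summary": "minio layer",
            "services": {
                "minio": {
                    "override": "replace",
                    "command": "minio server /data",  # assumed command
                    "startup": "enabled",
                    "environment": {
                        # Let Prometheus scrape /minio/v2/metrics/cluster without a token.
                        "MINIO_PROMETHEUS_AUTH_TYPE": "public",
                    },
                }
            },
        }
    )
```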
Fix is in https://github.com/canonical/minio-operator/pull/157
The metrics endpoint cannot be reached from inside the Pod:
root@dex-auth-0:/# curl -v localhost:5558/metrics
* Trying 127.0.0.1:5558...
* connect to 127.0.0.1 port 5558 failed: Connection refused
* Trying ::1:5558...
* connect to ::1 port 5558 failed: Connection refused
* Failed to connect to localhost port 5558 after 0 ms: Connection refused
* Closing connection 0
curl: (7) Failed to connect to localhost port 5558 after 0 ms: Connection refused
This looks like the other cases, but the upstream documentation doesn't really show how to correctly reach metrics. This app needs more investigation.
This charm deploys a `StatefulSet` with lightkube, which in turn creates the necessary resources for the operator to function correctly. In the current version of this charm, there is no `Service` attached to the Pod that gets created for running the workload, meaning there is no way of reaching its metrics endpoint. This also means that the `MetricsEndpointProvider`'s target is not correct: in the current state of the charm it points at the charm and not the workload, which causes the reported issue because there is no metrics endpoint in the charm.
NOTE: this charm is a bit special, as we are deploying the charm code in a Pod different from the one for the workload.
ubuntu@charm-dev-jammy:~$ kubectl get pods -nkubeflow | grep metacontroller
metacontroller-operator-charm-0 1/1 Running 0 113m # <---- this is the workload, deployed via a StatefulSet
metacontroller-operator-0 1/1 Running 0 114m # <---- this is the charm
Fix is in https://github.com/canonical/metacontroller-operator/pull/101
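For illustration, a rough lightkube sketch of attaching a `Service` to the workload Pods so the metrics endpoint becomes reachable (the name, label selector and port are assumptions, not necessarily what the PR does):

```python
# Illustrative sketch: create a Service with lightkube that selects the workload
# StatefulSet's Pods and exposes the metrics port to the Prometheus scraper.
from lightkube import Client
from lightkube.models.core_v1 import ServicePort, ServiceSpec
from lightkube.models.meta_v1 import ObjectMeta
from lightkube.resources.core_v1 import Service

def create_metrics_service(namespace: str) -> None:
    client = Client()
    svc = Service(
        metadata=ObjectMeta(name="metacontroller-operator-metrics", namespace=namespace),
        spec=ServiceSpec(
            # Assumed label selector matching the workload Pods.
            selector={"app.kubernetes.io/name": "metacontroller-operator-charm"},
            # Assumed metrics port.
            ports=[ServicePort(name="metrics", port=9999, targetPort=9999)],
        ),
    )
    client.create(svc)
```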
For this one, the metrics path is not correctly configured. We say that the metrics are served at `svc:8080/self.config["executor-server-metrics-port-name"]`, which refers to a configuration value that does not exist in the charm's `config.yaml` file. The correct path is `svc:8080/metrics`.
Fix is in https://github.com/canonical/seldon-core-operator/pull/236
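As a sketch (the port here is an assumption, and this is not necessarily how the PR does it), the scrape job can simply hard-code the path instead of interpolating a non-existent config value:

```python
# Illustrative sketch: declare the scrape job with a hard-coded /metrics path
# rather than a value looked up from config.yaml.
from charms.prometheus_k8s.v0.prometheus_scrape import MetricsEndpointProvider
from ops.charm import CharmBase

class SeldonCoreOperator(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self._metrics_endpoint = MetricsEndpointProvider(
            self,
            jobs=[
                {
                    "metrics_path": "/metrics",
                    "static_configs": [{"targets": ["*:8080"]}],  # assumed metrics port
                }
            ],
        )
```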
Part of the fix for this issue will also be to refactor the integration tests for all of our charms that are integrated with prometheus, as these types of errors should have been caught there.
Here are some updates after a bit more debugging
There was a misconfiguration in the telemetry settings for the dex workload. We need to set the `telemetry` value in dex's config file (`/etc/dex/config.docker.yaml`). Alerts were firing for this charm because, due to the missing configuration, there was nothing listening on the metrics port that we passed to the `MetricsEndpointProvider`.
Fix in https://github.com/canonical/dex-auth-operator/pull/185
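A rough sketch of the idea (config path and port taken from the symptoms above; other config keys elided, and this is not necessarily how the PR implements it):

```python
# Illustrative sketch: add a telemetry listener to the dex config the charm pushes
# into the container, so something is actually listening on the metrics port.
import yaml
from ops.model import Container

DEX_CONFIG_PATH = "/etc/dex/config.docker.yaml"

def render_dex_config() -> str:
    config = {
        # ...issuer, storage, connectors, etc. elided...
        "telemetry": {
            # dex serves Prometheus metrics on this listener; 5558 is the port
            # passed to MetricsEndpointProvider.
            "http": "0.0.0.0:5558",
        },
    }
    return yaml.safe_dump(config)

def push_dex_config(container: Container) -> None:
    container.push(DEX_CONFIG_PATH, render_dex_config(), make_dirs=True)
```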
Despite the metrics endpoints being reachable, even from inside the prometheus Pod, alerts are still firing constantly. This requires more investigation.
Querying `up{juju_application="seldon-controller-manager",juju_model="kubeflow",juju_model_uuid="47f52a8c-c74d-4072-8e25-4865db988512"}[5m]` returns two objects showing different values:
up{instance="kubeflow_47f52a8c-c74d-4072-8e25-4865db988512_seldon-controller-manager_seldon-controller-manager/0", job="juju_kubeflow_47f52a8c_seldon-controller-manager_prometheus_scrape-0", juju_application="seldon-controller-manager", juju_charm="seldon-core", juju_model="kubeflow", juju_model_uuid="47f52a8c-c74d-4072-8e25-4865db988512", juju_unit="seldon-controller-manager/0"} # <--- This has a bunch of 0
up{instance="kubeflow_47f52a8c-c74d-4072-8e25-4865db988512_seldon-controller-manager_seldon-controller-manager/0", job="juju_kubeflow_47f52a8c_seldon-controller-manager_prometheus_scrape_seldon_metrics-0", juju_application="seldon-controller-manager", juju_charm="seldon-core", juju_model="kubeflow", juju_model_uuid="47f52a8c-c74d-4072-8e25-4865db988512", juju_unit="seldon-controller-manager/0"} # <--- This has a bunch of 1
I have been able to reproduce the issue with these charms, but it's intermittent. The metrics endpoints are reachable and the port and path are correctly set and passed to the `MetricsEndpointProvider`, but sometimes the error happens and sometimes it doesn't. We need to investigate this case further.

EDIT: The metrics endpoints are not reachable from the prometheus scraper, although the port and path are correctly set and passed to the `MetricsEndpointProvider`; this fires alerts because prometheus cannot scrape the metrics endpoint.
Here are some updates after a bit more debugging
To ensure the metrics endpoint is actually reachable by the prometheus scraper, I have configured the Kubernetes Services of these applications to always expose the metrics port.

Fix for training-operator: https://github.com/canonical/training-operator/pull/151
Fix for jupyter-controller: https://github.com/canonical/notebook-operators/pull/332
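As a sketch, exposing the port can be done with the `kubernetes_service_patch` charm library (the port number and class name here are illustrative; the actual PRs may do this differently):

```python
# Illustrative sketch: patch the charm's Kubernetes Service so it always exposes
# the metrics port, making it reachable by the Prometheus scraper.
from charms.observability_libs.v1.kubernetes_service_patch import KubernetesServicePatch
from lightkube.models.core_v1 import ServicePort
from ops.charm import CharmBase

class TrainingOperatorCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.service_patcher = KubernetesServicePatch(
            self,
            [ServicePort(8080, name="metrics")],  # assumed metrics port
        )
```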
As mentioned before, there is a strange behaviour in seldon, as two scrape jobs are set for this application. It shows up in two different places:
up{instance="kubeflow_47f52a8c-c74d-4072-8e25-4865db988512_seldon-controller-manager_seldon-controller-manager/0", job="juju_kubeflow_47f52a8c_seldon-controller-manager_prometheus_scrape-0", juju_application="seldon-controller-manager", juju_charm="seldon-core", juju_model="kubeflow", juju_model_uuid="47f52a8c-c74d-4072-8e25-4865db988512", juju_unit="seldon-controller-manager/0"} # <--- This has a bunch of 0
up{instance="kubeflow_47f52a8c-c74d-4072-8e25-4865db988512_seldon-controller-manager_seldon-controller-manager/0", job
scrape_jobs: '[{"metrics_path": "/metrics", "static_configs": [{"targets": ["seldon-controller-manager.seldon-only.svc:8080"]}],
"scrape_interval": "30s"}, {"metrics_path": "/metrics", "static_configs":
[{"targets": ["*:80"]}]}]'
This can explain why the alert is firing constantly: there is a job that is scraping metrics from `*:80/metrics`, which is not a valid metrics endpoint.
After a closer inspection of the charm code, I noticed that we are setting an extra job at L164. This extra job was introduced in https://github.com/canonical/seldon-core-operator/pull/94, but the reason for adding it instead of just leaving the regular metrics endpoint is not clear. To fix the issue, we'll remove this extra line to avoid having a misconfigured job that is causing the `UnitsUnavailable` alert to constantly fire.
Fix for seldon: https://github.com/canonical/seldon-core-operator/pull/236
I was not able to reproduce the issue for argo-controller. I checked the relation data and the metrics endpoint, and ensured it is reachable from the prometheus scraper. In the prometheus dashboard, I made a query for the unit and it stays at a constant `1`. It is possible that recent changes in the charm code helped fix the issue. Should this issue be present again for argo, let's file a separate issue.
Alerts were firing constantly mainly because of:

- The `Service` of the application not exposing the metrics port correctly, making it impossible to reach.
- Misconfiguration in the workload itself (e.g. dex needed the `telemetry` value to be set in order to serve metrics).

We were not catching this error in our CI because our test cases only check for the existence of the target. It is important that in the future we have re-usable test cases where we check the actual alerts. In the case of `UnitUnavailable`, it's as easy as querying for the application and asserting the result is `1`.
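For example, a rough sketch of such a check against the Prometheus HTTP API (function and parameter names are made up for illustration):

```python
# Illustrative sketch: query Prometheus for the application's `up` metric and
# assert every resulting series reports 1 (i.e. the scrape target is reachable).
import requests

def assert_scrape_target_up(prometheus_url: str, application: str, model: str) -> None:
    query = f'up{{juju_application="{application}",juju_model="{model}"}}'
    resp = requests.get(f"{prometheus_url}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    assert results, f"no `up` series found for {application}"
    assert all(sample["value"][1] == "1" for sample in results), results
```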
We have planned to improve our integration with COS for 24.04. Some of the things we can improve (based on my experience fixing this issue) are:

- Documentation
- istio-pilot
- Testing
- Services: some charms do not attach a `Service`, which may cause issues when trying to scrape metrics. Using the `kubernetes_service_patch` library can help alleviate this.

All PRs have been merged, so we can close this issue. Feel free to re-open if this is still an issue.
On a recent deployment we're seeing these alerts firing all the time (literally, stuck to "firing"):
Looking at the `up` metric (which these alert rules query), we see that these are alternating between 1 and 0 every 45 seconds (this is a sample from the argo controller, the query being `up{juju_application="argo-controller",juju_..."="..."}[10m]`):
Incidental to this flapping behavior, the duration for these alerts (at least for argo) is set to `0m`, which seems a bit too sensitive for production envs.