canonical / kfp-operators

Kubeflow Pipelines Operators
Apache License 2.0

`kfp-persistence` has invalid pebble health check #514

Open orfeas-k opened 3 months ago

orfeas-k commented 3 months ago

Bug Description

kfp-persistence has a Pebble health check that probes a metrics endpoint (`http://localhost:8080/metrics`). However, the charm does not implement a MetricsEndpointProvider, and the upstream code does not appear to expose any metrics either, so the check can never succeed. The check was introduced during the sidecar rewrite with baseCharm, which suggests it may stem from a misunderstanding of how we use health checks. The check should therefore be removed.
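
For illustration, here is a minimal sketch of what removing the check amounts to at the Pebble layer level. The layer is reconstructed from the log output in this issue (check name, URL, threshold, and service command). The 30s period is inferred from the spacing of the failures, and the `summary` strings and the `drop_check` helper are hypothetical. The actual charm code may be structured differently.

```python
import copy

# Check name as it appears in the Pebble log output of this issue.
CHECK_NAME = "persistenceagent-get"

# Assumed reconstruction of the layer; not copied from the charm source.
layer_with_check = {
    "summary": "persistenceagent layer",  # hypothetical summary
    "services": {
        "persistenceagent": {
            "override": "replace",
            "summary": "KFP persistence agent",  # hypothetical summary
            "command": (
                "persistence_agent --logtostderr=true --namespace= "
                "--ttlSecondsAfterWorkflowFinish=86400 --numWorker=2 "
                "--mlPipelineAPIServerName=kfp-api.kubeflow"
            ),
            "startup": "enabled",
        }
    },
    "checks": {
        CHECK_NAME: {
            "override": "replace",
            "period": "30s",  # inferred from the 30s gap between failures
            "threshold": 3,   # from "threshold 3" in the log
            "http": {"url": "http://localhost:8080/metrics"},
        }
    },
}


def drop_check(layer: dict, check_name: str) -> dict:
    """Return a copy of a Pebble layer dict without the named check."""
    result = copy.deepcopy(layer)
    checks = result.get("checks", {})
    checks.pop(check_name, None)
    if not checks:
        # Drop the now-empty "checks" section entirely.
        result.pop("checks", None)
    # Also drop any service-level reaction to the removed check.
    for service in result.get("services", {}).values():
        on_failure = service.get("on-check-failure", {})
        on_failure.pop(check_name, None)
        if "on-check-failure" in service and not on_failure:
            del service["on-check-failure"]
    return result


layer_without_check = drop_check(layer_with_check, CHECK_NAME)
```

The helper returns a new dict and leaves the input untouched, so the two layers can be compared side by side.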

To Reproduce

Deploy kfp-persistence and relate it to its required dependencies, then follow the logs of the `persistenceagent` container.

Environment

Juju 3.5, Microk8s 1.28

Relevant Log Output

$ kfl kfp-persistence-0 -c persistenceagent -f
2024-06-12T08:52:35.461Z [pebble] HTTP API server listening on ":38813".
2024-06-12T08:52:35.461Z [pebble] Started daemon.
2024-06-12T08:52:54.189Z [pebble] GET /v1/plan?format=yaml 78.41µs 200
2024-06-12T08:52:54.190Z [pebble] POST /v1/layers 166.969µs 200
2024-06-12T08:53:05.499Z [pebble] GET /v1/notices?timeout=30s 30.000493302s 200
2024-06-12T08:53:35.500Z [pebble] GET /v1/notices?timeout=30s 30.001060881s 200
2024-06-12T08:54:05.501Z [pebble] GET /v1/notices?timeout=30s 30.000893481s 200
2024-06-12T08:54:13.983Z [pebble] POST /v1/files 3.690543ms 200
2024-06-12T08:54:14.005Z [pebble] GET /v1/plan?format=yaml 162.142µs 200
2024-06-12T08:54:14.007Z [pebble] POST /v1/layers 296.708µs 200
2024-06-12T08:54:14.011Z [pebble] POST /v1/services 4.262304ms 202
2024-06-12T08:54:14.014Z [pebble] GET /v1/notices?timeout=30s 8.512968209s 200
2024-06-12T08:54:14.015Z [pebble] Service "persistenceagent" starting: persistence_agent --logtostderr=true --namespace= --ttlSecondsAfterWorkflowFinish=86400 --numWorker=2 --mlPipelineAPIServerName=kfp-api.kubeflow
2024-06-12T08:54:14.096Z [persistenceagent] W0612 08:54:14.096332      15 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024-06-12T08:54:15.022Z [pebble] GET /v1/notices?after=2024-06-12T08%3A54%3A14.011973404Z&timeout=30s 1.007109898s 200
2024-06-12T08:54:15.022Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 1.010184868s 200
2024-06-12T08:54:15.055Z [pebble] GET /v1/services 83.884µs 200
2024-06-12T08:54:17.391Z [pebble] GET /v1/services 49.967µs 200
2024-06-12T08:54:44.011Z [pebble] Check "persistenceagent-get" failure 1 (threshold 3): Get "http://localhost:8080/metrics": dial tcp [::1]:8080: connect: connection refused
2024-06-12T08:54:45.023Z [pebble] GET /v1/notices?after=2024-06-12T08%3A54%3A15.017313558Z&timeout=30s 30.00090974s 200
2024-06-12T08:55:14.008Z [pebble] Check "persistenceagent-get" failure 2 (threshold 3): Get "http://localhost:8080/metrics": dial tcp [::1]:8080: connect: connection refused
2024-06-12T08:55:15.024Z [pebble] GET /v1/notices?after=2024-06-12T08%3A54%3A15.017313558Z&timeout=30s 30.000130261s 200
2024-06-12T08:55:44.010Z [pebble] Check "persistenceagent-get" failure 3 (threshold 3): Get "http://localhost:8080/metrics": dial tcp [::1]:8080: connect: connection refused
2024-06-12T08:55:44.010Z [pebble] Check "persistenceagent-get" failure threshold 3 hit, triggering action
2024-06-12T08:55:45.025Z [pebble] GET /v1/notices?after=2024-06-12T08%3A54%3A15.017313558Z&timeout=30s 30.001000892s 200
2024-06-12T08:56:14.011Z [pebble] Check "persistenceagent-get" failure 4 (threshold 3): Get "http://localhost:8080/metrics": dial tcp [::1]:8080: connect: connection refused
2024-06-12T08:56:15.026Z [pebble] GET /v1/notices?after=2024-06-12T08%3A54%3A15.017313558Z&timeout=30s 30.000986384s 200
2024-06-12T08:56:16.458Z [persistenceagent] time="2024-06-12T08:56:16Z" level=fatal msg="Error creating ML pipeline API Server client: Failed to initialize pipeline client. Error: Waiting for ml pipeline API server failed after all attempts.: Get \"http://kfp-api.kubeflow:8888/apis/v1beta1/healthz\": dial tcp 10.152.183.187:8888: connect: connection refused: Waiting for ml pipeline API server failed after all attempts.: Get \"http://kfp-api.kubeflow:8888/apis/v1beta1/healthz\": dial tcp 10.152.183.187:8888: connect: connection refused"
2024-06-12T08:56:16.461Z [pebble] Service "persistenceagent" stopped unexpectedly with code 1
2024-06-12T08:56:16.461Z [pebble] Service "persistenceagent" on-failure action is "restart", waiting ~500ms before restart (backoff 1)
2024-06-12T08:56:17.002Z [pebble] Service "persistenceagent" starting: persistence_agent --logtostderr=true --namespace= --ttlSecondsAfterWorkflowFinish=86400 --numWorker=2 --mlPipelineAPIServerName=kfp-api.kubeflow
2024-06-12T08:56:17.033Z [persistenceagent] W0612 08:56:17.033566      29 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024-06-12T08:56:44.011Z [pebble] Check "persistenceagent-get" failure 5 (threshold 3): Get "http://localhost:8080/metrics": dial tcp [::1]:8080: connect: connection refused
2024-06-12T08:56:45.028Z [pebble] GET /v1/notices?after=2024-06-12T08%3A54%3A15.017313558Z&timeout=30s 30.000947153s 200
2024-06-12T08:57:14.010Z [pebble] Check "persistenceagent-get" failure 6 (threshold 3): Get "http://localhost:8080/metrics": dial tcp [::1]:8080: connect: connection refused
2024-06-12T08:57:15.029Z [pebble] GET /v1/notices?after=2024-06-12T08%3A54%3A15.017313558Z&timeout=30s 30.001004338s 200
2024-06-12T08:57:44.011Z [pebble] Check "persistenceagent-get" failure 7 (threshold 3): Get "http://localhost:8080/metrics": dial tcp [::1]:8080: connect: connection refused
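
To confirm from a captured log that the check never succeeds, the failure lines above can be summarized with a short script. This is a rough sketch: the regular expression is written against the line format seen in this log only, and `summarize_check_failures` is a hypothetical helper, not part of Pebble or the charm.

```python
import re

# Matches lines such as:
#   Check "persistenceagent-get" failure 1 (threshold 3): Get "...": ...
CHECK_FAILURE = re.compile(
    r'Check "(?P<name>[^"]+)" failure (?P<count>\d+) '
    r'\(threshold (?P<threshold>\d+)\): (?P<error>.*)'
)

# Two lines copied from the log output in this issue.
SAMPLE_LOG = '''\
2024-06-12T08:54:44.011Z [pebble] Check "persistenceagent-get" failure 1 (threshold 3): Get "http://localhost:8080/metrics": dial tcp [::1]:8080: connect: connection refused
2024-06-12T08:57:44.011Z [pebble] Check "persistenceagent-get" failure 7 (threshold 3): Get "http://localhost:8080/metrics": dial tcp [::1]:8080: connect: connection refused
'''


def summarize_check_failures(log_text: str) -> dict:
    """Map each check name to its highest failure count, threshold, and last error."""
    summary = {}
    for line in log_text.splitlines():
        match = CHECK_FAILURE.search(line)
        if match is None:
            continue
        entry = summary.setdefault(
            match["name"],
            {"failures": 0, "threshold": int(match["threshold"]), "last_error": ""},
        )
        entry["failures"] = max(entry["failures"], int(match["count"]))
        entry["last_error"] = match["error"]
    return summary


summary = summarize_check_failures(SAMPLE_LOG)
```

A failure count that keeps growing past the threshold with the same "connection refused" error is consistent with nothing listening on port 8080.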

Additional Context

No response

syncronize-issues-to-jira[bot] commented 3 months ago

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5863.

This message was autogenerated