canonical / kfp-operators

Kubeflow Pipelines Operators
Apache License 2.0

`apiserver` cannot find `minio` object store if minio application is named anything other than `minio` #369

Closed: ca-scribner closed this issue 10 months ago

ca-scribner commented 10 months ago

Bug Description

Note: This bug is against 1.8-updates-dev-branch, not against main

If kfp-api is deployed with a minio application that is not named `minio`, it fails to find the object store. I think this is caused by problems passing config to the running application (discussed further in #367). Deploying minio as `minio` appears to work, possibly only because `minio` happens to be the default name the application expects.

This was introduced in #354, which refactored configuration passing from a config file to environment variables but appears to have missed some variables, such as the object storage host and port.
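For illustration, a minimal sketch of a charm-side mapping from object-storage relation data to apiserver environment variables, with the host and port included. Everything here is hypothetical: the helper name `build_object_store_env`, the relation keys (`service`, `namespace`, `port`, `access-key`, `secret-key`, `secure`) and the `OBJECTSTORECONFIG_*` variable names are assumptions, not code from the kfp-api charm.

```python
from typing import Dict, Mapping, Union

# Hypothetical shape of the "object-storage" relation data (assumed key names).
RelationData = Mapping[str, Union[str, int, bool]]


def build_object_store_env(relation_data: RelationData) -> Dict[str, str]:
    """Map object-storage relation data to apiserver environment variables.

    Hypothetical helper: the OBJECTSTORECONFIG_* names are assumptions,
    not taken from the kfp-api charm.
    """
    # The endpoint host is "<service>.<namespace>", resolvable via cluster DNS.
    host = f"{relation_data['service']}.{relation_data['namespace']}"
    return {
        # Host and port are the values this issue reports as missing.
        "OBJECTSTORECONFIG_HOST": host,
        "OBJECTSTORECONFIG_PORT": str(relation_data["port"]),
        "OBJECTSTORECONFIG_ACCESSKEY": str(relation_data["access-key"]),
        "OBJECTSTORECONFIG_SECRETACCESSKEY": str(relation_data["secret-key"]),
        # Environment variables are strings, so render the bool as "true"/"false".
        "OBJECTSTORECONFIG_SECURE": str(relation_data["secure"]).lower(),
    }


env = build_object_store_env({
    "service": "minio-service", "namespace": "kubeflow", "port": 9000,
    "access-key": "minio", "secret-key": "example-secret", "secure": False,
})
print(env["OBJECTSTORECONFIG_HOST"], env["OBJECTSTORECONFIG_PORT"])
# prints: minio-service.kubeflow 9000
```

If the host and port entries were left out of such a mapping, the apiserver would have no configured endpoint and would fall back to whatever default it has built in, which is the behaviour this issue describes.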

To Reproduce

Deploy with the following bundle:

bundle.yaml

```yaml
bundle: kubernetes
applications:
  argo-controller:
    charm: argo-controller
    channel: latest/edge
    revision: 393
    series: focal
    resources:
      oci-image: 343
    scale: 1
    constraints: arch=amd64
    trust: true
  envoy:
    charm: envoy
    channel: latest/edge
    revision: 91
    series: focal
    resources:
      oci-image: 91
    scale: 1
    constraints: arch=amd64
  istio-ingressgateway:
    charm: istio-gateway
    channel: latest/edge
    revision: 715
    series: focal
    scale: 1
    options:
      kind: ingress
    constraints: arch=amd64
    trust: true
  istio-pilot:
    charm: istio-pilot
    channel: latest/edge
    revision: 704
    series: focal
    scale: 1
    options:
      default-gateway: kubeflow-gateway
    constraints: arch=amd64
    trust: true
  kfp-api:
    charm: kfp-api
    channel: latest/edge/pr-361
    revision: 941
    series: focal
    scale: 1
    constraints: arch=amd64
    trust: true
  kfp-db:
    charm: mysql-k8s
    channel: 8.0/edge
    revision: 110
    resources:
      mysql-image: 106
    scale: 1
    constraints: arch=amd64 mem=2048
    storage:
      database: kubernetes,1,1024M
    trust: true
  kfp-metadata-writer:
    charm: kfp-metadata-writer
    channel: latest/edge/pr-361
    revision: 29
    series: focal
    scale: 1
    constraints: arch=amd64
    trust: true
  kfp-persistence:
    charm: kfp-persistence
    channel: latest/edge/pr-361
    revision: 945
    series: focal
    scale: 1
    constraints: arch=amd64
    trust: true
  kfp-profile-controller:
    charm: kfp-profile-controller
    channel: latest/edge/pr-361
    revision: 905
    series: focal
    scale: 1
    constraints: arch=amd64
    trust: true
  kfp-schedwf:
    charm: kfp-schedwf
    channel: latest/edge/pr-361
    revision: 958
    series: focal
    scale: 1
    constraints: arch=amd64
    trust: true
  kfp-ui:
    charm: kfp-ui
    channel: latest/edge/pr-361
    revision: 940
    series: focal
    scale: 1
    constraints: arch=amd64
    trust: true
  kfp-viewer:
    charm: kfp-viewer
    channel: latest/edge/pr-361
    revision: 970
    series: focal
    scale: 1
    constraints: arch=amd64
    trust: true
  kfp-viz:
    charm: kfp-viz
    channel: latest/edge/pr-361
    revision: 894
    series: focal
    scale: 1
    constraints: arch=amd64
    trust: true
  kubeflow-dashboard:
    charm: kubeflow-dashboard
    channel: latest/edge
    revision: 448
    series: focal
    resources:
      oci-image: 672
    scale: 1
    constraints: arch=amd64
    trust: true
  kubeflow-profiles:
    charm: kubeflow-profiles
    channel: latest/edge
    revision: 350
    series: focal
    resources:
      kfam-image: 565
      profile-image: 563
    scale: 1
    constraints: arch=amd64
    trust: true
  kubeflow-roles:
    charm: kubeflow-roles
    channel: latest/edge
    revision: 183
    series: focal
    scale: 1
    constraints: arch=amd64
    trust: true
  minio2:
    charm: minio
    channel: latest/edge
    revision: 251
    series: focal
    resources:
      oci-image: 533
    scale: 1
    constraints: arch=amd64
    storage:
      minio-data: kubernetes,1,10240M
  mlmd:
    charm: mlmd
    channel: latest/edge
    revision: 124
    series: focal
    resources:
      oci-image: 127
    scale: 1
    constraints: arch=amd64
    storage:
      mlmd-data: kubernetes,1,10240M
relations:
- - argo-controller:object-storage
  - minio2:object-storage
- - kfp-api:relational-db
  - kfp-db:database
- - kfp-api:kfp-api
  - kfp-persistence:kfp-api
- - kfp-api:kfp-api
  - kfp-ui:kfp-api
- - kfp-api:kfp-viz
  - kfp-viz:kfp-viz
- - kfp-api:object-storage
  - minio2:object-storage
- - kfp-profile-controller:object-storage
  - minio2:object-storage
- - kfp-ui:object-storage
  - minio2:object-storage
- - kubeflow-profiles:kubeflow-profiles
  - kubeflow-dashboard:kubeflow-profiles
- - kubeflow-dashboard:links
  - kfp-ui:dashboard-links
- - mlmd:grpc
  - envoy:grpc
- - mlmd:grpc
  - kfp-metadata-writer:grpc
- - istio-pilot:istio-pilot
  - istio-ingressgateway:istio-pilot
- - istio-pilot:ingress
  - kfp-ui:ingress
- - envoy:ingress
  - istio-pilot:ingress
- - istio-pilot:ingress
  - kubeflow-dashboard:ingress
```

Then watch the apiserver logs using `kubectl logs kfp-api-0 -c ml-pipeline-api-server`.
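For a long log such as the one in this report, it can help to filter for the fatal error instead of reading it by eye. A small sketch, assuming the output of the `kubectl logs` command above has been saved to a file; the regular expression is written against the line format seen in this issue (timestamp, `[apiserver]` tag, glog `F` severity prefix), and the function name `minio_failures` is hypothetical.

```python
import re
from typing import Iterable, List

# In this issue's log, each apiserver line starts with a timestamp and an
# "[apiserver]" tag; fatal glog lines then begin with "F" plus the MMDD date.
FATAL_MINIO = re.compile(
    r"^(?P<ts>\S+) \[apiserver\] F\d{4} .*Failed to create Minio client"
)


def minio_failures(lines: Iterable[str]) -> List[str]:
    """Return the timestamps of fatal Minio-client errors (hypothetical helper)."""
    return [m.group("ts") for line in lines if (m := FATAL_MINIO.match(line))]


sample = [
    "2023-11-01T17:29:39.791Z [apiserver] F1101 17:29:39.791389 23 minio.go:76] "
    "Failed to create Minio client. Error: ...",
    '2023-11-01T17:29:39.815Z [pebble] Service "apiserver" stopped unexpectedly with code 255',
]
print(minio_failures(sample))
# prints: ['2023-11-01T17:29:39.791Z']

# With a saved log (file name is illustrative):
# with open("apiserver.log") as f:
#     print(minio_failures(f))
```

Several timestamps a few minutes apart indicate the crash loop rather than a one-off failure.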

Environment

Tested with Juju 3.1, MicroK8s 1.25-strict, and the bundle above.

Relevant log output

Example failures:

kubectl logs kfp-api-0 -c ml-pipeline-api-server -f

```
2023-11-01T17:20:17.487Z [pebble] HTTP API server listening on ":38813".
2023-11-01T17:20:17.487Z [pebble] Started daemon.
2023-11-01T17:22:31.396Z [pebble] GET /v1/services?names= 2.177115ms 200
2023-11-01T17:22:31.406Z [pebble] GET /v1/plan?format=yaml 1.493758ms 200
2023-11-01T17:22:48.929Z [pebble] GET /v1/plan?format=yaml 347.46µs 200
2023-11-01T17:22:48.939Z [pebble] POST /v1/layers 1.426821ms 200
2023-11-01T17:22:48.989Z [pebble] POST /v1/services 21.296452ms 202
2023-11-01T17:22:48.998Z [pebble] Service "apiserver" starting: bash -c 'sleep 1.1 && /bin/apiserver --config=/config --sampleconfig=/config/sample_config.json -logtostderr=true '
2023-11-01T17:22:50.009Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 1.014947211s 200
2023-11-01T17:22:50.515Z [apiserver] I1101 17:22:50.515110 23 client_manager.go:160] Initializing client manager
2023-11-01T17:22:50.518Z [apiserver] I1101 17:22:50.516518 23 config.go:57] Config DBConfig.ExtraParams not specified, skipping
2023-11-01T17:22:58.881Z [pebble] GET /v1/plan?format=yaml 1.721058ms 200
2023-11-01T17:22:58.890Z [pebble] POST /v1/layers 1.198154ms 200
2023-11-01T17:22:58.931Z [pebble] POST /v1/services 10.78643ms 202
2023-11-01T17:22:58.948Z [pebble] GET /v1/changes/2/wait?timeout=4.000s 14.352408ms 200
2023-11-01T17:24:45.092Z [pebble] GET /v1/plan?format=yaml 2.233712ms 200
2023-11-01T17:24:45.100Z [pebble] POST /v1/layers 917.223µs 200
2023-11-01T17:24:45.169Z [pebble] POST /v1/services 11.829742ms 202
2023-11-01T17:24:45.183Z [pebble] GET /v1/changes/3/wait?timeout=4.000s 10.954627ms 200
2023-11-01T17:24:45.462Z [pebble] GET /v1/checks?names=kfp-api-up 612.191µs 200
2023-11-01T17:25:12.022Z [pebble] GET /v1/plan?format=yaml 751.033µs 200
2023-11-01T17:29:06.638Z [pebble] GET /v1/plan?format=yaml 489.109µs 200
2023-11-01T17:29:06.642Z [pebble] POST /v1/layers 677.329µs 200
2023-11-01T17:29:06.660Z [pebble] POST /v1/services 4.159969ms 202
2023-11-01T17:29:06.671Z [pebble] GET /v1/changes/4/wait?timeout=4.000s 10.279515ms 200
2023-11-01T17:29:06.895Z [pebble] GET /v1/checks?names=kfp-api-up 85.649µs 200
2023-11-01T17:29:39.791Z [apiserver] F1101 17:29:39.791389 23 minio.go:76] Failed to create Minio client. Error: Error while creating minio client: Endpoint: does not follow ip address or domain name standards.: Endpoint: does not follow ip address or domain name standards.
2023-11-01T17:29:39.815Z [pebble] Service "apiserver" stopped unexpectedly with code 255
2023-11-01T17:29:39.815Z [pebble] Service "apiserver" on-failure action is "restart", waiting ~500ms before restart (backoff 1)
2023-11-01T17:29:40.343Z [pebble] Service "apiserver" starting: bash -c 'sleep 1.1 && /bin/apiserver --config=/config --sampleconfig=/config/sample_config.json -logtostderr=true '
2023-11-01T17:29:41.544Z [apiserver] I1101 17:29:41.544909 63 client_manager.go:160] Initializing client manager
2023-11-01T17:29:41.545Z [apiserver] I1101 17:29:41.545109 63 config.go:57] Config DBConfig.ExtraParams not specified, skipping
2023-11-01T17:32:38.608Z [pebble] GET /v1/services?names= 109.461µs 200
2023-11-01T17:32:44.230Z [pebble] GET /v1/files?action=list&path=%2F 1.108685ms 200
2023-11-01T17:32:47.054Z [pebble] GET /v1/files?action=list&path=%2Fconfig 300.03µs 200
2023-11-01T17:33:01.778Z [pebble] POST /v1/exec 647.601µs 400
2023-11-01T17:33:13.603Z [pebble] POST /v1/exec 201.846µs 400
2023-11-01T17:33:19.414Z [pebble] POST /v1/exec 6.278075ms 202
2023-11-01T17:33:19.423Z [pebble] GET /v1/tasks/5/websocket/control 6.298648ms 200
2023-11-01T17:33:19.424Z [pebble] GET /v1/tasks/5/websocket/stdio 145.64µs 200
2023-11-01T17:33:28.385Z [pebble] GET /v1/changes/5/wait 8.960705919s 200
2023-11-01T17:34:06.644Z [pebble] Check "kfp-api-up" failure 1 (threshold 3): Get "http://localhost:8888/apis/v1beta1/healthz": dial tcp [::1]:8888: connect: connection refused
2023-11-01T17:34:36.692Z [pebble] GET /v1/plan?format=yaml 452.927µs 200
2023-11-01T17:34:36.697Z [pebble] POST /v1/layers 1.576055ms 200
2023-11-01T17:34:36.715Z [pebble] POST /v1/services 4.787856ms 202
2023-11-01T17:34:36.730Z [pebble] GET /v1/changes/6/wait?timeout=4.000s 13.826906ms 200
2023-11-01T17:34:36.971Z [pebble] GET /v1/checks?names=kfp-api-up 82.805µs 200
2023-11-01T17:36:22.532Z [apiserver] F1101 17:36:22.532228 63 minio.go:76] Failed to create Minio client. Error: Error while creating minio client: Endpoint: does not follow ip address or domain name standards.: Endpoint: does not follow ip address or domain name standards.
2023-11-01T17:36:22.536Z [pebble] Service "apiserver" stopped unexpectedly with code 255
2023-11-01T17:36:22.536Z [pebble] Service "apiserver" on-failure action is "restart", waiting ~500ms before restart (backoff 1)
2023-11-01T17:36:23.052Z [pebble] Service "apiserver" starting: bash -c 'sleep 1.1 && /bin/apiserver --config=/config --sampleconfig=/config/sample_config.json -logtostderr=true '
2023-11-01T17:36:24.227Z [apiserver] I1101 17:36:24.227802 80 client_manager.go:160] Initializing client manager
2023-11-01T17:36:24.228Z [apiserver] I1101 17:36:24.228125 80 config.go:57] Config DBConfig.ExtraParams not specified, skipping
2023-11-01T17:37:37.011Z [pebble] POST /v1/exec 6.93502ms 202
2023-11-01T17:37:37.018Z [pebble] GET /v1/tasks/7/websocket/control 5.607235ms 200
2023-11-01T17:37:37.019Z [pebble] GET /v1/tasks/7/websocket/stdio 304.864µs 200
2023-11-01T17:37:37.034Z [pebble] GET /v1/changes/7/wait 13.254476ms 200
2023-11-01T17:37:39.932Z [pebble] POST /v1/exec 8.235057ms 202
2023-11-01T17:37:39.939Z [pebble] GET /v1/tasks/8/websocket/control 6.073866ms 200
2023-11-01T17:37:39.940Z [pebble] GET /v1/tasks/8/websocket/stdio 118.281µs 200
2023-11-01T17:37:39.958Z [pebble] GET /v1/changes/8/wait 16.469671ms 200
2023-11-01T17:37:44.222Z [pebble] POST /v1/exec 8.204988ms 202
2023-11-01T17:37:44.230Z [pebble] GET /v1/tasks/9/websocket/control 6.569159ms 200
2023-11-01T17:37:44.230Z [pebble] GET /v1/tasks/9/websocket/stdio 189.88µs 200
2023-11-01T17:37:44.250Z [pebble] GET /v1/changes/9/wait 17.820131ms 200
2023-11-01T17:37:49.390Z [pebble] POST /v1/exec 10.865234ms 202
2023-11-01T17:37:49.399Z [pebble] GET /v1/tasks/10/websocket/control 4.813009ms 200
2023-11-01T17:37:49.399Z [pebble] GET /v1/tasks/10/websocket/stdio 144.091µs 200
2023-11-01T17:37:49.416Z [pebble] GET /v1/changes/10/wait 16.009475ms 200
2023-11-01T17:37:53.373Z [pebble] POST /v1/exec 305.187µs 400
2023-11-01T17:37:55.617Z [pebble] POST /v1/exec 11.324035ms 202
2023-11-01T17:37:55.625Z [pebble] GET /v1/tasks/11/websocket/control 6.191422ms 200
2023-11-01T17:37:55.626Z [pebble] GET /v1/tasks/11/websocket/stdio 99.221µs 200
2023-11-01T17:37:55.643Z [pebble] GET /v1/changes/11/wait 16.937856ms 200
2023-11-01T17:37:59.223Z [pebble] POST /v1/exec 10.113631ms 202
2023-11-01T17:37:59.233Z [pebble] GET /v1/tasks/12/websocket/control 8.554426ms 200
2023-11-01T17:37:59.234Z [pebble] GET /v1/tasks/12/websocket/stdio 121.123µs 200
2023-11-01T17:37:59.254Z [pebble] GET /v1/changes/12/wait 19.098103ms 200
2023-11-01T17:39:08.324Z [pebble] GET /v1/plan?format=yaml 494.677µs 200
2023-11-01T17:39:08.329Z [pebble] POST /v1/layers 992.744µs 200
2023-11-01T17:39:08.363Z [pebble] POST /v1/services 18.185372ms 202
2023-11-01T17:39:08.395Z [pebble] GET /v1/changes/13/wait?timeout=4.000s 29.491044ms 200
2023-11-01T17:39:08.611Z [pebble] GET /v1/checks?names=kfp-api-up 94.444µs 200
2023-11-01T17:42:39.007Z [apiserver] F1101 17:42:39.006993 80 minio.go:76] Failed to create Minio client. Error: Error while creating minio client: Endpoint: does not follow ip address or domain name standards.: Endpoint: does not follow ip address or domain name standards.
2023-11-01T17:42:39.023Z [pebble] Service "apiserver" stopped unexpectedly with code 255
2023-11-01T17:42:39.023Z [pebble] Service "apiserver" on-failure action is "restart", waiting ~500ms before restart (backoff 1)
2023-11-01T17:42:39.563Z [pebble] Service "apiserver" starting: bash -c 'sleep 1.1 && /bin/apiserver --config=/config --sampleconfig=/config/sample_config.json -logtostderr=true '
2023-11-01T17:42:40.747Z [apiserver] I1101 17:42:40.747558 102 client_manager.go:160] Initializing client manager
2023-11-01T17:42:40.751Z [apiserver] I1101 17:42:40.750801 102 config.go:57] Config DBConfig.ExtraParams not specified, skipping
2023-11-01T17:44:08.330Z [pebble] Check "kfp-api-up" failure 1 (threshold 3): Get "http://localhost:8888/apis/v1beta1/healthz": dial tcp [::1]:8888: connect: connection refused
2023-11-01T17:44:58.245Z [pebble] GET /v1/plan?format=yaml 447.912µs 200
2023-11-01T17:44:58.249Z [pebble] POST /v1/layers 574.189µs 200
2023-11-01T17:44:58.271Z [pebble] POST /v1/services 10.046961ms 202
2023-11-01T17:44:58.300Z [pebble] GET /v1/changes/14/wait?timeout=4.000s 27.561082ms 200
2023-11-01T17:44:58.527Z [pebble] GET /v1/checks?names=kfp-api-up 89.478µs 200
2023-11-01T17:48:45.915Z [pebble] GET /v1/changes?select=all 451.838µs 200
2023-11-01T17:49:18.498Z [pebble] Cannot get service-name for change 8
2023-11-01T17:49:18.498Z [pebble] Cannot get service-name for change 9
2023-11-01T17:49:18.498Z [pebble] Cannot get service-name for change 10
2023-11-01T17:49:18.498Z [pebble] Cannot get service-name for change 11
2023-11-01T17:49:18.498Z [pebble] Cannot get service-name for change 12
2023-11-01T17:49:18.498Z [pebble] Cannot get service-name for change 5
2023-11-01T17:49:18.498Z [pebble] Cannot get service-name for change 7
2023-11-01T17:49:18.498Z [pebble] GET /v1/changes?for=apiserver&select=all 302.607µs 200
2023-11-01T17:49:34.824Z [pebble] GET /v1/warnings 99.611µs 200
2023-11-01T17:49:40.323Z [apiserver] F1101 17:49:40.322972 102 minio.go:76] Failed to create Minio client. Error: Error while creating minio client: Endpoint: does not follow ip address or domain name standards.: Endpoint: does not follow ip address or domain name standards.
2023-11-01T17:49:40.328Z [pebble] Service "apiserver" stopped unexpectedly with code 255
2023-11-01T17:49:40.328Z [pebble] Service "apiserver" on-failure action is "restart", waiting ~500ms before restart (backoff 1)
2023-11-01T17:49:40.868Z [pebble] Service "apiserver" starting: bash -c 'sleep 1.1 && /bin/apiserver --config=/config --sampleconfig=/config/sample_config.json -logtostderr=true '
2023-11-01T17:49:42.033Z [apiserver] I1101 17:49:42.033853 118 client_manager.go:160] Initializing client manager
2023-11-01T17:49:42.033Z [apiserver] I1101 17:49:42.033956 118 config.go:57] Config DBConfig.ExtraParams not specified, skipping
2023-11-01T17:49:45.167Z [pebble] GET /v1/warnings 84.51µs 200
2023-11-01T17:49:51.241Z [pebble] GET /v1/logs?n=30 520.153µs 200
2023-11-01T17:49:58.251Z [pebble] Check "kfp-api-up" failure 1 (threshold 3): Get "http://localhost:8888/apis/v1beta1/healthz": dial tcp [::1]:8888: connect: connection refused
2023-11-01T17:50:55.980Z [pebble] GET /v1/plan?format=yaml 761.581µs 200
2023-11-01T17:50:55.986Z [pebble] POST /v1/layers 858.636µs 200
2023-11-01T17:50:56.014Z [pebble] POST /v1/services 10.523863ms 202
2023-11-01T17:50:56.046Z [pebble] GET /v1/changes/15/wait?timeout=4.000s 30.09712ms 200
2023-11-01T17:50:56.295Z [pebble] GET /v1/checks?names=kfp-api-up 76.77µs 200
2023-11-01T17:55:55.987Z [pebble] Check "kfp-api-up" failure 1 (threshold 3): Get "http://localhost:8888/apis/v1beta1/healthz": dial tcp [::1]:8888: connect: connection refused
2023-11-01T17:56:11.867Z [apiserver] F1101 17:56:11.867047 118 minio.go:76] Failed to create Minio client. Error: Error while creating minio client: Endpoint: does not follow ip address or domain name standards.: Endpoint: does not follow ip address or domain name standards.
2023-11-01T17:56:11.891Z [pebble] Service "apiserver" stopped unexpectedly with code 255
2023-11-01T17:56:11.891Z [pebble] Service "apiserver" on-failure action is "restart", waiting ~500ms before restart (backoff 1)
2023-11-01T17:56:12.413Z [pebble] Service "apiserver" starting: bash -c 'sleep 1.1 && /bin/apiserver --config=/config --sampleconfig=/config/sample_config.json -logtostderr=true '
2023-11-01T17:56:13.572Z [apiserver] I1101 17:56:13.572589 130 client_manager.go:160] Initializing client manager
2023-11-01T17:56:13.572Z [apiserver] I1101 17:56:13.572738 130 config.go:57] Config DBConfig.ExtraParams not specified, skipping
2023-11-01T17:56:33.465Z [pebble] GET /v1/plan?format=yaml 800.204µs 200
2023-11-01T17:56:33.471Z [pebble] POST /v1/layers 1.099992ms 200
2023-11-01T17:56:33.525Z [pebble] POST /v1/services 34.762332ms 202
2023-11-01T17:56:33.536Z [pebble] GET /v1/changes/16/wait?timeout=4.000s 9.024238ms 200
2023-11-01T17:56:33.790Z [pebble] GET /v1/checks?names=kfp-api-up 121.566µs 200
2023-11-01T18:00:41.319Z [pebble] GET /v1/plan?format=yaml 628.728µs 200
2023-11-01T18:00:41.323Z [pebble] POST /v1/layers 897.92µs 200
2023-11-01T18:00:41.345Z [pebble] POST /v1/services 9.736479ms 202
2023-11-01T18:00:41.374Z [pebble] GET /v1/changes/17/wait?timeout=4.000s 25.488785ms 200
2023-11-01T18:00:41.625Z [pebble] GET /v1/checks?names=kfp-api-up 114.593µs 200
2023-11-01T18:02:35.315Z [apiserver] F1101 18:02:35.315226 130 minio.go:76] Failed to create Minio client. Error: Error while creating minio client: Endpoint: does not follow ip address or domain name standards.: Endpoint: does not follow ip address or domain name standards.
```
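The restart loop in this log is Pebble at work: each time `apiserver` exits with code 255, Pebble's `on-failure` action restarts it after a short backoff (~500ms), and the `kfp-api-up` check keeps failing with "connection refused" because nothing is listening on port 8888. A minimal sketch of a Pebble layer that would behave this way, written as the Python dict a charm would typically hand to the `ops` library's `Container.add_layer`. The command and check URL are adapted from the log; the helper name `apiserver_layer` and the environment contents are hypothetical.

```python
import json
from typing import Dict

# Command and health-check URL adapted from the log above.
APISERVER_COMMAND = (
    "bash -c 'sleep 1.1 && /bin/apiserver --config=/config "
    "--sampleconfig=/config/sample_config.json -logtostderr=true'"
)
HEALTHZ_URL = "http://localhost:8888/apis/v1beta1/healthz"


def apiserver_layer(environment: Dict[str, str]) -> dict:
    """Build a Pebble layer for the apiserver service (hypothetical helper)."""
    return {
        "summary": "kfp-api layer (sketch)",
        "services": {
            "apiserver": {
                "override": "replace",
                "command": APISERVER_COMMAND,
                "startup": "enabled",
                # On a non-zero exit Pebble restarts the service with backoff,
                # which is the repeated start/stop cycle seen in the log.
                "on-failure": "restart",
                # The object-store settings would have to be passed here.
                "environment": environment,
            }
        },
        "checks": {
            "kfp-api-up": {
                "override": "replace",
                "threshold": 3,
                "http": {"url": HEALTHZ_URL},
            }
        },
    }


layer = apiserver_layer({"OBJECTSTORECONFIG_HOST": "minio-service.kubeflow"})
print(json.dumps(layer, indent=2))
```

Restarting on failure keeps the unit trying, but it cannot help if the environment never contains a usable object-store endpoint.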

Additional context

No response

ca-scribner commented 10 months ago

To summarise further: after #354, kfp-api's apiserver does not use the provided object store hostname/port to find the object store. Instead, because we do not pass any value, the apiserver uses its built-in default of `minio` (or maybe `minio.kubeflow`?) to look for the object store. This means it works by accident if Charmed Kubeflow includes a minio application named exactly `minio`, but it fails if you, say, run `juju deploy minio minio2; juju relate minio2 kfp-api`.

DnPlas commented 10 months ago

The Pipelines apiserver does not use the application name; it uses the object store host and port, which are built from minio's Service name, Namespace and Port. These are read from environment variables to form something like `minio-service.kubeflow:9000`. In their absence, the apiserver falls back to a default minio service, as this code comment explains:

// If the env vars do not exist, we guess that we are running in KFP multi user mode, so default minio service should be minio-service.kubeflow:9000.

The minio charm guarantees this default: we explicitly set the default Service name to `minio-service`, and because minio is deployed in the `kubeflow` model, the namespace is also guaranteed. The application name does not affect this behaviour. What is concerning, though, is that the apiserver is restarted over and over again until it can actually see the minio service, and, as described in https://github.com/canonical/kfp-operators/issues/367, this does not seem to happen correctly.
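The fallback described in this comment can be approximated as below. This is a Python sketch, not the apiserver's Go code. It assumes the environment variables in question are the service-discovery variables Kubernetes injects for a Service named `minio-service` in the pod's namespace (`MINIO_SERVICE_SERVICE_HOST` and `MINIO_SERVICE_SERVICE_PORT`); the function name `resolve_minio_endpoint` is hypothetical.

```python
import os
from typing import Mapping, Optional

# Default endpoint quoted in the apiserver comment above.
DEFAULT_MINIO_ENDPOINT = "minio-service.kubeflow:9000"


def resolve_minio_endpoint(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "host:port" for the object store (hypothetical sketch)."""
    env = os.environ if env is None else env
    # Kubernetes injects <NAME>_SERVICE_HOST/_PORT for Services that exist in
    # the pod's namespace when the pod starts; "minio-service" -> MINIO_SERVICE.
    host = env.get("MINIO_SERVICE_SERVICE_HOST", "")
    port = env.get("MINIO_SERVICE_SERVICE_PORT", "")
    if host and port:
        return f"{host}:{port}"
    # No discovery variables: assume multi-user mode and use the default.
    return DEFAULT_MINIO_ENDPOINT


print(resolve_minio_endpoint({}))
# prints: minio-service.kubeflow:9000
print(resolve_minio_endpoint({"MINIO_SERVICE_SERVICE_HOST": "10.152.183.5",
                              "MINIO_SERVICE_SERVICE_PORT": "9000"}))
# prints: 10.152.183.5:9000
```

Under this reading, the application name matters only indirectly: the default lookup succeeds as long as a Service called `minio-service` exists in the `kubeflow` namespace, which is what the minio charm guarantees.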

ca-scribner commented 10 months ago

After talking with @DnPlas, we think that, with #375 merged, we can close this.

The original issue here and @DnPlas's comment are both roughly correct. As far as I can tell, they refer to different versions of the minio charm: the original report aligns with a pre-KF 1.7 minio charm, while @DnPlas's comment is correct for a minio charm that includes this change. Either way, with #375 merged, both discussions should be settled.