canonical / kfp-operators

Kubeflow Pipelines Operators
Apache License 2.0

Cannot retrieve the pipelines after upgrading Kubeflow from 1.8 to 1.9 #584

Open eleblebici opened 1 week ago

eleblebici commented 1 week ago

Bug Description

After upgrading Kubeflow from 1.8 to 1.9, the pipelines cannot be retrieved in the UI. The UI shows the following error:

An error occurred
upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: 111

We have the following log in the "istio-ingressgateway-workload":

2024-11-08T07:10:45.273942613Z [2024-11-08T07:10:44.383Z] "GET /pipeline/apis/v2beta1/pipelines?page_token=&page_size=10&sort_by=created_at%20desc&filter= HTTP/1.1" 503 URX via_upstream - "-" 0 152 61 58 "X.Y.0.149,X.Z.103.216" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:131.0) Gecko/20100101 Firefox/131.0" "43f03c5c-8bb4-421d-81c3-d2b4d91f190c" "test.com" "X.A.178.97:3000" outbound|3000||kfp-ui.kubeflow.svc.cluster.local X.A.178.87:34498 X.A.178.87:8080 X.A.103.216:59664 - -

And the following logs in the "ml-pipeline-ui" container:

2024-11-14T09:27:19.072Z [ml-pipeline-ui] (node:14) UnhandledPromiseRejectionWarning: FetchError: request to http://metadata/computeMetadata/v1/project/project-id failed, reason: getaddrinfo ENOTFOUND metadata
2024-11-14T09:27:19.072Z [ml-pipeline-ui]     at ClientRequest.<anonymous> (/server/node_modules/node-fetch/lib/index.js:1491:11)
2024-11-14T09:27:19.072Z [ml-pipeline-ui]     at ClientRequest.emit (events.js:400:28)
2024-11-14T09:27:19.072Z [ml-pipeline-ui]     at Socket.socketErrorListener (_http_client.js:475:9)
2024-11-14T09:27:19.072Z [ml-pipeline-ui]     at Socket.emit (events.js:400:28)
2024-11-14T09:27:19.072Z [ml-pipeline-ui]     at emitErrorNT (internal/streams/destroy.js:106:8)
2024-11-14T09:27:19.072Z [ml-pipeline-ui]     at emitErrorCloseNT (internal/streams/destroy.js:74:3)
2024-11-14T09:27:19.072Z [ml-pipeline-ui]     at processTicksAndRejections (internal/process/task_queues.js:82:21)
2024-11-14T09:27:19.072Z [ml-pipeline-ui] (node:14) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 10)

It seems similar to this one: https://github.com/kubeflow/pipelines/issues/11247

We tried setting the environment variable DISABLE_GKE_METADATA for the ml-pipeline-ui container and re-applied the StatefulSet, but the same error still occurs even though the variable appears to have been added.

We think this is because Pebble overwrites it: https://github.com/canonical/kfp-operators/blob/main/charms/kfp-ui/src/components/pebble_components.py#L61
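If the Pebble layer is indeed what defines the service's environment, the variable would probably need to be set in the layer rather than on the StatefulSet. The following is only a rough sketch of that idea using plain dictionaries shaped like a Pebble layer. The service name is taken from the log prefix above; the command, the `EXAMPLE_SETTING` variable, the `"true"` value, and the `with_env` helper are all hypothetical and are not taken from the charm's code. Whether this actually resolves the error is not confirmed here.

```python
import copy

# Hypothetical layer: the command and the existing variable are placeholders,
# not the values used by the kfp-ui charm.
BASE_LAYER = {
    "summary": "example layer",
    "services": {
        "ml-pipeline-ui": {
            "override": "replace",
            "command": "placeholder-command",
            "startup": "enabled",
            "environment": {"EXAMPLE_SETTING": "value"},
        }
    },
}


def with_env(layer, service, key, value):
    """Return a copy of `layer` with one environment variable set on `service`."""
    updated = copy.deepcopy(layer)
    env = updated["services"][service].setdefault("environment", {})
    env[key] = value
    return updated


# Assumed value "true"; the original layer is left untouched.
patched = with_env(BASE_LAYER, "ml-pipeline-ui", "DISABLE_GKE_METADATA", "true")
print(patched["services"]["ml-pipeline-ui"]["environment"])
```

In a real charm the change would have to go into the code that builds the layer, since anything edited by hand on the running workload may be replaced the next time the charm re-applies its plan.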

To Reproduce

I could not reproduce that after upgrading from 1.8 to 1.9.

Environment

Charmed Kubeflow 1.9, Juju 3.4.5

Relevant Log Output

2024-11-14T09:27:19.072Z [ml-pipeline-ui] (node:14) UnhandledPromiseRejectionWarning: FetchError: request to http://metadata/computeMetadata/v1/project/project-id failed, reason: getaddrinfo ENOTFOUND metadata
2024-11-14T09:27:19.072Z [ml-pipeline-ui]     at ClientRequest.<anonymous> (/server/node_modules/node-fetch/lib/index.js:1491:11)
2024-11-14T09:27:19.072Z [ml-pipeline-ui]     at ClientRequest.emit (events.js:400:28)
2024-11-14T09:27:19.072Z [ml-pipeline-ui]     at Socket.socketErrorListener (_http_client.js:475:9)
2024-11-14T09:27:19.072Z [ml-pipeline-ui]     at Socket.emit (events.js:400:28)
2024-11-14T09:27:19.072Z [ml-pipeline-ui]     at emitErrorNT (internal/streams/destroy.js:106:8)
2024-11-14T09:27:19.072Z [ml-pipeline-ui]     at emitErrorCloseNT (internal/streams/destroy.js:74:3)
2024-11-14T09:27:19.072Z [ml-pipeline-ui]     at processTicksAndRejections (internal/process/task_queues.js:82:21)
2024-11-14T09:27:19.072Z [ml-pipeline-ui] (node:14) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 10)

Additional Context

No response

syncronize-issues-to-jira[bot] commented 1 week ago

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6551.

This message was autogenerated

kimwnasptd commented 1 week ago

thanks for the issue @eleblebici!

We'll also cover this issue as part of https://github.com/canonical/kfp-operators/issues/582

eleblebici commented 1 week ago

thank you @kimwnasptd

We are also observing "connection refused" errors in the logs of the "apiserver" container of the kfp-api pod:

2024-11-18T09:25:00.356074040Z 2024-11-18T09:25:00.355Z [pebble] Check "kfp-api-up" failure 1 (threshold 3): Get "http://localhost:8888/apis/v1beta1/healthz": dial tcp [::1]:8888: connect: connection refused
2024-11-18T09:25:15.240695126Z 2024-11-18T09:25:15.240Z [pebble] GET /v1/notices?timeout=30s 30.000203043s 200
2024-11-18T09:25:38.144743678Z 2024-11-18T09:25:38.144Z [pebble] GET /v1/plan?format=yaml 352.29µs 200
2024-11-18T09:25:38.147001963Z 2024-11-18T09:25:38.146Z [pebble] POST /v1/layers 449.603µs 200
2024-11-18T09:25:38.161284312Z 2024-11-18T09:25:38.161Z [pebble] POST /v1/services 6.767914ms 202
2024-11-18T09:25:38.179053207Z 2024-11-18T09:25:38.178Z [pebble] GET /v1/changes/1130/wait?timeout=4.000s 16.999503ms 200
2024-11-18T09:25:38.301617525Z 2024-11-18T09:25:38.301Z [pebble] GET /v1/checks?names=kfp-api-up 74.329µs 200
2024-11-18T09:25:45.241515279Z 2024-11-18T09:25:45.241Z [pebble] GET /v1/notices?timeout=30s 30.0003448s 200
2024-11-18T09:26:15.242530092Z 2024-11-18T09:26:15.242Z [pebble] GET /v1/notices?timeout=30s 30.000440524s 200
2024-11-18T09:26:45.243597346Z 2024-11-18T09:26:45.243Z [pebble] GET /v1/notices?timeout=30s 30.000517511s 200
2024-11-18T09:27:15.244901542Z 2024-11-18T09:27:15.244Z [pebble] GET /v1/notices?timeout=30s 30.000961745s 200
2024-11-18T09:27:45.246629417Z 2024-11-18T09:27:45.246Z [pebble] GET /v1/notices?timeout=30s 30.001126268s 200
2024-11-18T09:28:15.247376020Z 2024-11-18T09:28:15.247Z [pebble] GET /v1/notices?timeout=30s 30.000312421s 200
2024-11-18T09:28:45.248478808Z 2024-11-18T09:28:45.248Z [pebble] GET /v1/notices?timeout=30s 30.000430172s 200
2024-11-18T09:29:15.249946977Z 2024-11-18T09:29:15.249Z [pebble] GET /v1/notices?timeout=30s 30.001073468s 200
2024-11-18T09:29:45.251188958Z 2024-11-18T09:29:45.250Z [pebble] GET /v1/notices?timeout=30s 30.000593107s 200
2024-11-18T09:29:56.725191419Z 2024-11-18T09:29:56.725Z [pebble] GET /v1/plan?format=yaml 467.559µs 200
2024-11-18T09:29:56.727727979Z 2024-11-18T09:29:56.727Z [pebble] POST /v1/layers 764.095µs 200
2024-11-18T09:29:56.743053371Z 2024-11-18T09:29:56.742Z [pebble] POST /v1/services 7.757464ms 202
2024-11-18T09:29:56.762801129Z 2024-11-18T09:29:56.762Z [pebble] GET /v1/changes/1131/wait?timeout=4.000s 18.692434ms 200
2024-11-18T09:29:56.881044630Z 2024-11-18T09:29:56.880Z [pebble] GET /v1/checks?names=kfp-api-up 62.061µs 200
2024-11-18T09:30:15.252629072Z 2024-11-18T09:30:15.252Z [pebble] GET /v1/notices?timeout=30s 30.000955823s 200
2024-11-18T09:30:45.253894521Z 2024-11-18T09:30:45.253Z [pebble] GET /v1/notices?timeout=30s 30.000901806s 200
2024-11-18T09:31:15.254995847Z 2024-11-18T09:31:15.254Z [pebble] GET /v1/notices?timeout=30s 30.000644512s 200
2024-11-18T09:31:45.255716666Z 2024-11-18T09:31:45.255Z [pebble] GET /v1/notices?timeout=30s 30.000355191s 200
2024-11-18T09:32:15.257141106Z 2024-11-18T09:32:15.256Z [pebble] GET /v1/notices?timeout=30s 30.000882644s 200
2024-11-18T09:32:45.257923226Z 2024-11-18T09:32:45.257Z [pebble] GET /v1/notices?timeout=30s 30.000353311s 200
2024-11-18T09:33:15.258904802Z 2024-11-18T09:33:15.258Z [pebble] GET /v1/notices?timeout=30s 30.000515506s 200
2024-11-18T09:33:45.260400775Z 2024-11-18T09:33:45.260Z [pebble] GET /v1/notices?timeout=30s 30.001048313s 200
2024-11-18T09:34:15.261257405Z 2024-11-18T09:34:15.260Z [pebble] GET /v1/notices?timeout=30s 30.000259529s 200
2024-11-18T09:34:27.525540762Z 2024-11-18T09:34:27.525Z [pebble] GET /v1/plan?format=yaml 464.379µs 200
2024-11-18T09:34:27.527813468Z 2024-11-18T09:34:27.527Z [pebble] POST /v1/layers 584.352µs 200
2024-11-18T09:34:27.542909628Z 2024-11-18T09:34:27.542Z [pebble] POST /v1/services 7.216855ms 202
2024-11-18T09:34:27.560369106Z 2024-11-18T09:34:27.560Z [pebble] GET /v1/changes/1132/wait?timeout=4.000s 16.63498ms 200
2024-11-18T09:34:27.697026695Z 2024-11-18T09:34:27.696Z [pebble] GET /v1/checks?names=kfp-api-up 78.81µs 200
2024-11-18T09:34:45.261731382Z 2024-11-18T09:34:45.261Z [pebble] GET /v1/notices?timeout=30s 30.000177102s 200
2024-11-18T09:35:15.262580201Z 2024-11-18T09:35:15.262Z [pebble] GET /v1/notices?timeout=30s 30.000331611s 200
2024-11-18T09:35:45.263374675Z 2024-11-18T09:35:45.263Z [pebble] GET /v1/notices?timeout=30s 30.000379449s 200
2024-11-18T09:36:15.264846259Z 2024-11-18T09:36:15.264Z [pebble] GET /v1/notices?timeout=30s 30.001166415s 200
2024-11-18T09:36:45.265606909Z 2024-11-18T09:36:45.265Z [pebble] GET /v1/notices?timeout=30s 30.000429633s 200
2024-11-18T09:37:15.267185778Z 2024-11-18T09:37:15.267Z [pebble] GET /v1/notices?timeout=30s 30.00115639s 200
2024-11-18T09:37:45.267898122Z 2024-11-18T09:37:45.267Z [pebble] GET /v1/notices?timeout=30s 30.000257938s 200
2024-11-18T09:38:15.269088643Z 2024-11-18T09:38:15.268Z [pebble] GET /v1/notices?timeout=30s 30.000861213s 200
2024-11-18T09:38:45.269962779Z 2024-11-18T09:38:45.269Z [pebble] GET /v1/notices?timeout=30s 30.000224698s 200
2024-11-18T09:39:15.271592348Z 2024-11-18T09:39:15.271Z [pebble] GET /v1/notices?timeout=30s 30.001144879s 200
2024-11-18T09:39:27.528805796Z 2024-11-18T09:39:27.528Z [pebble] Check "kfp-api-up" failure 1 (threshold 3): Get "http://localhost:8888/apis/v1beta1/healthz": dial tcp [::1]:8888: connect: connection refused
2024-11-18T09:39:38.143636451Z 2024-11-18T09:39:38.143Z [pebble] GET /v1/plan?format=yaml 1.070653ms 200
2024-11-18T09:39:38.145699505Z 2024-11-18T09:39:38.145Z [pebble] POST /v1/layers 476.134µs 200
2024-11-18T09:39:38.159889966Z 2024-11-18T09:39:38.159Z [pebble] POST /v1/services 6.188255ms 202
2024-11-18T09:39:38.179424891Z 2024-11-18T09:39:38.179Z [pebble] GET /v1/changes/1133/wait?timeout=4.000s 18.597678ms 200
2024-11-18T09:39:38.305824945Z 2024-11-18T09:39:38.305Z [pebble] GET /v1/checks?names=kfp-api-up 69.856µs 200
2024-11-18T09:39:45.273247169Z 2024-11-18T09:39:45.272Z [pebble] GET /v1/notices?timeout=30s 30.001087391s 200
2024-11-18T09:40:15.274639645Z 2024-11-18T09:40:15.274Z [pebble] GET /v1/notices?timeout=30s 30.001133004s 200
2024-11-18T09:40:45.276196762Z 2024-11-18T09:40:45.275Z [pebble] GET /v1/notices?timeout=30s 30.001056298s 200
2024-11-18T09:41:15.277080708Z 2024-11-18T09:41:15.276Z [pebble] GET /v1/notices?timeout=30s 30.000547183s 200
2024-11-18T09:41:45.278838353Z 2024-11-18T09:41:45.278Z [pebble] GET /v1/notices?timeout=30s 30.001200037s 200
2024-11-18T09:42:15.279938218Z 2024-11-18T09:42:15.279Z [pebble] GET /v1/notices?timeout=30s 30.000619892s 200
2024-11-18T09:42:45.281421971Z 2024-11-18T09:42:45.281Z [pebble] GET /v1/notices?timeout=30s 30.001074684s 200
2024-11-18T09:43:15.282181145Z 2024-11-18T09:43:15.281Z [pebble] GET /v1/notices?timeout=30s 30.000172484s 200
2024-11-18T09:43:45.282771011Z 2024-11-18T09:43:45.282Z [pebble] GET /v1/notices?timeout=30s 30.000404198s 200
2024-11-18T09:44:15.283705759Z 2024-11-18T09:44:15.283Z [pebble] GET /v1/notices?timeout=30s 30.000260526s 200
2024-11-18T09:44:38.147367537Z 2024-11-18T09:44:38.147Z [pebble] Check "kfp-api-up" failure 1 (threshold 3): Get "http://localhost:8888/apis/v1beta1/healthz": dial tcp [::1]:8888: connect: connection refused
2024-11-18T09:44:41.769449130Z 2024-11-18T09:44:41.769Z [pebble] GET /v1/plan?format=yaml 420.141µs 200
2024-11-18T09:44:41.771614762Z 2024-11-18T09:44:41.771Z [pebble] POST /v1/layers 510.247µs 200
2024-11-18T09:44:41.788080205Z 2024-11-18T09:44:41.787Z [pebble] POST /v1/services 9.071798ms 202
2024-11-18T09:44:41.805457471Z 2024-11-18T09:44:41.805Z [pebble] GET /v1/changes/1134/wait?timeout=4.000s 16.574574ms 200
2024-11-18T09:44:41.928889550Z 2024-11-18T09:44:41.928Z [pebble] GET /v1/checks?names=kfp-api-up 61.348µs 200
2024-11-18T09:44:45.284183195Z 2024-11-18T09:44:45.284Z [pebble] GET /v1/notices?timeout=30s 30.000180313s 200
2024-11-18T09:45:15.284833920Z 2024-11-18T09:45:15.284Z [pebble] GET /v1/notices?timeout=30s 30.000150228s 200
2024-11-18T09:45:45.285359496Z 2024-11-18T09:45:45.285Z [pebble] GET /v1/notices?timeout=30s 30.000170229s 200
2024-11-18T09:46:15.286796650Z 2024-11-18T09:46:15.286Z [pebble] GET /v1/notices?timeout=30s 30.00108555s 200
2024-11-18T09:46:45.287507954Z 2024-11-18T09:46:45.287Z [pebble] GET /v1/notices?timeout=30s 30.000253215s 200
2024-11-18T09:47:15.288570513Z 2024-11-18T09:47:15.288Z [pebble] GET /v1/notices?timeout=30s 30.000662851s 200
2024-11-18T09:47:45.289625116Z 2024-11-18T09:47:45.289Z [pebble] GET /v1/notices?timeout=30s 30.000570751s 200
2024-11-18T09:48:15.290508511Z 2024-11-18T09:48:15.290Z [pebble] GET /v1/notices?timeout=30s 30.000385218s 200
2024-11-18T09:48:45.291765852Z 2024-11-18T09:48:45.291Z [pebble] GET /v1/notices?timeout=30s 30.000666432s 200
2024-11-18T09:49:15.292941261Z 2024-11-18T09:49:15.292Z [pebble] GET /v1/notices?timeout=30s 30.000592343s 200
2024-11-18T09:49:41.774002260Z 2024-11-18T09:49:41.773Z [pebble] Check "kfp-api-up" failure 1 (threshold 3): Get "http://localhost:8888/apis/v1beta1/healthz": dial tcp [::1]:8888: connect: connection refused
2024-11-18T09:49:45.294052807Z 2024-11-18T09:49:45.293Z [pebble] GET /v1/notices?timeout=30s 30.000601274s 200
2024-11-18T09:49:59.259960010Z 2024-11-18T09:49:59.259Z [pebble] GET /v1/plan?format=yaml 1.460104ms 200
2024-11-18T09:49:59.262885474Z 2024-11-18T09:49:59.262Z [pebble] POST /v1/layers 732.729µs 200
2024-11-18T09:49:59.280996270Z 2024-11-18T09:49:59.280Z [pebble] POST /v1/services 8.803332ms 202
2024-11-18T09:49:59.300681853Z 2024-11-18T09:49:59.300Z [pebble] GET /v1/changes/1135/wait?timeout=4.000s 18.586788ms 200
2024-11-18T09:49:59.436543363Z 2024-11-18T09:49:59.436Z [pebble] GET /v1/checks?names=kfp-api-up 113.87µs 200
2024-11-18T09:50:15.295400823Z 2024-11-18T09:50:15.295Z [pebble] GET /v1/notices?timeout=30s 30.000664943s 200
2024-11-18T09:50:45.296678601Z 2024-11-18T09:50:45.296Z [pebble] GET /v1/notices?timeout=30s 30.000910485s 200
2024-11-18T09:51:15.297625988Z 2024-11-18T09:51:15.297Z [pebble] GET /v1/notices?timeout=30s 30.000503061s 200
2024-11-18T09:51:45.299288132Z 2024-11-18T09:51:45.299Z [pebble] GET /v1/notices?timeout=30s 30.001188866s 200
2024-11-18T09:52:15.300450234Z 2024-11-18T09:52:15.300Z [pebble] GET /v1/notices?timeout=30s 30.000836577s 200
2024-11-18T09:52:45.301719238Z 2024-11-18T09:52:45.301Z [pebble] GET /v1/notices?timeout=30s 30.000967983s 200
2024-11-18T09:53:15.302638899Z 2024-11-18T09:53:15.302Z [pebble] GET /v1/notices?timeout=30s 30.000478762s 200
2024-11-18T09:53:45.304049297Z 2024-11-18T09:53:45.303Z [pebble] GET /v1/notices?timeout=30s 30.000915182s 200
2024-11-18T09:54:15.304719690Z 2024-11-18T09:54:15.304Z [pebble] GET /v1/notices?timeout=30s 30.000336128s 200
2024-11-18T09:54:16.582884168Z 2024-11-18T09:54:16.582Z [pebble] GET /v1/plan?format=yaml 414.417µs 200
2024-11-18T09:54:16.584950560Z 2024-11-18T09:54:16.584Z [pebble] POST /v1/layers 467.407µs 200
2024-11-18T09:54:16.600830393Z 2024-11-18T09:54:16.600Z [pebble] POST /v1/services 7.842136ms 202
2024-11-18T09:54:16.619580417Z 2024-11-18T09:54:16.619Z [pebble] GET /v1/changes/1136/wait?timeout=4.000s 17.848238ms 200
2024-11-18T09:54:16.746705331Z 2024-11-18T09:54:16.746Z [pebble] GET /v1/checks?names=kfp-api-up 223.081µs 200
2024-11-18T09:54:45.306518562Z 2024-11-18T09:54:45.306Z [pebble] GET /v1/notices?timeout=30s 30.000789093s 200
2024-11-18T09:55:15.308235453Z 2024-11-18T09:55:15.308Z [pebble] GET /v1/notices?timeout=30s 30.001160195s 200
2024-11-18T09:55:45.309954095Z 2024-11-18T09:55:45.309Z [pebble] GET /v1/notices?timeout=30s 30.001143531s 200
2024-11-18T09:56:15.310674320Z 2024-11-18T09:56:15.310Z [pebble] GET /v1/notices?timeout=30s 30.000207561s 200
2024-11-18T09:56:45.312075195Z 2024-11-18T09:56:45.311Z [pebble] GET /v1/notices?timeout=30s 30.000969338s 200
2024-11-18T09:57:15.312797357Z 2024-11-18T09:57:15.312Z [pebble] GET /v1/notices?timeout=30s 30.000274586s 200
2024-11-18T09:57:45.314028896Z 2024-11-18T09:57:45.313Z [pebble] GET /v1/notices?timeout=30s 30.000842164s 200
2024-11-18T09:58:15.314759821Z 2024-11-18T09:58:15.314Z [pebble] GET /v1/notices?timeout=30s 30.000294789s 200
2024-11-18T09:58:45.316124941Z 2024-11-18T09:58:45.315Z [pebble] GET /v1/notices?timeout=30s 30.000478817s 200

I think the check runs every 5 minutes and sometimes returns the "connection refused" error.

I just wanted to share this, though I am not sure whether it is related.
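To check the "every 5 minutes" impression against the logs, the gaps between failure lines can be computed directly. Below is a minimal sketch that assumes the log format shown above (two timestamps, then `[pebble] Check "<name>" failure ...`). The helper names are made up for illustration, and the sample contains only a few lines copied from the excerpt.

```python
import re
from datetime import datetime

# Matches lines such as:
# <ts> <ts> [pebble] Check "kfp-api-up" failure 1 (threshold 3): ...
FAILURE = re.compile(r'^(\S+) \S+ \[pebble\] Check "([^"]+)" failure')


def parse_ts(stamp):
    """Parse the leading timestamp, trimming nanoseconds to microseconds."""
    head, frac = stamp.rstrip("Z").split(".")
    return datetime.strptime(f"{head}.{frac[:6]}", "%Y-%m-%dT%H:%M:%S.%f")


def failure_times(lines, check="kfp-api-up"):
    """Return the timestamps of all failure lines for the given check."""
    times = []
    for line in lines:
        match = FAILURE.match(line)
        if match and match.group(2) == check:
            times.append(parse_ts(match.group(1)))
    return times


def gaps_in_seconds(times):
    """Return the time between consecutive failures, in seconds."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


# A few lines copied from the excerpt above.
SAMPLE = [
    '2024-11-18T09:39:27.528805796Z 2024-11-18T09:39:27.528Z [pebble] Check "kfp-api-up" failure 1 (threshold 3): Get "http://localhost:8888/apis/v1beta1/healthz": dial tcp [::1]:8888: connect: connection refused',
    '2024-11-18T09:39:45.273247169Z 2024-11-18T09:39:45.272Z [pebble] GET /v1/notices?timeout=30s 30.001087391s 200',
    '2024-11-18T09:44:38.147367537Z 2024-11-18T09:44:38.147Z [pebble] Check "kfp-api-up" failure 1 (threshold 3): Get "http://localhost:8888/apis/v1beta1/healthz": dial tcp [::1]:8888: connect: connection refused',
    '2024-11-18T09:49:41.774002260Z 2024-11-18T09:49:41.773Z [pebble] Check "kfp-api-up" failure 1 (threshold 3): Get "http://localhost:8888/apis/v1beta1/healthz": dial tcp [::1]:8888: connect: connection refused',
]

gaps = gaps_in_seconds(failure_times(SAMPLE))
print([round(gap) for gap in gaps])
```

For these three sample lines the gaps come out slightly above 300 seconds. Running the same script over the full container log would show whether that spacing holds for every failure. Every failure line in the excerpt reads "failure 1 (threshold 3)", so the check's failure threshold is not reached in the lines shown.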