canonical / bundle-kubeflow

Charmed Kubeflow

Running the kfp_v2 integration test, an experiment shows `Cannot get MLMD objects from Metadata store.` #966

Closed · nishant-dash closed 3 months ago

nishant-dash commented 4 months ago

Bug Description

Running the kfp_v2 integration test from https://github.com/canonical/charmed-kubeflow-uats/tree/main/tests at commit [0], the experiment shows `Cannot get MLMD objects from Metadata store.`

[0] fe86b4e255c4c695376f70061a6a645301350d5a
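
For context, the UAT exercises KFP v2 through the SDK; below is a minimal sketch of that kind of run (hypothetical code, not the actual test notebook). Opening the resulting run in the KFP UI is where the `Cannot get MLMD objects from Metadata store.` banner appears.

# Minimal sketch of a KFP v2 run of the kind the UAT submits (hypothetical,
# not the actual test notebook). Assumes it runs from a Kubeflow notebook,
# where kfp.Client() can resolve the in-cluster endpoint and credentials.
from kfp import dsl
from kfp.client import Client


@dsl.component
def say_hello(name: str) -> str:
    return f"Hello, {name}!"


@dsl.pipeline(name="hello-pipeline")
def hello_pipeline(name: str = "world"):
    say_hello(name=name)


client = Client()  # in-cluster URL/credentials resolved automatically
run = client.create_run_from_pipeline_func(hello_pipeline, arguments={"name": "MLMD"})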

To Reproduce

Environment

Model     Controller  Cloud/Region          Version  SLA          Timestamp
kubeflow  manual      k8s-cloud/<REDACTED>  3.4.3    unsupported  15:37:17Z

SAAS        Status  Store  URL
grafana     active  local  admin/cos.grafana
prometheus  active  local  admin/cos.prometheus

App                        Version                  Status  Scale  Charm                    Channel          Rev  Address       Exposed  Message
admission-webhook                                   active      1  admission-webhook        1.8/stable       301  a.b.c.d  no       
argo-controller                                     active      1  argo-controller          3.3.10/stable    424  a.b.c.d   no       
dex-auth                                            active      1  dex-auth                 2.36/stable      422  a.b.c.d    no       
envoy                      res:oci-image@cc06b3e    active      1  envoy                    2.0/stable       194  a.b.c.d    no       
grafana-agent-k8s          0.40.4                   active      1  grafana-agent-k8s        latest/edge       80  a.b.c.d    no       logging-consumer: off
istio-ingressgateway                                active      1  istio-gateway            1.17/stable     1000  a.b.c.d  no       
istio-pilot                                         active      1  istio-pilot              1.17/stable     1011  a.b.c.d  no       
jupyter-controller                                  active      1  jupyter-controller       1.8/stable       849  a.b.c.d   no       
jupyter-ui                                          active      1  jupyter-ui               1.8/stable       858  a.b.c.d    no       
katib-controller           res:oci-image@31ccd70    active      1  katib-controller         0.16/stable      576  a.b.c.d  no       
katib-db                   8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable       153  a.b.c.d   no       
katib-db-manager                                    active      1  katib-db-manager         0.16/stable      539  a.b.c.d  no       
katib-ui                                            active      1  katib-ui                 0.16/stable      422  a.b.c.d   no       
kfp-api                                             active      1  kfp-api                  2.0/stable      1283  a.b.c.d  no       
kfp-db                     8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable       153  a.b.c.d   no       
kfp-metadata-writer                                 active      1  kfp-metadata-writer      2.0/stable       334  a.b.c.d  no       
kfp-persistence                                     active      1  kfp-persistence          2.0/stable      1291  a.b.c.d  no       
kfp-profile-controller                              active      1  kfp-profile-controller   2.0/stable      1315  a.b.c.d   no       
kfp-schedwf                                         active      1  kfp-schedwf              2.0/stable      1302  a.b.c.d    no       
kfp-ui                                              active      1  kfp-ui                   2.0/stable      1285  a.b.c.d   no       
kfp-viewer                                          active      1  kfp-viewer               2.0/stable      1317  a.b.c.d   no       
kfp-viz                                             active      1  kfp-viz                  2.0/stable      1235  a.b.c.d    no       
knative-eventing                                    active      1  knative-eventing         1.10/stable      353  a.b.c.d   no       
knative-operator                                    active      1  knative-operator         1.10/stable      328  a.b.c.d   no       
knative-serving                                     active      1  knative-serving          1.10/stable      409  a.b.c.d    no       
kserve-controller                                   active      1  kserve-controller        0.11/stable      523  a.b.c.d  no       
kubeflow-dashboard                                  active      1  kubeflow-dashboard       1.8/stable       582  a.b.c.d   no       
kubeflow-profiles                                   active      1  kubeflow-profiles        1.8/stable       355  a.b.c.d    no       
kubeflow-roles                                      active      1  kubeflow-roles           1.8/stable       187  a.b.c.d   no       
kubeflow-volumes           res:oci-image@2261827    active      1  kubeflow-volumes         1.8/stable       260  a.b.c.d    no       
metacontroller-operator                             active      1  metacontroller-operator  3.0/stable       252  a.b.c.d   no       
minio                      res:oci-image@1755999    active      1  minio                    ckf-1.8/stable   278  a.b.c.d  no       
mlflow-mysql               8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable       153  a.b.c.d    no       
mlflow-server                                       active      1  mlflow-server            2.1/stable       466  a.b.c.d    no       
mlmd                       res:oci-image@44abc5d    active      1  mlmd                     1.14/stable      127  a.b.c.d   no       
oidc-gatekeeper                                     active      1  oidc-gatekeeper          ckf-1.8/stable   350  a.b.c.d    no       
pvcviewer-operator                                  active      1  pvcviewer-operator       1.8/stable        30  a.b.c.d  no       
resource-dispatcher                                 active      1  resource-dispatcher      1.0/stable        93  a.b.c.d   no       
seldon-controller-manager                           active      1  seldon-core              1.17/stable      664  a.b.c.d   no       
tensorboard-controller                              active      1  tensorboard-controller   1.8/stable       257  a.b.c.d   no       
tensorboards-web-app                                active      1  tensorboards-web-app     1.8/stable       245  a.b.c.d   no       
training-operator                                   active      1  training-operator        1.7/stable       347  a.b.c.d   no  

Relevant Log Output

$ kubectl logs -n kubeflow  mlmd-0
Defaulted container "mlmd" out of: mlmd, juju-pod-init (init)
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0701 08:27:19.354198     1 metadata_store_server_main.cc:577] Server listening on 0.0.0.0:8080
W0703 15:17:20.894055    10 metadata_store_service_impl.cc:239] GetContextType failed: No type found for query, name: `system.Pipeline`, version: `nullopt`
W0703 15:17:20.922753    10 metadata_store_service_impl.cc:239] GetContextType failed: No type found for query, name: `system.PipelineRun`, version: `nullopt`

Additional Context

No response

syncronize-issues-to-jira[bot] commented 4 months ago

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5956.

This message was autogenerated

kimwnasptd commented 4 months ago

This should be a transient error shown by MLMD: https://github.com/kubeflow/pipelines/issues/8733#issuecomment-1749475865

@nishant-dash could you confirm if the run completed successfully in the end?

kimwnasptd commented 4 months ago

Just saw that it finishes, so I will go ahead and close this issue.

kimwnasptd commented 4 months ago

Re-opening this issue. We managed to reproduce it while bulk testing the Azure one-click deployment.

The message in our case was not transient. The red error box stayed there and the UI never got updated with the pipeline run's progress. After looking at the browser's dev tools we saw that an istio-proxy was responding with 503 errors to the requests from the browser to MLMD.

A short-term solution was to delete the envoy-xxxx-yyyy pod, after which the UI's requests succeeded.
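
For illustration only, the workaround above expressed with the kubernetes Python client (equivalent to deleting the pod with kubectl); matching the pod by the envoy- name prefix is an assumption for the sketch.

# Illustrative sketch of the short-term workaround: delete the envoy pod so
# the Deployment recreates it. Not part of the charm; equivalent to
# `kubectl delete pod -n kubeflow envoy-xxxx-yyyy`.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
for pod in core.list_namespaced_pod("kubeflow").items:
    # Assumption: the envoy workload pod name starts with "envoy-"
    if pod.metadata.name.startswith("envoy-"):
        core.delete_namespaced_pod(pod.metadata.name, "kubeflow")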

orfeas-k commented 4 months ago

Reproducibility

Trying to reproduce it on MicroK8s 1.29 with Juju 3.4.4, I also see the following logs in the mlmd pod:

Defaulted container "mlmd" out of: mlmd, juju-pod-init (init)
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0717 10:22:12.936376     1 metadata_store_server_main.cc:577] Server listening on 0.0.0.0:8080
W0717 12:31:20.106307    11 metadata_store_service_impl.cc:239] GetContextType failed: No type found for query, name: `system.Pipeline`, version: `nullopt`
W0717 12:31:20.158107    11 metadata_store_service_impl.cc:239] GetContextType failed: No type found for query, name: `system.PipelineRun`, version: `nullopt`

but no error in the KFP UI nor in the Network tab of the browser's dev tools.

Here's the juju status output, which from a revision perspective is identical to the reported one, apart from kserve-controller (mine is at revision 573) and kfp-schedwf (1466).

I'll try again on the Azure one-click deployments.

orfeas-k commented 4 months ago

Logs

The first finding is that this happens due to a race taking place in the envoy workload pod. (The logs below are from the pod where the bug is reproduced.)

ubuntu@dwzj5tbcjnboiBootstrapVm:~$ kubectl logs -n kubeflow envoy-6c7b4c8bc8-9qxhd -c envoy
[2024-07-22 08:09:56.645][1][info][main] [source/server/server.cc:249] initializing epoch 0 (hot restart version=11.104)
[2024-07-22 08:09:56.645][1][info][main] [source/server/server.cc:251] statically linked extensions:
[2024-07-22 08:09:56.645][1][info][main] [source/server/server.cc:253]   access_loggers: envoy.file_access_log,envoy.http_grpc_access_log,envoy.tcp_grpc_access_log
[2024-07-22 08:09:56.645][1][info][main] [source/server/server.cc:256]   filters.http: envoy.buffer,envoy.cors,envoy.csrf,envoy.ext_authz,envoy.fault,envoy.filters.http.adaptive_concurrency,envoy.filters.http.dynamic_forward_proxy,envoy.filters.http.grpc_http1_reverse_bridge,envoy.filters.http.grpc_stats,envoy.filters.http.header_to_metadata,envoy.filters.http.jwt_authn,envoy.filters.http.original_src,envoy.filters.http.rbac,envoy.filters.http.tap,envoy.grpc_http1_bridge,envoy.grpc_json_transcoder,envoy.grpc_web,envoy.gzip,envoy.health_check,envoy.http_dynamo_filter,envoy.ip_tagging,envoy.lua,envoy.rate_limit,envoy.router,envoy.squash
[2024-07-22 08:09:56.645][1][info][main] [source/server/server.cc:259]   filters.listener: envoy.listener.http_inspector,envoy.listener.original_dst,envoy.listener.original_src,envoy.listener.proxy_protocol,envoy.listener.tls_inspector
[2024-07-22 08:09:56.645][1][info][main] [source/server/server.cc:262]   filters.network: envoy.client_ssl_auth,envoy.echo,envoy.ext_authz,envoy.filters.network.dubbo_proxy,envoy.filters.network.mysql_proxy,envoy.filters.network.rbac,envoy.filters.network.sni_cluster,envoy.filters.network.thrift_proxy,envoy.filters.network.zookeeper_proxy,envoy.http_connection_manager,envoy.mongo_proxy,envoy.ratelimit,envoy.redis_proxy,envoy.tcp_proxy
[2024-07-22 08:09:56.645][1][info][main] [source/server/server.cc:264]   stat_sinks: envoy.dog_statsd,envoy.metrics_service,envoy.stat_sinks.hystrix,envoy.statsd
[2024-07-22 08:09:56.645][1][info][main] [source/server/server.cc:266]   tracers: envoy.dynamic.ot,envoy.lightstep,envoy.tracers.datadog,envoy.tracers.opencensus,envoy.tracers.xray,envoy.zipkin
[2024-07-22 08:09:56.645][1][info][main] [source/server/server.cc:269]   transport_sockets.downstream: envoy.transport_sockets.alts,envoy.transport_sockets.raw_buffer,envoy.transport_sockets.tap,envoy.transport_sockets.tls,raw_buffer,tls
[2024-07-22 08:09:56.645][1][info][main] [source/server/server.cc:272]   transport_sockets.upstream: envoy.transport_sockets.alts,envoy.transport_sockets.raw_buffer,envoy.transport_sockets.tap,envoy.transport_sockets.tls,raw_buffer,tls
[2024-07-22 08:09:56.645][1][info][main] [source/server/server.cc:278] buffer implementation: new
[2024-07-22 08:09:56.648][1][info][main] [source/server/server.cc:344] admin address: 0.0.0.0:9901
[2024-07-22 08:09:56.649][1][info][main] [source/server/server.cc:458] runtime: layers:
  - name: base
    static_layer:
      {}
  - name: admin
    admin_layer:
      {}
[2024-07-22 08:09:56.649][1][info][config] [source/server/configuration_impl.cc:62] loading 0 static secret(s)
[2024-07-22 08:09:56.649][1][info][config] [source/server/configuration_impl.cc:68] loading 0 cluster(s)
[2024-07-22 08:09:56.649][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:161] cm init: all clusters initialized
[2024-07-22 08:09:56.649][1][info][config] [source/server/configuration_impl.cc:72] loading 0 listener(s)
[2024-07-22 08:09:56.649][1][info][config] [source/server/configuration_impl.cc:97] loading tracing configuration
[2024-07-22 08:09:56.649][1][info][config] [source/server/configuration_impl.cc:117] loading stats sink configuration
[2024-07-22 08:09:56.649][1][info][main] [source/server/server.cc:528] all clusters initialized. initializing init manager
[2024-07-22 08:09:56.649][1][info][config] [source/server/listener_manager_impl.cc:578] all dependencies initialized. starting workers
[2024-07-22 08:09:56.650][1][info][main] [source/server/server.cc:549] starting main dispatch loop
[2024-07-22 08:24:56.653][1][info][main] [source/server/drain_manager_impl.cc:63] shutting down parent after drain

Compared to logs from a healthy envoy pod

╰─$ kl -n kubeflow envoy-6fc6d64c88-zlrs9 -c envoy
[2024-07-22 12:43:25.957][1][info][main] [source/server/server.cc:249] initializing epoch 0 (hot restart version=11.104)
[2024-07-22 12:43:25.959][1][info][main] [source/server/server.cc:251] statically linked extensions:
[2024-07-22 12:43:25.959][1][info][main] [source/server/server.cc:253]   access_loggers: envoy.file_access_log,envoy.http_grpc_access_log,envoy.tcp_grpc_access_log
[2024-07-22 12:43:25.959][1][info][main] [source/server/server.cc:256]   filters.http: envoy.buffer,envoy.cors,envoy.csrf,envoy.ext_authz,envoy.fault,envoy.filters.http.adaptive_concurrency,envoy.filters.http.dynamic_forward_proxy,envoy.filters.http.grpc_http1_reverse_bridge,envoy.filters.http.grpc_stats,envoy.filters.http.header_to_metadata,envoy.filters.http.jwt_authn,envoy.filters.http.original_src,envoy.filters.http.rbac,envoy.filters.http.tap,envoy.grpc_http1_bridge,envoy.grpc_json_transcoder,envoy.grpc_web,envoy.gzip,envoy.health_check,envoy.http_dynamo_filter,envoy.ip_tagging,envoy.lua,envoy.rate_limit,envoy.router,envoy.squash
[2024-07-22 12:43:25.959][1][info][main] [source/server/server.cc:259]   filters.listener: envoy.listener.http_inspector,envoy.listener.original_dst,envoy.listener.original_src,envoy.listener.proxy_protocol,envoy.listener.tls_inspector
[2024-07-22 12:43:25.959][1][info][main] [source/server/server.cc:262]   filters.network: envoy.client_ssl_auth,envoy.echo,envoy.ext_authz,envoy.filters.network.dubbo_proxy,envoy.filters.network.mysql_proxy,envoy.filters.network.rbac,envoy.filters.network.sni_cluster,envoy.filters.network.thrift_proxy,envoy.filters.network.zookeeper_proxy,envoy.http_connection_manager,envoy.mongo_proxy,envoy.ratelimit,envoy.redis_proxy,envoy.tcp_proxy
[2024-07-22 12:43:25.959][1][info][main] [source/server/server.cc:264]   stat_sinks: envoy.dog_statsd,envoy.metrics_service,envoy.stat_sinks.hystrix,envoy.statsd
[2024-07-22 12:43:25.959][1][info][main] [source/server/server.cc:266]   tracers: envoy.dynamic.ot,envoy.lightstep,envoy.tracers.datadog,envoy.tracers.opencensus,envoy.tracers.xray,envoy.zipkin
[2024-07-22 12:43:25.959][1][info][main] [source/server/server.cc:269]   transport_sockets.downstream: envoy.transport_sockets.alts,envoy.transport_sockets.raw_buffer,envoy.transport_sockets.tap,envoy.transport_sockets.tls,raw_buffer,tls
[2024-07-22 12:43:25.959][1][info][main] [source/server/server.cc:272]   transport_sockets.upstream: envoy.transport_sockets.alts,envoy.transport_sockets.raw_buffer,envoy.transport_sockets.tap,envoy.transport_sockets.tls,raw_buffer,tls
[2024-07-22 12:43:25.959][1][info][main] [source/server/server.cc:278] buffer implementation: new
[2024-07-22 12:43:25.963][1][warning][misc] [source/common/protobuf/utility.cc:282] Using deprecated option 'envoy.api.v2.listener.Filter.config' from file listener.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.
[2024-07-22 12:43:25.964][1][info][main] [source/server/server.cc:344] admin address: 0.0.0.0:9901
[2024-07-22 12:43:25.968][1][info][main] [source/server/server.cc:458] runtime: layers:
  - name: base
    static_layer:
      {}
  - name: admin
    admin_layer:
      {}
[2024-07-22 12:43:25.968][1][info][config] [source/server/configuration_impl.cc:62] loading 0 static secret(s)
[2024-07-22 12:43:25.968][1][info][config] [source/server/configuration_impl.cc:68] loading 1 cluster(s)
[2024-07-22 12:43:25.974][1][info][config] [source/server/configuration_impl.cc:72] loading 1 listener(s)
[2024-07-22 12:43:25.976][1][warning][misc] [source/common/protobuf/utility.cc:282] Using deprecated option 'envoy.api.v2.route.CorsPolicy.allow_origin' from file route.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.
[2024-07-22 12:43:25.977][1][info][config] [source/server/configuration_impl.cc:97] loading tracing configuration
[2024-07-22 12:43:25.977][1][info][config] [source/server/configuration_impl.cc:117] loading stats sink configuration
[2024-07-22 12:43:25.977][1][info][main] [source/server/server.cc:549] starting main dispatch loop
[2024-07-22 12:43:25.978][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:161] cm init: all clusters initialized
[2024-07-22 12:43:25.978][1][info][main] [source/server/server.cc:528] all clusters initialized. initializing init manager
[2024-07-22 12:43:25.978][1][info][config] [source/server/listener_manager_impl.cc:578] all dependencies initialized. starting workers

Differences

We see that when envoy works as expected, the following lines run:

[source/common/protobuf/utility.cc:282] Using deprecated option 'envoy.api.v2.listener.Filter.config' from file listener.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.
...
[2024-07-22 12:43:25.976][1][warning][misc] [source/common/protobuf/utility.cc:282] Using deprecated option 'envoy.api.v2.route.CorsPolicy.allow_origin' from file route.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.

and the `all clusters initialized` message appears towards the end of the logs (compared to the unhealthy pod, where it appears earlier). We also notice that the configuration code loads resources (before the `all clusters initialized` message):

[2024-07-22 12:43:25.968][1][info][config] [source/server/configuration_impl.cc:62] loading 0 static secret(s)
[2024-07-22 12:43:25.968][1][info][config] [source/server/configuration_impl.cc:68] loading 1 cluster(s)
[2024-07-22 12:43:25.974][1][info][config] [source/server/configuration_impl.cc:72] loading 1 listener(s)

compared to when it's not working

[2024-07-22 08:09:56.649][1][info][config] [source/server/configuration_impl.cc:62] loading 0 static secret(s)
[2024-07-22 08:09:56.649][1][info][config] [source/server/configuration_impl.cc:68] loading 0 cluster(s)
[2024-07-22 08:09:56.649][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:161] cm init: all clusters initialized
[2024-07-22 08:09:56.649][1][info][config] [source/server/configuration_impl.cc:72] loading 0 listener(s)

and the unhealthy pod results in

[2024-07-22 08:09:56.650][1][info][main] [source/server/server.cc:549] starting main dispatch loop
[2024-07-22 08:24:56.653][1][info][main] [source/server/drain_manager_impl.cc:63] shutting down parent after drain

EDIT: the `shutting down parent after drain` message is also observed on healthy envoy pods.

JSON configuration

Their envoy.json files are identical:

{"admin": {"access_log_path": "/tmp/admin_access.log", "address": {"socket_address": {"address": "0.0.0.0", "port_value": "9901"}}}, "static_resources": {"listeners": [{"name": "listener_0", "address": {"socket_address": {"address": "0.0.0.0", "port_value": 9090}}, "filter_chains": [{"filters": [{"name": "envoy.http_connection_manager", "config": {"stat_prefix": "ingress_http", "route_config": {"name": "local_route", "virtual_hosts": [{"name": "local_service", "domains": ["*"], "routes": [{"match": {"prefix": "/"}, "route": {"cluster": "metadata-grpc-service", "max_grpc_timeout": "60.000s"}}], "cors": {"allow_origin": ["*"], "allow_methods": "GET,PUT,DELETE,POST,OPTIONS", "allow_headers": "cache-control,content-transfer-encoding,content-type,grpc-timeout,keep-alive,user-agent,x-accept-content-transfer-encoding,x-accept-response-streaming,x-grpc-web,x-user-agent,custom-header-1", "expose_headers": "grpc-status,grpc-message,custom-header-1", "max_age": "1728000"}}]}, "http_filters": [{"name": "envoy.grpc_web"}, {"name": "envoy.cors"}, {"name": "envoy.router"}]}}]}]}], "clusters": [{"name": "metadata-grpc-service", "type": "LOGICAL_DNS", "connect_timeout": "30.000s", "hosts": [{"socket_address": {"address": "metadata-grpc-service", "port_value": 8080}}], "http2_protocol_options": {}}]}}
orfeas-k commented 3 months ago

Curl

Trying to curl envoy's endpoint, I see that the connection is refused:

ubuntu@dwzj5tbcjnboiBootstrapVm:~$ kubectl exec envoy-operator-0 -n kubeflow -c juju-operator -- curl envoy.kubeflow.svc.cluster.local:9090 -H 'Content-Type: application/grpc-web-text' -I 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (7) Failed to connect to envoy.kubeflow.svc.cluster.local port 9090: Connection refused
command terminated with exit code 7

whereas it is expected to return a 200 when envoy-operator works, as we check in the integration tests:

╰─$ kubectl exec envoy-operator-0 -n kubeflow -c juju-operator -- curl envoy.kubeflow.svc.cluster.local:9090 -H 'Content-Type: application/grpc-web-text' -I
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
HTTP/1.1 200 OK
content-type: application/grpc-web+proto
grpc-status: 14
date: Tue, 30 Jul 2024 08:40:50 GMT
server: envoy
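
For reference, a minimal sketch of that kind of check in Python (illustrative only, not the actual integration-test code):

# Illustrative sketch of the health check performed against envoy's
# grpc-web endpoint, mirroring the curl command above.
import requests

URL = "http://envoy.kubeflow.svc.cluster.local:9090"

resp = requests.get(URL, headers={"Content-Type": "application/grpc-web-text"})
# A functional envoy answers with HTTP 200 even though the gRPC call behind
# it fails (grpc-status: 14, as seen in the output above).
assert resp.status_code == 200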
orfeas-k commented 3 months ago

Workaround

We didn't figure out what circumstances trigger this. However, restarting the container (or deleting the pod) always resulted in a functional workload container. That's why, since the 2.0 version is a PodSpec charm, we'll be adding an httpGet livenessProbe, as described in the podSpec YAML and Kubernetes docs. The probe will hit the same endpoint we check during integration tests, in order to ensure its functionality. If the endpoint does not respond with 200, the livenessProbe will restart the container.
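
A rough sketch of what such a probe could look like in the charm's podspec v3 container entry, as it would be passed to self.model.pod.set_spec(...); the path, port, headers and timings here are illustrative assumptions, not the merged change (see the PR linked in the next comment).

# Illustrative podspec v3 container entry with an httpGet livenessProbe.
# Field values are assumptions for the sketch, not the actual fix.
container_spec = {
    "name": "envoy",
    "imageDetails": {"imagePath": "..."},  # placeholder; the charm fills this in
    "ports": [{"name": "grpc-web", "containerPort": 9090}],
    "kubernetes": {
        "livenessProbe": {
            "httpGet": {
                "path": "/",
                "port": 9090,
                # Mirrors the header used by the curl check in the thread above
                "httpHeaders": [
                    {"name": "Content-Type", "value": "application/grpc-web-text"}
                ],
            },
            "initialDelaySeconds": 30,
            "periodSeconds": 30,
            "failureThreshold": 3,
        }
    },
}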

orfeas-k commented 3 months ago

The fix has been merged in https://github.com/canonical/envoy-operator/pull/124 and promoted to 2.0/stable.