nishant-dash closed this issue 3 months ago
Thank you for reporting your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5956.
This message was autogenerated
This should be a transient error shown by MLMD https://github.com/kubeflow/pipelines/issues/8733#issuecomment-1749475865
@nishant-dash could you confirm if the run completed successfully in the end?
Just saw that it finished, so I'll go ahead and close this issue.
Re-opening this issue. We managed to reproduce it while bulk testing the Azure one-click deployment.
The message in our case was not transient: the red box stayed there and the UI never updated the progress of a pipeline run. Looking at the browser's dev tools, we saw that an istio-proxy was responding with 503 errors to the requests from the browser to MLMD.
A short-term workaround was to delete the envoy-xxxx-yyyy pod, after which the UI's requests succeeded.
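For reference, a sketch of that workaround (the pod's hash suffix varies per deployment):

```shell
# Find the envoy workload pod
kubectl get pods -n kubeflow | grep envoy

# Delete it; the Deployment recreates the pod, and in our case the fresh
# container served the UI's requests to MLMD again
kubectl delete pod -n kubeflow envoy-xxxx-yyyy
```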
Trying to reproduce it on MicroK8s 1.29 with Juju 3.4.4, I also see the following logs in the mlmd pod:
Defaulted container "mlmd" out of: mlmd, juju-pod-init (init)
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0717 10:22:12.936376 1 metadata_store_server_main.cc:577] Server listening on 0.0.0.0:8080
W0717 12:31:20.106307 11 metadata_store_service_impl.cc:239] GetContextType failed: No type found for query, name: `system.Pipeline`, version: `nullopt`
W0717 12:31:20.158107 11 metadata_store_service_impl.cc:239] GetContextType failed: No type found for query, name: `system.PipelineRun`, version: `nullopt`
but there is no error in the KFP UI nor in the Network tab of the browser's dev tools.
Here's the `juju status` output, which from a revision perspective is identical to the reported one, apart from kserve-controller (mine has revision 573) and kfp-schedwf (revision 1466).
I'll try again in Azure one-click deployments.
The first finding is that this happens due to a race in the envoy workload pod. (The logs below are from a pod where the bug is reproduced.)
ubuntu@dwzj5tbcjnboiBootstrapVm:~$ kubectl logs -n kubeflow envoy-6c7b4c8bc8-9qxhd -c envoy
[2024-07-22 08:09:56.645][1][info][main] [source/server/server.cc:249] initializing epoch 0 (hot restart version=11.104)
[2024-07-22 08:09:56.645][1][info][main] [source/server/server.cc:251] statically linked extensions:
[2024-07-22 08:09:56.645][1][info][main] [source/server/server.cc:253] access_loggers: envoy.file_access_log,envoy.http_grpc_access_log,envoy.tcp_grpc_access_log
[2024-07-22 08:09:56.645][1][info][main] [source/server/server.cc:256] filters.http: envoy.buffer,envoy.cors,envoy.csrf,envoy.ext_authz,envoy.fault,envoy.filters.http.adaptive_concurrency,envoy.filters.http.dynamic_forward_proxy,envoy.filters.http.grpc_http1_reverse_bridge,envoy.filters.http.grpc_stats,envoy.filters.http.header_to_metadata,envoy.filters.http.jwt_authn,envoy.filters.http.original_src,envoy.filters.http.rbac,envoy.filters.http.tap,envoy.grpc_http1_bridge,envoy.grpc_json_transcoder,envoy.grpc_web,envoy.gzip,envoy.health_check,envoy.http_dynamo_filter,envoy.ip_tagging,envoy.lua,envoy.rate_limit,envoy.router,envoy.squash
[2024-07-22 08:09:56.645][1][info][main] [source/server/server.cc:259] filters.listener: envoy.listener.http_inspector,envoy.listener.original_dst,envoy.listener.original_src,envoy.listener.proxy_protocol,envoy.listener.tls_inspector
[2024-07-22 08:09:56.645][1][info][main] [source/server/server.cc:262] filters.network: envoy.client_ssl_auth,envoy.echo,envoy.ext_authz,envoy.filters.network.dubbo_proxy,envoy.filters.network.mysql_proxy,envoy.filters.network.rbac,envoy.filters.network.sni_cluster,envoy.filters.network.thrift_proxy,envoy.filters.network.zookeeper_proxy,envoy.http_connection_manager,envoy.mongo_proxy,envoy.ratelimit,envoy.redis_proxy,envoy.tcp_proxy
[2024-07-22 08:09:56.645][1][info][main] [source/server/server.cc:264] stat_sinks: envoy.dog_statsd,envoy.metrics_service,envoy.stat_sinks.hystrix,envoy.statsd
[2024-07-22 08:09:56.645][1][info][main] [source/server/server.cc:266] tracers: envoy.dynamic.ot,envoy.lightstep,envoy.tracers.datadog,envoy.tracers.opencensus,envoy.tracers.xray,envoy.zipkin
[2024-07-22 08:09:56.645][1][info][main] [source/server/server.cc:269] transport_sockets.downstream: envoy.transport_sockets.alts,envoy.transport_sockets.raw_buffer,envoy.transport_sockets.tap,envoy.transport_sockets.tls,raw_buffer,tls
[2024-07-22 08:09:56.645][1][info][main] [source/server/server.cc:272] transport_sockets.upstream: envoy.transport_sockets.alts,envoy.transport_sockets.raw_buffer,envoy.transport_sockets.tap,envoy.transport_sockets.tls,raw_buffer,tls
[2024-07-22 08:09:56.645][1][info][main] [source/server/server.cc:278] buffer implementation: new
[2024-07-22 08:09:56.648][1][info][main] [source/server/server.cc:344] admin address: 0.0.0.0:9901
[2024-07-22 08:09:56.649][1][info][main] [source/server/server.cc:458] runtime: layers:
- name: base
static_layer:
{}
- name: admin
admin_layer:
{}
[2024-07-22 08:09:56.649][1][info][config] [source/server/configuration_impl.cc:62] loading 0 static secret(s)
[2024-07-22 08:09:56.649][1][info][config] [source/server/configuration_impl.cc:68] loading 0 cluster(s)
[2024-07-22 08:09:56.649][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:161] cm init: all clusters initialized
[2024-07-22 08:09:56.649][1][info][config] [source/server/configuration_impl.cc:72] loading 0 listener(s)
[2024-07-22 08:09:56.649][1][info][config] [source/server/configuration_impl.cc:97] loading tracing configuration
[2024-07-22 08:09:56.649][1][info][config] [source/server/configuration_impl.cc:117] loading stats sink configuration
[2024-07-22 08:09:56.649][1][info][main] [source/server/server.cc:528] all clusters initialized. initializing init manager
[2024-07-22 08:09:56.649][1][info][config] [source/server/listener_manager_impl.cc:578] all dependencies initialized. starting workers
[2024-07-22 08:09:56.650][1][info][main] [source/server/server.cc:549] starting main dispatch loop
[2024-07-22 08:24:56.653][1][info][main] [source/server/drain_manager_impl.cc:63] shutting down parent after drain
Compare with the logs from a healthy envoy pod:
╰─$ kl -n kubeflow envoy-6fc6d64c88-zlrs9 -c envoy
[2024-07-22 12:43:25.957][1][info][main] [source/server/server.cc:249] initializing epoch 0 (hot restart version=11.104)
[2024-07-22 12:43:25.959][1][info][main] [source/server/server.cc:251] statically linked extensions:
[2024-07-22 12:43:25.959][1][info][main] [source/server/server.cc:253] access_loggers: envoy.file_access_log,envoy.http_grpc_access_log,envoy.tcp_grpc_access_log
[2024-07-22 12:43:25.959][1][info][main] [source/server/server.cc:256] filters.http: envoy.buffer,envoy.cors,envoy.csrf,envoy.ext_authz,envoy.fault,envoy.filters.http.adaptive_concurrency,envoy.filters.http.dynamic_forward_proxy,envoy.filters.http.grpc_http1_reverse_bridge,envoy.filters.http.grpc_stats,envoy.filters.http.header_to_metadata,envoy.filters.http.jwt_authn,envoy.filters.http.original_src,envoy.filters.http.rbac,envoy.filters.http.tap,envoy.grpc_http1_bridge,envoy.grpc_json_transcoder,envoy.grpc_web,envoy.gzip,envoy.health_check,envoy.http_dynamo_filter,envoy.ip_tagging,envoy.lua,envoy.rate_limit,envoy.router,envoy.squash
[2024-07-22 12:43:25.959][1][info][main] [source/server/server.cc:259] filters.listener: envoy.listener.http_inspector,envoy.listener.original_dst,envoy.listener.original_src,envoy.listener.proxy_protocol,envoy.listener.tls_inspector
[2024-07-22 12:43:25.959][1][info][main] [source/server/server.cc:262] filters.network: envoy.client_ssl_auth,envoy.echo,envoy.ext_authz,envoy.filters.network.dubbo_proxy,envoy.filters.network.mysql_proxy,envoy.filters.network.rbac,envoy.filters.network.sni_cluster,envoy.filters.network.thrift_proxy,envoy.filters.network.zookeeper_proxy,envoy.http_connection_manager,envoy.mongo_proxy,envoy.ratelimit,envoy.redis_proxy,envoy.tcp_proxy
[2024-07-22 12:43:25.959][1][info][main] [source/server/server.cc:264] stat_sinks: envoy.dog_statsd,envoy.metrics_service,envoy.stat_sinks.hystrix,envoy.statsd
[2024-07-22 12:43:25.959][1][info][main] [source/server/server.cc:266] tracers: envoy.dynamic.ot,envoy.lightstep,envoy.tracers.datadog,envoy.tracers.opencensus,envoy.tracers.xray,envoy.zipkin
[2024-07-22 12:43:25.959][1][info][main] [source/server/server.cc:269] transport_sockets.downstream: envoy.transport_sockets.alts,envoy.transport_sockets.raw_buffer,envoy.transport_sockets.tap,envoy.transport_sockets.tls,raw_buffer,tls
[2024-07-22 12:43:25.959][1][info][main] [source/server/server.cc:272] transport_sockets.upstream: envoy.transport_sockets.alts,envoy.transport_sockets.raw_buffer,envoy.transport_sockets.tap,envoy.transport_sockets.tls,raw_buffer,tls
[2024-07-22 12:43:25.959][1][info][main] [source/server/server.cc:278] buffer implementation: new
[2024-07-22 12:43:25.963][1][warning][misc] [source/common/protobuf/utility.cc:282] Using deprecated option 'envoy.api.v2.listener.Filter.config' from file listener.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.
[2024-07-22 12:43:25.964][1][info][main] [source/server/server.cc:344] admin address: 0.0.0.0:9901
[2024-07-22 12:43:25.968][1][info][main] [source/server/server.cc:458] runtime: layers:
- name: base
static_layer:
{}
- name: admin
admin_layer:
{}
[2024-07-22 12:43:25.968][1][info][config] [source/server/configuration_impl.cc:62] loading 0 static secret(s)
[2024-07-22 12:43:25.968][1][info][config] [source/server/configuration_impl.cc:68] loading 1 cluster(s)
[2024-07-22 12:43:25.974][1][info][config] [source/server/configuration_impl.cc:72] loading 1 listener(s)
[2024-07-22 12:43:25.976][1][warning][misc] [source/common/protobuf/utility.cc:282] Using deprecated option 'envoy.api.v2.route.CorsPolicy.allow_origin' from file route.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.
[2024-07-22 12:43:25.977][1][info][config] [source/server/configuration_impl.cc:97] loading tracing configuration
[2024-07-22 12:43:25.977][1][info][config] [source/server/configuration_impl.cc:117] loading stats sink configuration
[2024-07-22 12:43:25.977][1][info][main] [source/server/server.cc:549] starting main dispatch loop
[2024-07-22 12:43:25.978][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:161] cm init: all clusters initialized
[2024-07-22 12:43:25.978][1][info][main] [source/server/server.cc:528] all clusters initialized. initializing init manager
[2024-07-22 12:43:25.978][1][info][config] [source/server/listener_manager_impl.cc:578] all dependencies initialized. starting workers
We see that when envoy works as expected, the following lines appear:
[source/common/protobuf/utility.cc:282] Using deprecated option 'envoy.api.v2.listener.Filter.config' from file listener.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.
...
[2024-07-22 12:43:25.976][1][warning][misc] [source/common/protobuf/utility.cc:282] Using deprecated option 'envoy.api.v2.route.CorsPolicy.allow_origin' from file route.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.
and the `all clusters initialized` message appears towards the end of the logs (whereas in the unhealthy pod it appears earlier). We also notice that, in the healthy pod, the configuration code loads the resources before the cluster-initialization message:
[2024-07-22 12:43:25.968][1][info][config] [source/server/configuration_impl.cc:62] loading 0 static secret(s)
[2024-07-22 12:43:25.968][1][info][config] [source/server/configuration_impl.cc:68] loading 1 cluster(s)
[2024-07-22 12:43:25.974][1][info][config] [source/server/configuration_impl.cc:72] loading 1 listener(s)
compared to when it is not working:
[2024-07-22 08:09:56.649][1][info][config] [source/server/configuration_impl.cc:62] loading 0 static secret(s)
[2024-07-22 08:09:56.649][1][info][config] [source/server/configuration_impl.cc:68] loading 0 cluster(s)
[2024-07-22 08:09:56.649][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:161] cm init: all clusters initialized
[2024-07-22 08:09:56.649][1][info][config] [source/server/configuration_impl.cc:72] loading 0 listener(s)
and the unhealthy pod's log ends with:
[2024-07-22 08:09:56.650][1][info][main] [source/server/server.cc:549] starting main dispatch loop
[2024-07-22 08:24:56.653][1][info][main] [source/server/drain_manager_impl.cc:63] shutting down parent after drain
EDIT: the `shutting down parent after drain` message is also observed on healthy envoy pods.
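Based on the difference above, a quick way to tell whether a given envoy pod hit the race is to check how many clusters and listeners it loaded at startup, e.g.:

```shell
# A pod that hit the race logs "loading 0 cluster(s)" and "loading 0 listener(s)",
# while a healthy pod loads 1 of each (pod name is a placeholder)
kubectl logs -n kubeflow <envoy-pod-name> -c envoy | grep -E 'loading [0-9]+ (cluster|listener)'
```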
The envoy.json file is identical in both pods:
{"admin": {"access_log_path": "/tmp/admin_access.log", "address": {"socket_address": {"address": "0.0.0.0", "port_value": "9901"}}}, "static_resources": {"listeners": [{"name": "listener_0", "address": {"socket_address": {"address": "0.0.0.0", "port_value": 9090}}, "filter_chains": [{"filters": [{"name": "envoy.http_connection_manager", "config": {"stat_prefix": "ingress_http", "route_config": {"name": "local_route", "virtual_hosts": [{"name": "local_service", "domains": ["*"], "routes": [{"match": {"prefix": "/"}, "route": {"cluster": "metadata-grpc-service", "max_grpc_timeout": "60.000s"}}], "cors": {"allow_origin": ["*"], "allow_methods": "GET,PUT,DELETE,POST,OPTIONS", "allow_headers": "cache-control,content-transfer-encoding,content-type,grpc-timeout,keep-alive,user-agent,x-accept-content-transfer-encoding,x-accept-response-streaming,x-grpc-web,x-user-agent,custom-header-1", "expose_headers": "grpc-status,grpc-message,custom-header-1", "max_age": "1728000"}}]}, "http_filters": [{"name": "envoy.grpc_web"}, {"name": "envoy.cors"}, {"name": "envoy.router"}]}}]}]}], "clusters": [{"name": "metadata-grpc-service", "type": "LOGICAL_DNS", "connect_timeout": "30.000s", "hosts": [{"socket_address": {"address": "metadata-grpc-service", "port_value": 8080}}], "http2_protocol_options": {}}]}}
Trying to curl the envoy endpoint, I see that it returns connection refused:
ubuntu@dwzj5tbcjnboiBootstrapVm:~$ kubectl exec envoy-operator-0 -n kubeflow -c juju-operator -- curl envoy.kubeflow.svc.cluster.local:9090 -H 'Content-Type: application/grpc-web-text' -I
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
curl: (7) Failed to connect to envoy.kubeflow.svc.cluster.local port 9090: Connection refused
command terminated with exit code 7
whereas it is expected to return a 200 when envoy-operator works, as we check in integration tests:
╰─$ kubectl exec envoy-operator-0 -n kubeflow -c juju-operator -- curl envoy.kubeflow.svc.cluster.local:9090 -H 'Content-Type: application/grpc-web-text' -I
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
HTTP/1.1 200 OK
content-type: application/grpc-web+proto
grpc-status: 14
date: Tue, 30 Jul 2024 08:40:50 GMT
server: envoy
We didn't figure out what circumstances trigger this. However, restarting the container (or deleting the pod) always resulted in a functional workload container. That's why, since the 2.0 version is a PodSpec charm, we'll be adding an httpGet livenessProbe, as described in the podSpec YAML and the Kubernetes docs. The probe hits the same endpoint we check during integration tests in order to ensure its functionality; if the endpoint stops responding with a 200, the livenessProbe restarts the container.
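For illustration, a minimal sketch of what such a probe can look like in plain Kubernetes container-spec terms; the path and timing values here are assumptions, not necessarily what the charm sets (see the PR below for the actual change):

```yaml
containers:
  - name: envoy
    livenessProbe:
      httpGet:
        path: /             # the grpc-web listener endpoint we also curl in integration tests
        port: 9090
      initialDelaySeconds: 10   # illustrative values only
      periodSeconds: 30
      failureThreshold: 3       # kubelet restarts the container after 3 consecutive failures
```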
The fix has been merged in https://github.com/canonical/envoy-operator/pull/124 and promoted to 2.0/stable.
Bug Description
Running the kfp_v2 integration test from https://github.com/canonical/charmed-kubeflow-uats/tree/main/tests at commit [0], the experiment shows:
Cannot get MLMD objects from Metadata store.
[0] fe86b4e255c4c695376f70061a6a645301350d5a
To Reproduce
Run kfp-v2-integration.ipynb; the run fails with
Cannot get MLMD objects from Metadata store.
which details out as
Environment
Relevant Log Output
Additional Context
No response