Envoy Operator

Charm stuck in `WaitingStatus` because of `error initializing configuration '/envoy/envoy.yaml'` #114

Closed: DnPlas closed this 1 day ago

DnPlas commented 5 days ago

Bug Description

It looks like the configuration in `/envoy/envoy.yaml` is preventing the service from starting correctly, leaving the unit in `WaitingStatus` without a clear resolution path.

From the logs I can see `Unable to parse JSON as proto (INVALID_ARGUMENT:(static_resources.listeners[0].filter_chains[0].filters[0].typed_config): invalid value Invalid type URL, unknown type: envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager for type Any)`, which suggests that the Envoy binary in the workload image does not recognize this v3 type URL.
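
For readability, this is the listener filter stanza from the rendered config, reconstructed as YAML from the JSON dump in the log below and trimmed to the fields relevant to the error:

    static_resources:
      listeners:
        - name: listener_0
          address:
            socket_address: {address: 0.0.0.0, port_value: 9090}
          filter_chains:
            - filters:
                - name: envoy.filters.network.http_connection_manager
                  typed_config:
                    # the type URL the binary fails to resolve:
                    "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                    stat_prefix: ingress_http
                    codec_type: auto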

To Reproduce

  1. Deploy envoy: juju deploy envoy --channel latest/edge --trust
  2. Deploy mlmd: juju deploy mlmd --channel latest/edge --trust
  3. Relate them: juju relate envoy mlmd
  4. Observe (see the watch commands after this list)
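
To observe the failure as it happens, standard juju commands are enough (a sketch):

juju status --watch 5s
juju debug-log --include envoy/0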

Environment

  1. microk8s 1.29-strict/stable
  2. juju 3.4/stable (3.4.3)
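
For anyone reproducing from scratch, that environment maps to the following snap installs (a sketch; channels taken from the versions above):

sudo snap install microk8s --channel=1.29-strict/stable
sudo snap install juju --channel=3.4/stable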

Relevant Log Output

# ---- juju debug-log
unit-envoy-0: 21:22:06 ERROR unit.envoy/0.juju-log grpc:0: execute_components caught unhandled exception when executing configure_charm for envoy-component
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-envoy-0/charm/venv/charmed_kubeflow_chisme/components/charm_reconciler.py", line 92, in reconcile
    component_item.component.configure_charm(event)
  File "/var/lib/juju/agents/unit-envoy-0/charm/venv/charmed_kubeflow_chisme/components/component.py", line 50, in configure_charm
    self._configure_unit(event)
  File "/var/lib/juju/agents/unit-envoy-0/charm/venv/charmed_kubeflow_chisme/components/pebble_component.py", line 273, in _configure_unit
    self._update_layer()
  File "/var/lib/juju/agents/unit-envoy-0/charm/venv/charmed_kubeflow_chisme/components/pebble_component.py", line 284, in _update_layer
    container.replan()
  File "/var/lib/juju/agents/unit-envoy-0/charm/venv/ops/model.py", line 2211, in replan
    self._pebble.replan_services()
  File "/var/lib/juju/agents/unit-envoy-0/charm/venv/ops/pebble.py", line 1993, in replan_services
    return self._services_action('replan', [], timeout, delay)
  File "/var/lib/juju/agents/unit-envoy-0/charm/venv/ops/pebble.py", line 2090, in _services_action
    raise ChangeError(change.err, change)
ops.pebble.ChangeError: cannot perform the following tasks:
- Start service "envoy" (cannot start service: exited quickly with code 1)
----- Logs from task 0 -----
2024-06-27T21:22:06Z INFO Most recent service output:
    [2024-06-27 21:22:06.011][14][info][main] [source/server/server.cc:249] initializing epoch 0 (hot restart version=11.104)
    [2024-06-27 21:22:06.011][14][info][main] [source/server/server.cc:251] statically linked extensions:
    [2024-06-27 21:22:06.011][14][info][main] [source/server/server.cc:253]   access_loggers: envoy.file_access_log,envoy.http_grpc_access_log,envoy.tcp_grpc_access_log
    [2024-06-27 21:22:06.011][14][info][main] [source/server/server.cc:256]   filters.http: envoy.buffer,envoy.cors,envoy.csrf,envoy.ext_authz,envoy.fault,envoy.filters.http.adaptive_concurrency,envoy.filters.http.dynamic_forward_proxy,envoy.filters.http.grpc_http1_reverse_bridge,envoy.filters.http.grpc_stats,envoy.filters.http.header_to_metadata,envoy.filters.http.jwt_authn,envoy.filters.http.original_src,envoy.filters.http.rbac,envoy.filters.http.tap,envoy.grpc_http1_bridge,envoy.grpc_json_transcoder,envoy.grpc_web,envoy.gzip,envoy.health_check,envoy.http_dynamo_filter,envoy.ip_tagging,envoy.lua,envoy.rate_limit,envoy.router,envoy.squash
    [2024-06-27 21:22:06.011][14][info][main] [source/server/server.cc:259]   filters.listener: envoy.listener.http_inspector,envoy.listener.original_dst,envoy.listener.original_src,envoy.listener.proxy_protocol,envoy.listener.tls_inspector
    [2024-06-27 21:22:06.011][14][info][main] [source/server/server.cc:262]   filters.network: envoy.client_ssl_auth,envoy.echo,envoy.ext_authz,envoy.filters.network.dubbo_proxy,envoy.filters.network.mysql_proxy,envoy.filters.network.rbac,envoy.filters.network.sni_cluster,envoy.filters.network.thrift_proxy,envoy.filters.network.zookeeper_proxy,envoy.http_connection_manager,envoy.mongo_proxy,envoy.ratelimit,envoy.redis_proxy,envoy.tcp_proxy
    [2024-06-27 21:22:06.011][14][info][main] [source/server/server.cc:264]   stat_sinks: envoy.dog_statsd,envoy.metrics_service,envoy.stat_sinks.hystrix,envoy.statsd
    [2024-06-27 21:22:06.011][14][info][main] [source/server/server.cc:266]   tracers: envoy.dynamic.ot,envoy.lightstep,envoy.tracers.datadog,envoy.tracers.opencensus,envoy.tracers.xray,envoy.zipkin
    [2024-06-27 21:22:06.011][14][info][main] [source/server/server.cc:269]   transport_sockets.downstream: envoy.transport_sockets.alts,envoy.transport_sockets.raw_buffer,envoy.transport_sockets.tap,envoy.transport_sockets.tls,raw_buffer,tls
    [2024-06-27 21:22:06.011][14][info][main] [source/server/server.cc:272]   transport_sockets.upstream: envoy.transport_sockets.alts,envoy.transport_sockets.raw_buffer,envoy.transport_sockets.tap,envoy.transport_sockets.tls,raw_buffer,tls
    [2024-06-27 21:22:06.011][14][info][main] [source/server/server.cc:278] buffer implementation: new
    [2024-06-27 21:22:06.014][14][critical][main] [source/server/server.cc:95] error initializing configuration '/envoy/envoy.yaml': Unable to parse JSON as proto (INVALID_ARGUMENT:(static_resources.listeners[0].filter_chains[0].filters[0].typed_config): invalid value Invalid type URL, unknown type: envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager for type Any): {"static_resources":{"clusters":[{"load_assignment":{"endpoints":[{"lb_endpoints":[{"endpoint":{"address":{"socket_address":{"port_value":8080,"address":"metadata-grpc-service"}}}}]}],"cluster_name":"metadata-grpc"},"lb_policy":"round_robin","type":"logical_dns","typed_extension_protocol_options":{"envoy.extensions.upstreams.http.v3.HttpProtocolOptions":{"explicit_http_config":{"http2_protocol_options":{}},"@type":"type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions"}},"name":"metadata-cluster","connect_timeout":"30.0s"}],"listeners":[{"filter_chains":[{"filters":[{"typed_config":{"http_filters":[{"typed_config":{"@type":"type.googleapis.com/envoy.extensions.filters.http.grpc_web.v3.GrpcWeb"},"name":"envoy.filters.http.grpc_web"},{"typed_config":{"@type":"type.googleapis.com/envoy.extensions.filters.http.cors.v3.Cors"},"name":"envoy.filters.http.cors"},{"typed_config":{"@type":"type.googleapis.com/envoy.extensions.filters.http.router.v3.Router"},"name":"envoy.filters.http.router"}],"route_config":{"virtual_hosts":[{"routes":[{"typed_per_filter_config":{"envoy.filter.http.cors":{"max_age":"1728000","allow_headers":"keep-alive,user-agent,cache-control,content-type,content-transfer-encoding,custom-header-1,x-accept-content-transfer-encoding,x-accept-response-streaming,x-user-agent,x-grpc-web,grpc-timeout","allow_methods":"GET, PUT, DELETE, POST, OPTIONS","@type":"type.googleapis.com/envoy.extensions.filters.http.cors.v3.CorsPolicy","expose_headers":"custom-header-1,grpc-status,grpc-message","allow_origin_string_match":[{"safe_regex":{"regex":".*"}}]}},"match":{"prefix":"/"},"route":{"max_stream_duration":{"grpc_timeout_header_max":"0s"},"cluster":"metadata-cluster"}}],"name":"local_service","domains":["*"]}],"name":"local_route"},"stat_prefix":"ingress_http","@type":"type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager","codec_type":"auto"},"name":"envoy.filters.network.http_connection_manager"}]}],"name":"listener_0","address":{"socket_address":{"port_value":9090,"address":"0.0.0.0"}}}]},"admin":{"address":{"socket_address":{"port_value":9901,"address":"0.0.0.0"}},"access_log":{"typed_config":{"path":"/tmp/admin_access.log","@type":"type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog"},"name":"admin_access"}}}
    [2024-06-27 21:22:06.014][14][info][main] [source/server/server.cc:594] exiting
    Unable to parse JSON as proto (INVALID_ARGUMENT:(static_resources.listeners[0].filter_chains[0].filters[0].typed_config): invalid value Invalid type URL, unknown type: envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager for type Any): {"static_resources":{"clusters":[{"load_assignment":{"endpoints":[{"lb_endpoints":[{"endpoint":{"address":{"socket_address":{"port_value":8080,"address":"metadata-grpc-service"}}}}]}],"cluster_name":"metadata-grpc"},"lb_policy":"round_robin","type":"logical_dns","typed_extension_protocol_options":{"envoy.extensions.upstreams.http.v3.HttpProtocolOptions":{"explicit_http_config":{"http2_protocol_options":{}},"@type":"type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions"}},"name":"metadata-cluster","connect_timeout":"30.0s"}],"listeners":[{"filter_chains":[{"filters":[{"typed_config":{"http_filters":[{"typed_config":{"@type":"type.googleapis.com/envoy.extensions.filters.http.grpc_web.v3.GrpcWeb"},"name":"envoy.filters.http.grpc_web"},{"typed_config":{"@type":"type.googleapis.com/envoy.extensions.filters.http.cors.v3.Cors"},"name":"envoy.filters.http.cors"},{"typed_config":{"@type":"type.googleapis.com/envoy.extensions.filters.http.router.v3.Router"},"name":"envoy.filters.http.router"}],"route_config":{"virtual_hosts":[{"routes":[{"typed_per_filter_config":{"envoy.filter.http.cors":{"max_age":"1728000","allow_headers":"keep-alive,user-agent,cache-control,content-type,content-transfer-encoding,custom-header-1,x-accept-content-transfer-encoding,x-accept-response-streaming,x-user-agent,x-grpc-web,grpc-timeout","allow_methods":"GET, PUT, DELETE, POST, OPTIONS","@type":"type.googleapis.com/envoy.extensions.filters.http.cors.v3.CorsPolicy","expose_headers":"custom-header-1,grpc-status,grpc-message","allow_origin_string_match":[{"safe_regex":{"regex":".*"}}]}},"match":{"prefix":"/"},"route":{"max_stream_duration":{"grpc_timeout_header_max":"0s"},"cluster":"metadata-cluster"}}],"name":"local_service","domains":["*"]}],"name":"local_route"},"stat_prefix":"ingress_http","@type":"type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager","codec_type":"auto"},"name":"envoy.filters.network.http_connection_manager"}]}],"name":"listener_0","address":{"socket_address":{"port_value":9090,"address":"0.0.0.0"}}}]},"admin":{"address":{"socket_address":{"port_value":9901,"address":"0.0.0.0"}},"access_log":{"typed_config":{"path":"/tmp/admin_access.log","@type":"type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog"},"name":"admin_access"}}}
2024-06-27T21:22:06Z ERROR cannot start service: exited quickly with code 1
-----

# ---- juju status

Model     Controller  Cloud/Region        Version  SLA          Timestamp
kubeflow  uk8s-343    microk8s/localhost  3.4.3    unsupported  21:30:29Z

App    Version  Status   Scale  Charm  Channel      Rev  Address         Exposed  Message
envoy           waiting      1  envoy  latest/edge  230  10.152.183.165  no       installing agent
mlmd            active       1  mlmd   latest/edge  197  10.152.183.98   no

Unit      Workload  Agent  Address      Ports  Message
envoy/0*  waiting   idle   10.1.60.154         [envoy-component] Waiting for Pebble services (envoy).  If this persists, it could be a blocking configuration error.
mlmd/0*   active    idle   10.1.60.153
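
The same service output can also be read straight from the workload container with kubectl, bypassing juju (a sketch; assumes the kubeflow model maps to a kubeflow namespace and the workload container is named envoy):

microk8s kubectl logs -n kubeflow envoy-0 -c envoy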

Additional Context

Strangely enough, this is not caught by envoy's own CI - I have run two attempts on HEAD and they both succeed. This behaviour was caught by the kfp-operators CI here. I was also able to reproduce it locally.

syncronize-issues-to-jira[bot] commented 5 days ago

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5907.

This message was autogenerated

DnPlas commented 5 days ago

envoy 2.0/stable

In a model with envoy 2.0/stable this issue is not present:

Model     Controller  Cloud/Region        Version  SLA          Timestamp
kubeflow  uk8s-343    microk8s/localhost  3.4.3    unsupported  21:51:30Z

App                   Version                Status  Scale  Charm          Channel       Rev  Address         Exposed  Message
envoy                 res:oci-image@cc06b3e  active      1  envoy          2.0/stable    194  10.152.183.154  no
istio-ingressgateway                         active      1  istio-gateway  1.17/stable  1000  10.152.183.112  no
istio-pilot                                  active      1  istio-pilot    1.17/stable  1011  10.152.183.166  no
mlmd                  res:oci-image@44abc5d  active      1  mlmd           1.14/stable   127  10.152.183.167  no

Unit                     Workload  Agent  Address      Ports          Message
envoy/1*                 active    idle   10.1.60.145  9090,9901/TCP
istio-ingressgateway/0*  active    idle   10.1.60.158
istio-pilot/0*           active    idle   10.1.60.156
mlmd/1*                  active    idle   10.1.60.157  8080/TCP

Integration provider     Requirer                          Interface          Type     Message
istio-pilot:ingress      envoy:ingress                     ingress            regular
istio-pilot:istio-pilot  istio-ingressgateway:istio-pilot  k8s-service        regular
istio-pilot:peers        istio-pilot:peers                 istio_pilot_peers  peer
mlmd:grpc                envoy:grpc                        grpc               regular

I noticed that in this version of the charm, we block the unit if the relation with istio-pilot is missing, so I had to deploy istio-operators to get the envoy unit to active. After that, the reported issue is not present.
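
For reference, that model can be assembled with something like the following (a sketch inferred from the status and integration tables above; the istio-gateway charm may additionally need its kind config set, as shown):

juju deploy envoy --channel 2.0/stable --trust
juju deploy mlmd --channel 1.14/stable --trust
juju deploy istio-pilot --channel 1.17/stable --trust
juju deploy istio-gateway istio-ingressgateway --channel 1.17/stable --trust --config kind=ingress
juju relate istio-pilot:istio-pilot istio-ingressgateway:istio-pilot
juju relate istio-pilot:ingress envoy:ingress
juju relate mlmd:grpc envoy:grpc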

orfeas-k commented 5 days ago

That's weird, because when the envoy.yaml was updated, it was tested by myself and the PR's reviewer: https://github.com/canonical/envoy-operator/pull/102#pullrequestreview-2107671499.

orfeas-k commented 5 days ago

Ok so something's wrong with the charm's image. I tried the following, and it made the charm go active:

juju refresh envoy --resource oci-image=gcr.io/ml-pipeline/metadata-envoy:2.2.0

which is the charm's default image.

I confirmed this by deploying the envoy charm with that image; it also went to active:

juju deploy envoy --channel latest/edge --trust --resource oci-image=gcr.io/ml-pipeline/metadata-envoy:2.2.0

Charm publishing

So it looks like the charm's publishing has been messed up.

Publishing from track/2.0

You can see that the charm was published using oci-image revision 104: https://github.com/canonical/envoy-operator/actions/runs/9662384372/job/26652904155#step:5:180

Publishing from main

You can see that the charm was published again using oci-image revision 104: https://github.com/canonical/envoy-operator/actions/runs/9701404321/job/26782877651#step:5:184

What happened exactly
  1. We updated envoy in latest/edge using a new image. That created a new resource (oci-image:102), and the charm was published using that new resource.
  2. Something then created newer resources. I'm not sure what that was, but we can see that the publish jobs from track/2.0 use oci-image:104 as the resource.
  3. We updated the envoy charm in latest/edge again (with no change to the image). The publish job also used the latest available resource, oci-image:104.

This results in both tracks being published using the same image, although their metadata.yaml files define different ones.
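
One way to check which resource revision each channel is actually pinned to is charmcraft's status view (requires publisher access to the charm):

charmcraft status envoy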

orfeas-k commented 5 days ago

Charm resource publishing history

The charm has been published with the following resources:

I'm also not sure what revision 103 is, since the charm image in main didn't change after 10th June.
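
That per-resource publishing history can be listed with charmcraft (again, requires publisher access):

charmcraft resource-revisions envoy oci-image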

orfeas-k commented 1 day ago

After transferring this charm to kubeflow-charmers, we re-released envoy with the resource it had been released with when we updated the manifests, executing:

charmcraft release envoy --revision=231 --channel latest/edge --resource=oci-image:102

We'll be looking into the root cause of this as part of https://github.com/canonical/bundle-kubeflow/issues/962.
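
Anyone already hitting this on latest/edge should be able to pick up the fixed release with a plain refresh:

juju refresh envoy --channel latest/edge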