envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

What will happen to WebSocket connections when Envoy reloads new EnvoyFilters (hot restart)? #34498

Open Yufeireal opened 1 month ago

Yufeireal commented 1 month ago

Hi team, I have a general question about what happens to in-flight requests when Envoy hot-reloads its configuration.

We're using Istio in our EKS cluster, and I recently added an access-logging EnvoyFilter. I've observed an interesting thing:

When I change the EnvoyFilter configuration, the upstream service temporarily receives roughly "doubled" requests; latency jumps and then recovers within about a minute.

[image attachment]

So I am wondering what actually happens under the hood. Is this specifically related to access log changes? Thanks!

If someone could tell me more about how Envoy hot-reloads this config, that would be very helpful. Thanks in advance.

Here is the EnvoyFilter:

spec:
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      context: GATEWAY
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
    patch:
      operation: MERGE
      value:
        name: envoy.filters.network.http_connection_manager
        typed_config:
          '@type': type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
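          # Log only non-health-check requests that either have a response status >= 402
          # or fall into a 1% random sample.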
          access_log:
          - filter:
              and_filter:
                filters:
                - not_health_check_filter: {}
                - or_filter:
                    filters:
                    - status_code_filter:
                        comparison:
                          op: GE
                          value:
                            default_value: 402
                            runtime_key: "null"
                    - runtime_filter:
                        percent_sampled:
                          denominator: HUNDRED
                          numerator: 1
                        runtime_key: "null"
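            # Matching entries are exported to the in-cluster OpenTelemetry collector over gRPC.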
            name: envoy.access_loggers.open_telemetry
            typed_config:
              '@type': type.googleapis.com/envoy.extensions.access_loggers.open_telemetry.v3.OpenTelemetryAccessLogConfig
              body:
                string_value: |-
                  {SOME FORMAT STUFF}
              common_config:
                grpc_service:
                  envoy_grpc:
                    authority: opentelemetry-collector.opentelemetry.svc.cluster.local
                    cluster_name: outbound|4317||opentelemetry-collector.opentelemetry.svc.cluster.local
                log_name: main_gw_http_al
                transport_api_version: V3
              resource_attributes:
                values:
                - key: log_category
                  value:
                    string_value: envoy-access-logs
  workloadSelector:
    labels:
      app: istio-ingressgateway
Yufeireal commented 1 month ago

Here is another example: another request's traffic jumped too. [image attachment]

Yufeireal commented 1 month ago

Oh, the requests might not actually be doubled. It's the WebSocket connections getting terminated and reconnected, which looks expected when I update the filter above (the listener/filter-chain replacement causes clients to reconnect). I would appreciate it if someone could share some insight into what happens under the hood, and whether configuring the drain time would help here. Thanks!
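For context on the drain-time question: when a listener or its filter chain is replaced, Envoy drains the downstream connections that were accepted under the old configuration over a configurable window (the --drain-time-s and --drain-strategy server options). On an Istio gateway this window is normally tuned through ProxyConfig rather than raw Envoy flags. A minimal sketch, assuming an Istio release that still exposes drainDuration under meshConfig.defaultConfig (field names and defaults should be checked against your Istio version):

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      # Window over which the proxy drains existing connections when listeners
      # are modified or the proxy restarts (Istio uses this to set Envoy's drain time).
      drainDuration: 60s
      # How long the proxy keeps serving existing connections after SIGTERM
      # before exiting, e.g. during a pod rollout.
      terminationDrainDuration: 30s

If your Istio version supports per-workload overrides, the same fields can be set with the proxy.istio.io/config pod annotation, which may be easier to test on just the ingress gateway deployment.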

adisuissa commented 1 month ago

I'm not certain what kind of update happens in your case, but in general:

Yufeireal commented 1 month ago

Thanks a lot for the info. I did patch this into the listener's filterChain, so this should be expected behavior from Envoy. I will also look at the doc links you shared. I think I have at least two things worth trying now:

  1. Tune the drain time for the Envoy hot restart / listener drain, and maybe do a manual rollout of the Envoy pods while they are draining.
  2. Experiment with delivering this as an extension via ECDS instead of LDS and see what I can get (see the sketch after this list).
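On point 2, the general shape of an ECDS-delivered extension in an Istio EnvoyFilter looks roughly like the sketch below. It uses a Lua HTTP filter purely as a stand-in for an ECDS-deliverable extension; the name my-ecds-ext and the Lua body are hypothetical, and whether the OpenTelemetry access logger itself (which lives in the HttpConnectionManager access_log field rather than in an HTTP filter) can be moved behind ECDS is exactly what this experiment would need to confirm.

spec:
  configPatches:
  # Publish the extension's typed_config through ECDS so it can be updated
  # without replacing the listener filter chain.
  - applyTo: EXTENSION_CONFIG
    patch:
      operation: ADD
      value:
        name: my-ecds-ext
        typed_config:
          '@type': type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
          default_source_code:
            inline_string: |
              function envoy_on_response(response_handle)
                response_handle:logInfo("response observed")
              end
  # Insert an HTTP filter that pulls "my-ecds-ext" via config_discovery (ECDS)
  # instead of embedding its config in the listener (LDS).
  - applyTo: HTTP_FILTER
    match:
      context: GATEWAY
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
            subFilter:
              name: envoy.filters.http.router
    patch:
      operation: INSERT_BEFORE
      value:
        name: my-ecds-ext
        config_discovery:
          config_source:
            ads: {}
          type_urls:
          - type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
  workloadSelector:
    labels:
      app: istio-ingressgateway

With this shape, updating the EXTENSION_CONFIG patch should only push a new ECDS resource rather than a new listener, so established connections would not be drained; that property is the reason to try ECDS here.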
github-actions[bot] commented 4 days ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.