envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

XDS DeltaVirtualHosts gRPC config stream to xxx closed: grpc: received message larger than max #36169

Closed · dmavrommatis closed this issue 2 weeks ago

dmavrommatis commented 1 month ago

Title: xDS server sends larger message than max

Description: I have an Envoy configuration that uses RDS and has more than 50k routes. Even though I am using DELTA_GRPC, the proxy sometimes ends up unable to receive any new updates, with the error message:

[2024-09-16 17:17:12.814][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:190] DeltaVirtualHosts gRPC config stream to xds_cluster closed since 105s ago: 8, grpc: received message larger than max (21173076 vs. 4194304)

Locally, I created a gRPC client with a larger grpc.MaxCallRecvMsgSize(math.MaxInt32) and it worked, but I am curious whether this is something we want to make configurable on Envoy as well. Any idea why using the delta API is not enough, and why it batches updates larger than the gRPC client can handle?

I also saw that in grpc-go defaultServerMaxSendMessageSize = math.MaxInt32 while defaultClientMaxReceiveMessageSize = 1024 * 1024 * 4, which is exactly why the issue appears.
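For illustration, a minimal sketch along those lines (not my exact client; the address is a placeholder and the MaxCallRecvMsgSize call option is the only relevant part):

    // Sketch of a delta ADS client that raises the per-call receive limit.
    // xdsAddr is a placeholder; the rest is stock grpc-go / go-control-plane API.
    package main

    import (
        "context"
        "log"
        "math"

        discovery "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
    )

    func main() {
        const xdsAddr = "127.0.0.1:18000" // placeholder xDS server address

        // grpc-go's client default (defaultClientMaxReceiveMessageSize) is 4 MiB;
        // raising it keeps large delta responses from being rejected with
        // RESOURCE_EXHAUSTED (status code 8, as in the log above).
        conn, err := grpc.Dial(xdsAddr,
            grpc.WithTransportCredentials(insecure.NewCredentials()),
            grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(math.MaxInt32)),
        )
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        // Open a delta ADS stream; sending DeltaDiscoveryRequests and reading
        // the responses is omitted here.
        client := discovery.NewAggregatedDiscoveryServiceClient(conn)
        if _, err := client.DeltaAggregatedResources(context.Background()); err != nil {
            log.Fatal(err)
        }
    }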

Repro steps:

  1. Use a simple cache xDS server from https://github.com/envoyproxy/go-control-plane
  2. Add tens of thousands of routes via RDS (a rough sketch follows this list)
  3. Error message appears
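For step 2, a rough sketch of what the route blow-up looks like with the simple snapshot cache. The NewSnapshot/SetSnapshot signatures vary across go-control-plane versions (this follows the map-based API of recent releases), and the cluster name and node id below are placeholders:

    // Sketch for step 2: one RouteConfiguration big enough that the serialized
    // delta response easily exceeds grpc-go's 4 MiB client-side default.
    package main

    import (
        "context"
        "fmt"
        "log"

        route "github.com/envoyproxy/go-control-plane/envoy/config/route/v3"
        "github.com/envoyproxy/go-control-plane/pkg/cache/types"
        cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
        resource "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
    )

    func main() {
        // Build tens of thousands of routes under a single virtual host.
        vh := &route.VirtualHost{Name: "all", Domains: []string{"*"}}
        for i := 0; i < 50000; i++ {
            vh.Routes = append(vh.Routes, &route.Route{
                Name: fmt.Sprintf("route-%d", i),
                Match: &route.RouteMatch{
                    PathSpecifier: &route.RouteMatch_Prefix{Prefix: fmt.Sprintf("/svc-%d/", i)},
                },
                Action: &route.Route_Route{
                    Route: &route.RouteAction{
                        ClusterSpecifier: &route.RouteAction_Cluster{Cluster: "some_cluster"}, // placeholder
                    },
                },
            })
        }
        rc := &route.RouteConfiguration{Name: "big_route_config", VirtualHosts: []*route.VirtualHost{vh}}

        // Push the route config into the simple snapshot cache.
        snapshotCache := cachev3.NewSnapshotCache(false, cachev3.IDHash{}, nil)
        snap, err := cachev3.NewSnapshot("1", map[resource.Type][]types.Resource{
            resource.RouteType: {rc},
        })
        if err != nil {
            log.Fatal(err)
        }
        // "envoy-node-id" stands in for whatever node id Envoy reports (xdsNodeID above).
        if err := snapshotCache.SetSnapshot(context.Background(), "envoy-node-id", snap); err != nil {
            log.Fatal(err)
        }
        // Wrap snapshotCache with server.NewServer and register it on a gRPC
        // server exactly as in the go-control-plane examples; omitted here.
    }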

Config: envoy.yaml

    admin:
      access_log_path: /dev/null
      address:
        socket_address:
          address: 0.0.0.0
          port_value: {{ .Values.proxy.config.info_port }}
    dynamic_resources:
      ads_config:
        api_type: DELTA_GRPC
        transport_api_version: V3
        grpc_services:
          - envoy_grpc:
              cluster_name: xds_cluster
        set_node_on_first_message_only: true
      cds_config:
        resource_api_version: V3
        ads: { }
      lds_config:
        path_config_source:
          path: {{ .Values.configgen.config.lds_path }}
    node:
      cluster: envoy-cluster
      id: {{ .Values.global.xdsNodeID }}
    static_resources:
      clusters:
        - name: xds_cluster
          type: STRICT_DNS
          connect_timeout: 10s
          load_assignment:
            cluster_name: xds_cluster
            endpoints:
              - lb_endpoints:
                  - endpoint:
                      address:
                        socket_address:
                          address: {{ .Values.global.xdsAddress }}
                          port_value: {{ .Values.global.xdsPort }}
          http2_protocol_options: { }
    layered_runtime:
      layers:
        - name: runtime-0
          rtds_layer:
            rtds_config:
              resource_api_version: V3
              api_config_source:
                transport_api_version: V3
                api_type: DELTA_GRPC
                grpc_services:
                  - envoy_grpc:
                      cluster_name: xds_cluster
            name: runtime-0

lds.yaml

version_info: "0"
resources:
  - "@type": "type.googleapis.com/envoy.config.listener.v3.Listener"
    name: http_listener
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 80
    filter_chains:
      - filters:
          - name: envoy.filters.network.http_connection_manager
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
              stat_prefix: http
              codec_type: AUTO
              server_name: "abc"
              strip_any_host_port: true
              rds:
                route_config_name: "{{ .Route_config_name }}"
                config_source:
                  resource_api_version: V3
                  api_config_source:
                    api_type: DELTA_GRPC
                    transport_api_version: V3
                    grpc_services:
                      - envoy_grpc:
                          cluster_name: xds_cluster
                    set_node_on_first_message_only: true
              http_filters:
                - name: envoy.filters.http.router
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

Logs:

[2024-09-16 17:17:12.814][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:190] DeltaVirtualHosts gRPC config stream to xds_cluster closed since 105s ago: 8, grpc: received message larger than max (21173076 vs. 4194304)
zuercher commented 1 month ago

Based on my knowledge of delta xDS, we could probably split up the subscription requests to avoid hitting the default receive message size limit. But that limit is somewhat arbitrary. I don't recall it showing up in the gRPC spec, and there's nothing to prevent a different gRPC server implementation from choosing a different limit. I think this ends up being a well-meaning default for servers with untrusted clients that trips up systems with trusted clients as their scale grows.
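For example, in grpc-go both limits are just per-process options (the sizes below are purely illustrative), which is what makes the 4 MiB figure a library default rather than anything the protocol requires:

    // Illustrative only: grpc-go lets each server (and client) pick its own
    // message-size limits, so different implementations and deployments can
    // and do choose different values.
    package main

    import "google.golang.org/grpc"

    func main() {
        srv := grpc.NewServer(
            grpc.MaxRecvMsgSize(16*1024*1024), // bytes this server will accept
            grpc.MaxSendMsgSize(64*1024*1024), // bytes this server will send
        )
        defer srv.Stop()
        // ... register the xDS services and call srv.Serve(listener) as usual.
    }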

dmavrommatis commented 1 month ago

> Based on my knowledge of delta xDS, we could probably split up the subscription requests to avoid hitting the default receive message size limit. But that limit is somewhat arbitrary. I don't recall it showing up in the gRPC spec, and there's nothing to prevent a different gRPC server implementation from choosing a different limit. I think this ends up being a well-meaning default for servers with untrusted clients that trips up systems with trusted clients as their scale grows.

I am using the https://github.com/envoyproxy/go-control-plane implementation of the xDS server, and it looks like it doesn't split anything up; it just sends all the deltas in full, regardless of size.

In any case, the 4MB size limit on Envoy's receiving end seems very low. I haven't used/seen other control-plane implementations (e.g. https://github.com/envoyproxy/java-control-plane), so it might be that only the golang one is problematic and does not split up messages.

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 2 weeks ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.