envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io

Slow proxy when request has large payloads #35732

Open kelvingl opened 4 weeks ago

kelvingl commented 4 weeks ago

Title: Slow proxy when request has large payloads

Description: Hi everyone! We are experiencing an issue with Envoy when handling requests with large payloads (around 3MB, without converting to binary). We use Istio and have identified a significant processing delay in the sidecar of the pods receiving the requests. For small payloads (around 3KB) the processing time is 1.68ms, whereas for large payloads (around 3MB) it increases to 2.57s. We are using gRPC for communication and have tested both unary and streaming requests, with no significant improvement. There are no noticeable changes in the sidecar's CPU and memory metrics.
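
For reference, one way to check whether HTTP/2 flow control is what is throttling the transfer is to watch the sidecar's flow-control stats while a large request is in flight. The commands below are only a sketch: the pod name is a placeholder, they assume Istio's usual istio-proxy sidecar with the Envoy admin interface reachable through pilot-agent, and the exact stats present depend on the Envoy version.

# Hypothetical pod name; replace with a pod that receives the large requests.
POD=my-app-grpc-0

# Query the sidecar's Envoy admin interface via pilot-agent and watch
# HTTP/2 / flow-control related stats while a ~3MB request is in flight.
kubectl exec "$POD" -c istio-proxy -- pilot-agent request GET stats \
  | grep -E 'pending_send_bytes|flow_control_paused|flow_control_resumed'

If these stats move noticeably during the large requests while CPU stays flat, the delay is more likely window/buffer related than compute related.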

We have also tried configuring kernel parameters and flow-control settings, but without success:

sysctl -w net.netfilter.nf_conntrack_max=6553600
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=120
sysctl -w net.ipv4.tcp_synack_retries=1
sysctl -w net.ipv4.tcp_syn_retries=1
sysctl -w net.ipv4.ip_forward=1
sysctl -w net.ipv4.tcp_fin_timeout=15
sysctl -w net.ipv4.tcp_keepalive_time=60
sysctl -w net.ipv4.conf.all.forwarding=1
sysctl -w net.ipv4.ip_forward=1
sysctl -w net.core.somaxconn=16384
sysctl -w net.ipv4.ip_local_port_range="1024 65000"
sysctl -w fs.file-max=2097152
sysctl -w fs.inotify.max_user_instances=16384
sysctl -w fs.inotify.max_user_watches=524288
sysctl -w fs.inotify.max_queued_events=16384
sysctl -w fs.suid_dumpable=2

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: flow-control-filter
spec:
  workloadSelector:
    labels:
      app: my-app-grpc
  configPatches:
  - applyTo: CLUSTER
    patch:
      operation: MERGE
      value:
        http2_protocol_options:
          max_concurrent_streams: 100000
          initial_stream_window_size: 65536
          initial_connection_window_size: 65536
        per_connection_buffer_limit_bytes: 65536
  - applyTo: NETWORK_FILTER
    match:
      listener:
        filterChain:
          filter:
            name: "envoy.filters.network.http_connection_manager"
    patch:
      operation: MERGE
      value:
        typed_config:
          "@type": "type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager"
          http2_protocol_options:
            max_concurrent_streams: 100000
            initial_stream_window_size: 65536
            initial_connection_window_size: 65536
  - applyTo: LISTENER
    patch:
      operation: MERGE
      value:
        per_connection_buffer_limit_bytes: 65536
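
One thing worth calling out in the filter above: initial_stream_window_size, initial_connection_window_size and per_connection_buffer_limit_bytes are all set to 65536, so a ~3MB message can only move about 64 KiB at a time before Envoy has to wait for window updates or for the buffer to drain. Below is a sketch of a variant with larger limits; the filter name and the 8 MiB / 16 MiB / 1 MiB numbers are placeholders to experiment with, not recommended values.

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: flow-control-filter-large-windows  # placeholder name
spec:
  workloadSelector:
    labels:
      app: my-app-grpc
  configPatches:
  - applyTo: CLUSTER
    patch:
      operation: MERGE
      value:
        http2_protocol_options:
          initial_stream_window_size: 8388608       # 8 MiB per stream
          initial_connection_window_size: 16777216  # 16 MiB per connection
        per_connection_buffer_limit_bytes: 1048576  # 1 MiB
  - applyTo: NETWORK_FILTER
    match:
      listener:
        filterChain:
          filter:
            name: "envoy.filters.network.http_connection_manager"
    patch:
      operation: MERGE
      value:
        typed_config:
          "@type": "type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager"
          http2_protocol_options:
            initial_stream_window_size: 8388608
            initial_connection_window_size: 16777216
  - applyTo: LISTENER
    patch:
      operation: MERGE
      value:
        per_connection_buffer_limit_bytes: 1048576

If raising these limits does not change the 2.57s number, that at least rules flow control and buffering out before digging further into the write path.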

We performed CPU profiling during the requests and identified that the Envoy::Network::ConnectionImpl::onWriteReady call is the main time-consuming operation.
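
It may also be worth confirming that the EnvoyFilter MERGE actually landed on the listener and cluster in question, since a patch that does not match simply leaves the defaults in place. A sketch using istioctl, again with a placeholder pod name (the JSON output uses the camelCase forms of the field names above):

# Hypothetical pod name; replace with a pod that receives the large requests.
POD=my-app-grpc-0

# Check which window sizes and buffer limits the sidecar is actually running with.
istioctl proxy-config cluster "$POD" -o json \
  | grep -E 'initialStreamWindowSize|initialConnectionWindowSize|perConnectionBufferLimitBytes'
istioctl proxy-config listener "$POD" -o json \
  | grep -E 'initialStreamWindowSize|initialConnectionWindowSize|perConnectionBufferLimitBytes'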

Has anyone experienced something similar, or does anyone have suggestions for tests we can run to identify the issue and improve the proxy time?

cc @eduardobaitello

adisuissa commented 3 weeks ago

cc @yanavlasov