envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Upstream connections destroyed #12022

Closed · supasteev0 closed this issue 3 years ago

supasteev0 commented 4 years ago

Hi,

I have been experiencing an issue with destroyed upstream connections for a long time now, producing 503 errors and higher response times. After digging through Envoy, application, and Kubernetes node system logs, and tweaking some kernel network settings for quite a while, I could not find any explanation for this issue.

Envoy 1.14.1 is deployed with hostPort and 3 replicas in an EKS 1.16 cluster behind an ALB. The upstream cluster is a headless service (a real-time bidding fasthttp Go application) with 6 pods running. There is an average of 650 keep-alive connections open from clients (service providers) sending approximately 4,000 requests/second.
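
For context, a rough sketch of the deployment shape described above (names, image tag, and port are placeholders, not our exact manifest):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: envoy-edge                  # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels: { app: envoy-edge }
  template:
    metadata:
      labels: { app: envoy-edge }
    spec:
      containers:
        - name: envoy
          image: envoyproxy/envoy:v1.14.1
          ports:
            - containerPort: 80
              hostPort: 80          # exposed on the node and targeted by the ALB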

Below is an example of the Envoy dashboard for this upstream service:

[screenshot: Envoy dashboard for the upstream cluster]

Here is an example of the cluster stats showing destroyed connections:

cluster.rtbd-prod-v2.upstream_cx_active: 289
cluster.rtbd-prod-v2.upstream_cx_close_notify: 5
cluster.rtbd-prod-v2.upstream_cx_connect_attempts_exceeded: 0
cluster.rtbd-prod-v2.upstream_cx_connect_fail: 2
cluster.rtbd-prod-v2.upstream_cx_connect_timeout: 2
cluster.rtbd-prod-v2.upstream_cx_destroy: 239737
cluster.rtbd-prod-v2.upstream_cx_destroy_local: 238275
cluster.rtbd-prod-v2.upstream_cx_destroy_local_with_active_rq: 237866
cluster.rtbd-prod-v2.upstream_cx_destroy_remote: 1462
cluster.rtbd-prod-v2.upstream_cx_destroy_remote_with_active_rq: 74
cluster.rtbd-prod-v2.upstream_cx_destroy_with_active_rq: 237940
cluster.rtbd-prod-v2.upstream_cx_http1_total: 240026

Both the Envoy and application deployments have the following sysctls configured:

securityContext:
        sysctls:
          - name: net.core.somaxconn
            value: "100000"
          - name: net.ipv4.tcp_max_syn_backlog
            value: "100000"
          - name: net.ipv4.ip_local_port_range
            value: "1025 65535"

Example Envoy configuration:

static_resources:
  listeners:
  - address:
      socket_address: { address: 0.0.0.0, port_value: 80 }
    socket_options:
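      # Numeric socket options below correspond to Linux constants:
      #   level 1 = SOL_SOCKET, name 9 = SO_KEEPALIVE (enable TCP keepalive)
      #   level 6 = IPPROTO_TCP, name 4 = TCP_KEEPIDLE (60s), name 5 = TCP_KEEPINTVL (60s)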
      - name: 9
        level: 1
        int_value: 1
        state: STATE_LISTENING
      - name: 4
        level: 6
        int_value: 60
        state: STATE_LISTENING
      - name: 5
        level: 6
        int_value: 60
        state: STATE_LISTENING
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: auto
          route_config:
            name: local_route
            virtual_hosts:
            - name: entrypoint
              domains: ["*"]
              routes:
              - match: { prefix: "/myprefix" }
                route:
                  cluster: second-cluster
              - match: { prefix: "/" }
                route:
                  cluster: first-cluster
                  request_mirror_policies:
                  - cluster: staging-cluster
                    runtime_fraction:
                      default_value:
                        numerator: 100
          http_filters:
          - name: envoy.filters.http.lua
            typed_config:
              "@type": type.googleapis.com/envoy.config.filter.http.lua.v2.Lua
              inline_code: |
                function envoy_on_response(response_handle)
                  if response_handle:headers():get("Connection") then
                    if response_handle:headers():get("Connection") == "close" then
                      response_handle:logInfo("====================> Received Connection closed : removing header")
                      response_handle:headers():remove("Connection")
                    end
                  end
                end
          - name: envoy.filters.http.router
            typed_config: {}
  clusters:
    - name: staging-cluster
      type: STRICT_DNS
      connect_timeout: 1s
      hosts: [{ socket_address: { address: my-staging-cluster.com, port_value: 80 }}]
      upstream_connection_options:
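        # empty tcp_keepalive enables SO_KEEPALIVE with the kernel's default timers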
        tcp_keepalive: {}
      circuit_breakers:
        thresholds:
          - priority: DEFAULT
            max_connections: 1000000000
            max_pending_requests: 1000000000
            max_requests: 1000000000
            max_retries: 1000000000
          - priority: HIGH
            max_connections: 1000000000
            max_pending_requests: 1000000000
            max_requests: 1000000000
            max_retries: 1000000000
    - name: first-cluster
      type: STRICT_DNS
      connect_timeout: 1s
      hosts: [{ socket_address: { address: first-cluster.svc.cluster.local, port_value: 5000 }}]
      upstream_connection_options:
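        # maps to TCP_KEEPCNT=1, TCP_KEEPIDLE=5s, TCP_KEEPINTVL=60s on upstream sockets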
        tcp_keepalive:
          keepalive_probes: 1
          keepalive_time: 5
          keepalive_interval: 60
      circuit_breakers:
        thresholds:
          - priority: DEFAULT
            max_connections: 1000000000
            max_pending_requests: 1000000000
            max_requests: 1000000000
            max_retries: 1000000000
          - priority: HIGH
            max_connections: 1000000000
            max_pending_requests: 1000000000
            max_requests: 1000000000
            max_retries: 1000000000
    - name: second-cluster
      type: LOGICAL_DNS
      connect_timeout: 1s
      hosts: [{ socket_address: { address: second-cluster.svc.cluster.local, port_value: 5000 }}]

admin:
  access_log_path: "/dev/null"
  address:
    socket_address:
      protocol: TCP
      address: 0.0.0.0
      port_value: 9901

I should point out that none of these issues appear with the same traffic load when routing traffic directly to the application. I have also tried disabling mirroring and scaling up the application pods, without any satisfying results.

Would you have any advice on how to solve this issue?

junr03 commented 4 years ago

@snowp do you mind taking a look?

snowp commented 4 years ago

Could you include more upstream stats? It's not clear from the stats why we're doing local resets. If you're able to reproduce this with debug logging enabled that might also prove useful.

supasteev0 commented 4 years ago

Thanks a lot for your reply. Here are the full stats:

cluster.rtbd-prod-v2.assignment_stale: 0
cluster.rtbd-prod-v2.assignment_timeout_received: 0
cluster.rtbd-prod-v2.bind_errors: 0
cluster.rtbd-prod-v2.circuit_breakers.default.cx_open: 0
cluster.rtbd-prod-v2.circuit_breakers.default.cx_pool_open: 0
cluster.rtbd-prod-v2.circuit_breakers.default.rq_open: 0
cluster.rtbd-prod-v2.circuit_breakers.default.rq_pending_open: 0
cluster.rtbd-prod-v2.circuit_breakers.default.rq_retry_open: 0
cluster.rtbd-prod-v2.circuit_breakers.high.cx_open: 0
cluster.rtbd-prod-v2.circuit_breakers.high.cx_pool_open: 0
cluster.rtbd-prod-v2.circuit_breakers.high.rq_open: 0
cluster.rtbd-prod-v2.circuit_breakers.high.rq_pending_open: 0
cluster.rtbd-prod-v2.circuit_breakers.high.rq_retry_open: 0
cluster.rtbd-prod-v2.default.total_match_count: 394098
cluster.rtbd-prod-v2.external.upstream_rq_200: 14368917
cluster.rtbd-prod-v2.external.upstream_rq_204: 253506505
cluster.rtbd-prod-v2.external.upstream_rq_2xx: 267875422
cluster.rtbd-prod-v2.external.upstream_rq_404: 4
cluster.rtbd-prod-v2.external.upstream_rq_4xx: 4
cluster.rtbd-prod-v2.external.upstream_rq_completed: 267875426
cluster.rtbd-prod-v2.http1.dropped_headers_with_underscores: 0
cluster.rtbd-prod-v2.http1.metadata_not_supported_error: 0
cluster.rtbd-prod-v2.http1.requests_rejected_with_underscores_in_headers: 0
cluster.rtbd-prod-v2.http1.response_flood: 0
cluster.rtbd-prod-v2.lb_healthy_panic: 0
cluster.rtbd-prod-v2.lb_local_cluster_not_ok: 0
cluster.rtbd-prod-v2.lb_recalculate_zone_structures: 0
cluster.rtbd-prod-v2.lb_subsets_active: 0
cluster.rtbd-prod-v2.lb_subsets_created: 0
cluster.rtbd-prod-v2.lb_subsets_fallback: 0
cluster.rtbd-prod-v2.lb_subsets_fallback_panic: 0
cluster.rtbd-prod-v2.lb_subsets_removed: 0
cluster.rtbd-prod-v2.lb_subsets_selected: 0
cluster.rtbd-prod-v2.lb_zone_cluster_too_small: 0
cluster.rtbd-prod-v2.lb_zone_no_capacity_left: 0
cluster.rtbd-prod-v2.lb_zone_number_differs: 0
cluster.rtbd-prod-v2.lb_zone_routing_all_directly: 0
cluster.rtbd-prod-v2.lb_zone_routing_cross_zone: 0
cluster.rtbd-prod-v2.lb_zone_routing_sampled: 0
cluster.rtbd-prod-v2.max_host_weight: 1
cluster.rtbd-prod-v2.membership_change: 1
cluster.rtbd-prod-v2.membership_degraded: 0
cluster.rtbd-prod-v2.membership_excluded: 0
cluster.rtbd-prod-v2.membership_healthy: 6
cluster.rtbd-prod-v2.membership_total: 6
cluster.rtbd-prod-v2.original_dst_host_invalid: 0
cluster.rtbd-prod-v2.retry_or_shadow_abandoned: 0
cluster.rtbd-prod-v2.update_attempt: 65683
cluster.rtbd-prod-v2.update_empty: 0
cluster.rtbd-prod-v2.update_failure: 0
cluster.rtbd-prod-v2.update_no_rebuild: 65682
cluster.rtbd-prod-v2.update_success: 65683
cluster.rtbd-prod-v2.upstream_cx_active: 161
cluster.rtbd-prod-v2.upstream_cx_close_notify: 19
cluster.rtbd-prod-v2.upstream_cx_connect_attempts_exceeded: 0
cluster.rtbd-prod-v2.upstream_cx_connect_fail: 2
cluster.rtbd-prod-v2.upstream_cx_connect_timeout: 2
cluster.rtbd-prod-v2.upstream_cx_destroy: 780144
cluster.rtbd-prod-v2.upstream_cx_destroy_local: 778119
cluster.rtbd-prod-v2.upstream_cx_destroy_local_with_active_rq: 778060
cluster.rtbd-prod-v2.upstream_cx_destroy_remote: 2025
cluster.rtbd-prod-v2.upstream_cx_destroy_remote_with_active_rq: 34
cluster.rtbd-prod-v2.upstream_cx_destroy_with_active_rq: 778094
cluster.rtbd-prod-v2.upstream_cx_http1_total: 780305
cluster.rtbd-prod-v2.upstream_cx_http2_total: 0
cluster.rtbd-prod-v2.upstream_cx_idle_timeout: 46
cluster.rtbd-prod-v2.upstream_cx_max_requests: 0
cluster.rtbd-prod-v2.upstream_cx_none_healthy: 0
cluster.rtbd-prod-v2.upstream_cx_overflow: 0
cluster.rtbd-prod-v2.upstream_cx_pool_overflow: 0
cluster.rtbd-prod-v2.upstream_cx_protocol_error: 0
cluster.rtbd-prod-v2.upstream_cx_rx_bytes_buffered: 10154
cluster.rtbd-prod-v2.upstream_cx_rx_bytes_total: 20378386859
cluster.rtbd-prod-v2.upstream_cx_total: 780305
cluster.rtbd-prod-v2.upstream_cx_tx_bytes_buffered: 0
cluster.rtbd-prod-v2.upstream_cx_tx_bytes_total: 460368120574
cluster.rtbd-prod-v2.upstream_flow_control_backed_up_total: 0
cluster.rtbd-prod-v2.upstream_flow_control_drained_total: 0
cluster.rtbd-prod-v2.upstream_flow_control_paused_reading_total: 0
cluster.rtbd-prod-v2.upstream_flow_control_resumed_reading_total: 0
cluster.rtbd-prod-v2.upstream_internal_redirect_failed_total: 0
cluster.rtbd-prod-v2.upstream_internal_redirect_succeeded_total: 0
cluster.rtbd-prod-v2.upstream_rq_200: 14368917
cluster.rtbd-prod-v2.upstream_rq_204: 253506505
cluster.rtbd-prod-v2.upstream_rq_2xx: 267875422
cluster.rtbd-prod-v2.upstream_rq_404: 4
cluster.rtbd-prod-v2.upstream_rq_4xx: 4
cluster.rtbd-prod-v2.upstream_rq_active: 1
cluster.rtbd-prod-v2.upstream_rq_cancelled: 243731
cluster.rtbd-prod-v2.upstream_rq_completed: 267875426
cluster.rtbd-prod-v2.upstream_rq_maintenance_mode: 0
cluster.rtbd-prod-v2.upstream_rq_pending_active: 0
cluster.rtbd-prod-v2.upstream_rq_pending_failure_eject: 0
cluster.rtbd-prod-v2.upstream_rq_pending_overflow: 0
cluster.rtbd-prod-v2.upstream_rq_pending_total: 1103855
cluster.rtbd-prod-v2.upstream_rq_per_try_timeout: 0
cluster.rtbd-prod-v2.upstream_rq_retry: 34
cluster.rtbd-prod-v2.upstream_rq_retry_limit_exceeded: 0
cluster.rtbd-prod-v2.upstream_rq_retry_overflow: 0
cluster.rtbd-prod-v2.upstream_rq_retry_success: 34
cluster.rtbd-prod-v2.upstream_rq_rx_reset: 0
cluster.rtbd-prod-v2.upstream_rq_timeout: 0
cluster.rtbd-prod-v2.upstream_rq_total: 268653513
cluster.rtbd-prod-v2.upstream_rq_tx_reset: 0
cluster.rtbd-prod-v2.version: 0
cluster_manager.active_clusters: 3
cluster_manager.cluster_added: 3
cluster_manager.cluster_modified: 0
cluster_manager.cluster_removed: 0
cluster_manager.cluster_updated: 8
cluster_manager.cluster_updated_via_merge: 0
cluster_manager.update_merge_cancelled: 0
cluster_manager.update_out_of_merge_window: 0
cluster_manager.warming_clusters: 0
filesystem.flushed_by_timer: 32907
filesystem.reopen_failed: 0
filesystem.write_buffered: 436794
filesystem.write_completed: 32910
filesystem.write_failed: 0
filesystem.write_total_buffered: 137
http.admin.downstream_cx_active: 4
http.admin.downstream_cx_delayed_close_timeout: 0
http.admin.downstream_cx_destroy: 370985
http.admin.downstream_cx_destroy_active_rq: 0
http.admin.downstream_cx_destroy_local: 370980
http.admin.downstream_cx_destroy_local_active_rq: 0
http.admin.downstream_cx_destroy_remote: 5
http.admin.downstream_cx_destroy_remote_active_rq: 0
http.admin.downstream_cx_drain_close: 0
http.admin.downstream_cx_http1_active: 4
http.admin.downstream_cx_http1_total: 370989
http.admin.downstream_cx_http2_active: 0
http.admin.downstream_cx_http2_total: 0
http.admin.downstream_cx_http3_active: 0
http.admin.downstream_cx_http3_total: 0
http.admin.downstream_cx_idle_timeout: 0
http.admin.downstream_cx_max_duration_reached: 0
http.admin.downstream_cx_overload_disable_keepalive: 0
http.admin.downstream_cx_protocol_error: 0
http.admin.downstream_cx_rx_bytes_buffered: 4520
http.admin.downstream_cx_rx_bytes_total: 66989306
http.admin.downstream_cx_ssl_active: 0
http.admin.downstream_cx_ssl_total: 0
http.admin.downstream_cx_total: 370989
http.admin.downstream_cx_tx_bytes_buffered: 0
http.admin.downstream_cx_tx_bytes_total: 6492139418
http.admin.downstream_cx_upgrades_active: 0
http.admin.downstream_cx_upgrades_total: 0
http.admin.downstream_flow_control_paused_reading_total: 0
http.admin.downstream_flow_control_resumed_reading_total: 0
http.admin.downstream_rq_1xx: 0
http.admin.downstream_rq_2xx: 436794
http.admin.downstream_rq_3xx: 0
http.admin.downstream_rq_4xx: 0
http.admin.downstream_rq_5xx: 0
http.admin.downstream_rq_active: 1
http.admin.downstream_rq_completed: 436794
http.admin.downstream_rq_http1_total: 436795
http.admin.downstream_rq_http2_total: 0
http.admin.downstream_rq_http3_total: 0
http.admin.downstream_rq_idle_timeout: 0
http.admin.downstream_rq_max_duration_reached: 0
http.admin.downstream_rq_non_relative_path: 0
http.admin.downstream_rq_overload_close: 0
http.admin.downstream_rq_response_before_rq_complete: 0
http.admin.downstream_rq_rx_reset: 0
http.admin.downstream_rq_timeout: 0
http.admin.downstream_rq_too_large: 0
http.admin.downstream_rq_total: 436795
http.admin.downstream_rq_tx_reset: 0
http.admin.downstream_rq_ws_on_non_ws_route: 0
http.admin.rs_too_large: 0
http.async-client.no_cluster: 0
http.async-client.no_route: 0
http.async-client.rq_direct_response: 0
http.async-client.rq_redirect: 0
http.async-client.rq_reset_after_downstream_response_started: 0
http.async-client.rq_retry_skipped_request_not_complete: 0
http.async-client.rq_total: 268897202
http.ingress_http.downstream_cx_active: 208
http.ingress_http.downstream_cx_delayed_close_timeout: 0
http.ingress_http.downstream_cx_destroy: 1319048
http.ingress_http.downstream_cx_destroy_active_rq: 1021788
http.ingress_http.downstream_cx_destroy_local: 0
http.ingress_http.downstream_cx_destroy_local_active_rq: 0
http.ingress_http.downstream_cx_destroy_remote: 1319048
http.ingress_http.downstream_cx_destroy_remote_active_rq: 1021788
http.ingress_http.downstream_cx_drain_close: 0
http.ingress_http.downstream_cx_http1_active: 208
http.ingress_http.downstream_cx_http1_total: 1319256
http.ingress_http.downstream_cx_http2_active: 0
http.ingress_http.downstream_cx_http2_total: 0
http.ingress_http.downstream_cx_http3_active: 0
http.ingress_http.downstream_cx_http3_total: 0
http.ingress_http.downstream_cx_idle_timeout: 0
http.ingress_http.downstream_cx_max_duration_reached: 0
http.ingress_http.downstream_cx_overload_disable_keepalive: 0
http.ingress_http.downstream_cx_protocol_error: 0
http.ingress_http.downstream_cx_rx_bytes_buffered: 311614
http.ingress_http.downstream_cx_rx_bytes_total: 437465468332
http.ingress_http.downstream_cx_ssl_active: 0
http.ingress_http.downstream_cx_ssl_total: 0
http.ingress_http.downstream_cx_total: 1319256
http.ingress_http.downstream_cx_tx_bytes_buffered: 0
http.ingress_http.downstream_cx_tx_bytes_total: 33605216043
http.ingress_http.downstream_cx_upgrades_active: 0
http.ingress_http.downstream_cx_upgrades_total: 0
http.ingress_http.downstream_flow_control_paused_reading_total: 0
http.ingress_http.downstream_flow_control_resumed_reading_total: 0
http.ingress_http.downstream_rq_1xx: 0
http.ingress_http.downstream_rq_2xx: 268631702
http.ingress_http.downstream_rq_3xx: 0
http.ingress_http.downstream_rq_4xx: 4
http.ingress_http.downstream_rq_5xx: 3
http.ingress_http.downstream_rq_active: 1
http.ingress_http.downstream_rq_completed: 268631709
http.ingress_http.downstream_rq_http1_total: 269653498
http.ingress_http.downstream_rq_http2_total: 0
http.ingress_http.downstream_rq_http3_total: 0
http.ingress_http.downstream_rq_idle_timeout: 0
http.ingress_http.downstream_rq_max_duration_reached: 0
http.ingress_http.downstream_rq_non_relative_path: 0
http.ingress_http.downstream_rq_overload_close: 0
http.ingress_http.downstream_rq_response_before_rq_complete: 8
http.ingress_http.downstream_rq_rx_reset: 1021796
http.ingress_http.downstream_rq_timeout: 0
http.ingress_http.downstream_rq_too_large: 0
http.ingress_http.downstream_rq_total: 269653498
http.ingress_http.downstream_rq_tx_reset: 0
http.ingress_http.downstream_rq_ws_on_non_ws_route: 0
http.ingress_http.no_cluster: 0
http.ingress_http.no_route: 0
http.ingress_http.rq_direct_response: 0
http.ingress_http.rq_redirect: 0
http.ingress_http.rq_reset_after_downstream_response_started: 0
http.ingress_http.rq_retry_skipped_request_not_complete: 0
http.ingress_http.rq_total: 269653498
http.ingress_http.rs_too_large: 0
http.ingress_http.tracing.client_enabled: 0
http.ingress_http.tracing.health_check: 0
http.ingress_http.tracing.not_traceable: 0
http.ingress_http.tracing.random_sampling: 0
http.ingress_http.tracing.service_forced: 0
http.ingress_http.user_agent.ios.downstream_cx_destroy_remote_active_rq: 4
http.ingress_http.user_agent.ios.downstream_cx_total: 4
http.ingress_http.user_agent.ios.downstream_rq_total: 4
http1.dropped_headers_with_underscores: 0
http1.metadata_not_supported_error: 0
http1.requests_rejected_with_underscores_in_headers: 0
http1.response_flood: 0
listener.0.0.0.0_80.downstream_cx_active: 208
listener.0.0.0.0_80.downstream_cx_destroy: 1319048
listener.0.0.0.0_80.downstream_cx_total: 1319256
listener.0.0.0.0_80.downstream_pre_cx_active: 0
listener.0.0.0.0_80.downstream_pre_cx_timeout: 0
listener.0.0.0.0_80.http.ingress_http.downstream_rq_1xx: 0
listener.0.0.0.0_80.http.ingress_http.downstream_rq_2xx: 268631702
listener.0.0.0.0_80.http.ingress_http.downstream_rq_3xx: 0
listener.0.0.0.0_80.http.ingress_http.downstream_rq_4xx: 4
listener.0.0.0.0_80.http.ingress_http.downstream_rq_5xx: 3
listener.0.0.0.0_80.http.ingress_http.downstream_rq_completed: 268631709
listener.0.0.0.0_80.no_filter_chain_match: 0
listener.0.0.0.0_80.worker_0.downstream_cx_active: 92
listener.0.0.0.0_80.worker_0.downstream_cx_total: 658828
listener.0.0.0.0_80.worker_1.downstream_cx_active: 116
listener.0.0.0.0_80.worker_1.downstream_cx_total: 660428
listener.admin.downstream_cx_active: 4
listener.admin.downstream_cx_destroy: 370985
listener.admin.downstream_cx_total: 370989
listener.admin.downstream_pre_cx_active: 0
listener.admin.downstream_pre_cx_timeout: 0
listener.admin.http.admin.downstream_rq_1xx: 0
listener.admin.http.admin.downstream_rq_2xx: 436794
listener.admin.http.admin.downstream_rq_3xx: 0
listener.admin.http.admin.downstream_rq_4xx: 0
listener.admin.http.admin.downstream_rq_5xx: 0
listener.admin.http.admin.downstream_rq_completed: 436794
listener.admin.main_thread.downstream_cx_active: 4
listener.admin.main_thread.downstream_cx_total: 370989
listener.admin.no_filter_chain_match: 0
listener_manager.listener_added: 1
listener_manager.listener_create_failure: 0
listener_manager.listener_create_success: 2
listener_manager.listener_modified: 0
listener_manager.listener_removed: 0
listener_manager.listener_stopped: 0
listener_manager.total_listeners_active: 1
listener_manager.total_listeners_draining: 0
listener_manager.total_listeners_warming: 0
listener_manager.workers_started: 1
runtime.admin_overrides_active: 0
runtime.deprecated_feature_use: 0
runtime.load_error: 0
runtime.load_success: 1
runtime.num_keys: 0
runtime.num_layers: 2
runtime.override_dir_exists: 0
runtime.override_dir_not_exists: 1
server.concurrency: 2
server.days_until_first_cert_expiring: 2147483647
server.debug_assertion_failures: 0
server.dynamic_unknown_fields: 0
server.hot_restart_epoch: 0
server.hot_restart_generation: 1
server.live: 1
server.main_thread.watchdog_mega_miss: 0
server.main_thread.watchdog_miss: 3378
server.memory_allocated: 11590704
server.memory_heap_size: 63963136
server.memory_physical_size: 66846720
server.parent_connections: 0
server.state: 0
server.static_unknown_fields: 0
server.stats_recent_lookups: 0
server.total_connections: 208
server.uptime: 329090
server.version: 3474644
server.watchdog_mega_miss: 1
server.watchdog_miss: 16783
server.worker_0.watchdog_mega_miss: 1
server.worker_0.watchdog_miss: 6729
server.worker_1.watchdog_mega_miss: 0
server.worker_1.watchdog_miss: 6676
vhost.entrypoint.vcluster.other.upstream_rq_retry: 0
vhost.entrypoint.vcluster.other.upstream_rq_retry_limit_exceeded: 0
vhost.entrypoint.vcluster.other.upstream_rq_retry_overflow: 0
vhost.entrypoint.vcluster.other.upstream_rq_retry_success: 0
vhost.entrypoint.vcluster.other.upstream_rq_timeout: 0
vhost.entrypoint.vcluster.other.upstream_rq_total: 0
cluster.rtbd-prod-v2.external.upstream_rq_time: P0(0.0,0.0) P25(0.0,0.0) P50(0.0,0.0) P75(0.0,0.0) P90(0.0,1.07174) P95(0.0,3.0639) P99(1.09034,34.9734) P99.5(2.06162,68.9649) P99.9(7.08505,178.449) P100(14.0,2300.0)
cluster.rtbd-prod-v2.upstream_cx_connect_ms: P0(nan,0.0) P25(nan,19.3388) P50(nan,34.4693) P75(nan,60.2587) P90(nan,99.4070) P95(nan,142.931) P99(nan,271.668) P99.5(nan,327.472) P99.9(nan,537.135) P100(nan,1700.0)
cluster.rtbd-prod-v2.upstream_cx_length_ms: P0(nan,0.0) P25(nan,36871.2) P50(nan,54393.5) P75(nan,109632.) P90(nan,169333.) P95(nan,214317.) P99(nan,327479.) P99.5(nan,374922.) P99.9(nan,489505.) P100(nan,3.6e+06)
cluster.rtbd-prod-v2.upstream_rq_time: P0(0.0,0.0) P25(0.0,0.0) P50(0.0,0.0) P75(0.0,0.0) P90(0.0,1.07174) P95(0.0,3.0639) P99(1.09034,34.9734) P99.5(2.06162,68.9649) P99.9(7.08505,178.449) P100(14.0,2300.0)
cluster.rtbd-staging.internal.upstream_rq_time: P0(21.0,21.0) P25(24.4953,25.2731) P50(25.5169,26.2164) P75(26.5476,27.1985) P90(27.3783,28.4119) P95(27.786,30.424) P99(28.9561,69.5672) P99.5(30.1313,113.016) P99.9(35.1367,334.756) P100(39.0,15000.0)
cluster.rtbd-staging.upstream_cx_connect_ms: P0(nan,0.0) P25(nan,31.7553) P50(nan,43.2167) P75(nan,63.5006) P90(nan,100.979) P95(nan,141.379) P99(nan,247.293) P99.5(nan,294.011) P99.9(nan,490.818) P100(nan,1800.0)
cluster.rtbd-staging.upstream_cx_length_ms: P0(nan,760.0) P25(nan,93634.2) P50(nan,184512.) P75(nan,366540.) P90(nan,620720.) P95(nan,821345.) P99(nan,1.28778e+06) P99.5(nan,1.48357e+06) P99.9(nan,1.94783e+06) P100(nan,3.4e+06)
cluster.rtbd-staging.upstream_rq_time: P0(21.0,21.0) P25(24.4953,25.2731) P50(25.5169,26.2164) P75(26.5476,27.1985) P90(27.3783,28.4119) P95(27.786,30.424) P99(28.9561,69.5672) P99.5(30.1313,113.016) P99.9(35.1367,334.756) P100(39.0,15000.0)
cluster.wind-prod-v2.external.upstream_rq_time: P0(0.0,0.0) P25(0.0,0.0) P50(0.0,0.0) P75(0.0,0.0) P90(0.0,1.0386) P95(0.0,3.03342) P99(0.0,32.8201) P99.5(0.0,65.3433) P99.9(0.0,199.804) P100(0.0,1600.0)
cluster.wind-prod-v2.upstream_cx_connect_ms: P0(nan,0.0) P25(nan,0.0) P50(nan,16.2857) P75(nan,42.5) P90(nan,74.1) P95(nan,101.0) P99(nan,263.2) P99.5(nan,277.2) P99.9(nan,734.32) P100(nan,740.0)
cluster.wind-prod-v2.upstream_cx_length_ms: P0(nan,300000.) P25(nan,858750.) P50(nan,1.65909e+06) P75(nan,2.84062e+06) P90(nan,4.39e+06) P95(nan,5.985e+06) P99(nan,1.05617e+07) P99.5(nan,1.10617e+07) P99.9(nan,1.18123e+07) P100(nan,1.2e+07)
cluster.wind-prod-v2.upstream_rq_time: P0(0.0,0.0) P25(0.0,0.0) P50(0.0,0.0) P75(0.0,0.0) P90(0.0,1.0386) P95(0.0,3.03342) P99(0.0,32.8201) P99.5(0.0,65.3433) P99.9(0.0,199.804) P100(0.0,1600.0)
http.admin.downstream_cx_length_ms: P0(0.0,0.0) P25(0.0,0.0) P50(0.0,0.0) P75(0.0,0.0) P90(0.0,0.0) P95(0.0,0.0) P99(0.0,1.00791) P99.5(0.0,2.05736) P99.9(0.0,27.224) P100(0.0,2.5e+08)
http.admin.downstream_rq_time: P0(0.0,0.0) P25(0.0,0.0) P50(0.0,0.0) P75(0.0,0.0) P90(4.05,3.09519) P95(4.075,4.07041) P99(4.095,13.5023) P99.5(4.0975,25.0353) P99.9(4.0995,205.751) P100(4.1,490.0)
http.ingress_http.downstream_cx_length_ms: P0(nan,0.0) P25(nan,3503.85) P50(nan,42373.6) P75(nan,86809.1) P90(nan,148048.) P95(nan,201285.) P99(nan,317895.) P99.5(nan,369358.) P99.9(nan,491335.) P100(nan,1.4e+06)
http.ingress_http.downstream_rq_time: P0(0.0,0.0) P25(0.0,0.0) P50(0.0,0.0) P75(0.0,0.0) P90(1.00138,1.08627) P95(1.05437,3.07838) P99(1.09676,35.2558) P99.5(2.06885,69.2689) P99.9(7.08475,179.255) P100(14.0,2300.0)
http.ingress_http.user_agent.ios.downstream_cx_length_ms: P0(nan,70000.0) P25(nan,71000.0) P50(nan,74000.0) P75(nan,85000.0) P90(nan,166000.) P95(nan,168000.) P99(nan,169600.) P99.5(nan,169800.) P99.9(nan,169960.) P100(nan,170000.)
listener.0.0.0.0_80.downstream_cx_length_ms: P0(nan,0.0) P25(nan,3503.85) P50(nan,42373.5) P75(nan,86809.1) P90(nan,148048.) P95(nan,201285.) P99(nan,317895.) P99.5(nan,369358.) P99.9(nan,491335.) P100(nan,1.4e+06)
listener.admin.downstream_cx_length_ms: P0(0.0,0.0) P25(0.0,0.0) P50(0.0,0.0) P75(0.0,0.0) P90(0.0,0.0) P95(0.0,0.0) P99(0.0,1.01237) P99.5(0.0,2.06713) P99.9(0.0,28.1129) P100(0.0,2.5e+08)
server.initialization_time_ms: P0(nan,14.0) P25(nan,14.25) P50(nan,14.5) P75(nan,14.75) P90(nan,14.9) P95(nan,14.95) P99(nan,14.99) P99.5(nan,14.995) P99.9(nan,14.999) P100(nan,15.0)

Unfortunately, I'm only able to reproduce this behavior in production, and when enabling the trace log level I get a lot of log errors. I was still able to gather some logs that might prove useful:

[2020-07-10 08:36:47.374][11][trace][main] [source/common/event/dispatcher_impl.cc:163] item added to deferred deletion list (size=1094)
[2020-07-10 08:36:47.374][11][trace][main] [source/common/event/dispatcher_impl.cc:163] item added to deferred deletion list (size=1093)
[2020-07-10 08:36:47.374][11][debug][pool] [source/common/http/conn_pool_base.cc:265] [C61207] client disconnected, failure reason: 
[2020-07-10 08:36:47.374][11][debug][client] [source/common/http/codec_client.cc:91] [C61207] disconnect. resetting 0 pending requests
[2020-07-10 08:36:47.374][11][debug][connection] [source/common/network/connection_impl.cc:200] [C61207] closing socket: 1
[2020-07-10 08:36:47.374][11][debug][connection] [source/common/network/connection_impl.cc:109] [C61207] closing data_to_write=0 type=1
[2020-07-10 08:36:47.374][11][trace][main] [source/common/event/dispatcher_impl.cc:163] item added to deferred deletion list (size=1092)
[2020-07-10 08:36:47.374][11][debug][router] [source/common/router/upstream_request.cc:263] [C60447][S3734311586874250871] resetting pool request
[2020-07-10 08:36:47.374][11][debug][http] [source/common/http/conn_manager_impl.cc:1936] [C60447][S3734311586874250871] stream reset
[2020-07-10 08:36:47.374][11][debug][connection] [source/common/network/connection_impl.cc:200] [C60447] closing socket: 0
[2020-07-10 08:36:47.374][11][debug][connection] [source/common/network/connection_impl.cc:514] [C60447] remote early close
[2020-07-10 08:36:47.374][11][trace][connection] [source/common/network/connection_impl.cc:493] [C60447] socket event: 6
[2020-07-10 08:36:47.374][11][trace][connection] [source/common/network/raw_buffer_socket.cc:68] [C61261] write returns: 1365
[2020-07-10 08:36:47.374][11][trace][connection] [source/common/network/connection_impl.cc:581] [C61261] write ready
[2020-07-10 08:36:47.436][12][trace][connection] [source/common/network/raw_buffer_socket.cc:39] [C61266] read error: Resource temporarily unavailable
[2020-07-10 08:36:47.437][12][debug][pool] [source/common/http/conn_pool_base.cc:265] [C61256] client disconnected, failure reason:
[2020-07-10 08:36:47.437][12][debug][client] [source/common/http/codec_client.cc:91] [C61256] disconnect. resetting 0 pending requests
[2020-07-10 08:36:47.437][12][debug][connection] [source/common/network/connection_impl.cc:200] [C61256] closing socket: 1

If you need more logs, I will try enabling only the debug log level again.
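
For reference, a sketch of how the log volume could be limited with Envoy's component-level logging instead of global trace logging (shown as container args; the manifest layout and config path are assumptions):

containers:
  - name: envoy
    image: envoyproxy/envoy:v1.14.1
    args:
      - --config-path
      - /etc/envoy/envoy.yaml        # assumed config path
      - --log-level
      - info
      # raise verbosity only for the loggers seen in the output above
      - --component-log-level
      - pool:debug,connection:debug,client:debug,router:debug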

supasteev0 commented 4 years ago

Hi. Any chance someone would have some time to take a look? Thanks a lot.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

supasteev0 commented 4 years ago

/remove-lifecycle stale

supasteev0 commented 4 years ago

Hello, we are still having issues with a high number of dropped connections. I have tried many different configuration scenarios, but no luck so far.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

supasteev0 commented 3 years ago

/remove-lifecycle stale

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 3 years ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.