supasteev0 closed this issue 3 years ago
@snowp do you mind taking a look?
Could you include more upstream stats? It's not clear from the stats why we're doing local resets. If you're able to reproduce this with debug logging enabled that might also prove useful.
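As a side note for readers, narrowing a full `/stats` dump down to the reset- and destroy-related counters makes this kind of triage easier. A minimal sketch — the stats snippet is an abbreviated copy written to a scratch file so the filter can be demonstrated; in practice you would fetch the dump from the admin interface (e.g. `curl -s http://localhost:9901/stats > stats.txt`, where `localhost:9901` is an assumed admin address):

```shell
# Abbreviated stats dump (values copied from this thread) written to a
# scratch file; in a real investigation, fetch this from /stats instead.
cat > stats.txt <<'EOF'
cluster.rtbd-prod-v2.upstream_cx_destroy: 780144
cluster.rtbd-prod-v2.upstream_cx_destroy_local: 778119
cluster.rtbd-prod-v2.upstream_rq_total: 268653513
cluster.rtbd-prod-v2.upstream_rq_rx_reset: 0
cluster.rtbd-prod-v2.upstream_rq_tx_reset: 0
EOF

# Keep only the connection-destroy and request-reset counters.
grep -E 'upstream_cx_destroy|upstream_rq_(rx|tx)_reset' stats.txt
```

The same `grep` pattern works directly against a live dump piped from `curl`.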
Thanks a lot for your reply. Here are the full stats:
cluster.rtbd-prod-v2.assignment_stale: 0
cluster.rtbd-prod-v2.assignment_timeout_received: 0
cluster.rtbd-prod-v2.bind_errors: 0
cluster.rtbd-prod-v2.circuit_breakers.default.cx_open: 0
cluster.rtbd-prod-v2.circuit_breakers.default.cx_pool_open: 0
cluster.rtbd-prod-v2.circuit_breakers.default.rq_open: 0
cluster.rtbd-prod-v2.circuit_breakers.default.rq_pending_open: 0
cluster.rtbd-prod-v2.circuit_breakers.default.rq_retry_open: 0
cluster.rtbd-prod-v2.circuit_breakers.high.cx_open: 0
cluster.rtbd-prod-v2.circuit_breakers.high.cx_pool_open: 0
cluster.rtbd-prod-v2.circuit_breakers.high.rq_open: 0
cluster.rtbd-prod-v2.circuit_breakers.high.rq_pending_open: 0
cluster.rtbd-prod-v2.circuit_breakers.high.rq_retry_open: 0
cluster.rtbd-prod-v2.default.total_match_count: 394098
cluster.rtbd-prod-v2.external.upstream_rq_200: 14368917
cluster.rtbd-prod-v2.external.upstream_rq_204: 253506505
cluster.rtbd-prod-v2.external.upstream_rq_2xx: 267875422
cluster.rtbd-prod-v2.external.upstream_rq_404: 4
cluster.rtbd-prod-v2.external.upstream_rq_4xx: 4
cluster.rtbd-prod-v2.external.upstream_rq_completed: 267875426
cluster.rtbd-prod-v2.http1.dropped_headers_with_underscores: 0
cluster.rtbd-prod-v2.http1.metadata_not_supported_error: 0
cluster.rtbd-prod-v2.http1.requests_rejected_with_underscores_in_headers: 0
cluster.rtbd-prod-v2.http1.response_flood: 0
cluster.rtbd-prod-v2.lb_healthy_panic: 0
cluster.rtbd-prod-v2.lb_local_cluster_not_ok: 0
cluster.rtbd-prod-v2.lb_recalculate_zone_structures: 0
cluster.rtbd-prod-v2.lb_subsets_active: 0
cluster.rtbd-prod-v2.lb_subsets_created: 0
cluster.rtbd-prod-v2.lb_subsets_fallback: 0
cluster.rtbd-prod-v2.lb_subsets_fallback_panic: 0
cluster.rtbd-prod-v2.lb_subsets_removed: 0
cluster.rtbd-prod-v2.lb_subsets_selected: 0
cluster.rtbd-prod-v2.lb_zone_cluster_too_small: 0
cluster.rtbd-prod-v2.lb_zone_no_capacity_left: 0
cluster.rtbd-prod-v2.lb_zone_number_differs: 0
cluster.rtbd-prod-v2.lb_zone_routing_all_directly: 0
cluster.rtbd-prod-v2.lb_zone_routing_cross_zone: 0
cluster.rtbd-prod-v2.lb_zone_routing_sampled: 0
cluster.rtbd-prod-v2.max_host_weight: 1
cluster.rtbd-prod-v2.membership_change: 1
cluster.rtbd-prod-v2.membership_degraded: 0
cluster.rtbd-prod-v2.membership_excluded: 0
cluster.rtbd-prod-v2.membership_healthy: 6
cluster.rtbd-prod-v2.membership_total: 6
cluster.rtbd-prod-v2.original_dst_host_invalid: 0
cluster.rtbd-prod-v2.retry_or_shadow_abandoned: 0
cluster.rtbd-prod-v2.update_attempt: 65683
cluster.rtbd-prod-v2.update_empty: 0
cluster.rtbd-prod-v2.update_failure: 0
cluster.rtbd-prod-v2.update_no_rebuild: 65682
cluster.rtbd-prod-v2.update_success: 65683
cluster.rtbd-prod-v2.upstream_cx_active: 161
cluster.rtbd-prod-v2.upstream_cx_close_notify: 19
cluster.rtbd-prod-v2.upstream_cx_connect_attempts_exceeded: 0
cluster.rtbd-prod-v2.upstream_cx_connect_fail: 2
cluster.rtbd-prod-v2.upstream_cx_connect_timeout: 2
cluster.rtbd-prod-v2.upstream_cx_destroy: 780144
cluster.rtbd-prod-v2.upstream_cx_destroy_local: 778119
cluster.rtbd-prod-v2.upstream_cx_destroy_local_with_active_rq: 778060
cluster.rtbd-prod-v2.upstream_cx_destroy_remote: 2025
cluster.rtbd-prod-v2.upstream_cx_destroy_remote_with_active_rq: 34
cluster.rtbd-prod-v2.upstream_cx_destroy_with_active_rq: 778094
cluster.rtbd-prod-v2.upstream_cx_http1_total: 780305
cluster.rtbd-prod-v2.upstream_cx_http2_total: 0
cluster.rtbd-prod-v2.upstream_cx_idle_timeout: 46
cluster.rtbd-prod-v2.upstream_cx_max_requests: 0
cluster.rtbd-prod-v2.upstream_cx_none_healthy: 0
cluster.rtbd-prod-v2.upstream_cx_overflow: 0
cluster.rtbd-prod-v2.upstream_cx_pool_overflow: 0
cluster.rtbd-prod-v2.upstream_cx_protocol_error: 0
cluster.rtbd-prod-v2.upstream_cx_rx_bytes_buffered: 10154
cluster.rtbd-prod-v2.upstream_cx_rx_bytes_total: 20378386859
cluster.rtbd-prod-v2.upstream_cx_total: 780305
cluster.rtbd-prod-v2.upstream_cx_tx_bytes_buffered: 0
cluster.rtbd-prod-v2.upstream_cx_tx_bytes_total: 460368120574
cluster.rtbd-prod-v2.upstream_flow_control_backed_up_total: 0
cluster.rtbd-prod-v2.upstream_flow_control_drained_total: 0
cluster.rtbd-prod-v2.upstream_flow_control_paused_reading_total: 0
cluster.rtbd-prod-v2.upstream_flow_control_resumed_reading_total: 0
cluster.rtbd-prod-v2.upstream_internal_redirect_failed_total: 0
cluster.rtbd-prod-v2.upstream_internal_redirect_succeeded_total: 0
cluster.rtbd-prod-v2.upstream_rq_200: 14368917
cluster.rtbd-prod-v2.upstream_rq_204: 253506505
cluster.rtbd-prod-v2.upstream_rq_2xx: 267875422
cluster.rtbd-prod-v2.upstream_rq_404: 4
cluster.rtbd-prod-v2.upstream_rq_4xx: 4
cluster.rtbd-prod-v2.upstream_rq_active: 1
cluster.rtbd-prod-v2.upstream_rq_cancelled: 243731
cluster.rtbd-prod-v2.upstream_rq_completed: 267875426
cluster.rtbd-prod-v2.upstream_rq_maintenance_mode: 0
cluster.rtbd-prod-v2.upstream_rq_pending_active: 0
cluster.rtbd-prod-v2.upstream_rq_pending_failure_eject: 0
cluster.rtbd-prod-v2.upstream_rq_pending_overflow: 0
cluster.rtbd-prod-v2.upstream_rq_pending_total: 1103855
cluster.rtbd-prod-v2.upstream_rq_per_try_timeout: 0
cluster.rtbd-prod-v2.upstream_rq_retry: 34
cluster.rtbd-prod-v2.upstream_rq_retry_limit_exceeded: 0
cluster.rtbd-prod-v2.upstream_rq_retry_overflow: 0
cluster.rtbd-prod-v2.upstream_rq_retry_success: 34
cluster.rtbd-prod-v2.upstream_rq_rx_reset: 0
cluster.rtbd-prod-v2.upstream_rq_timeout: 0
cluster.rtbd-prod-v2.upstream_rq_total: 268653513
cluster.rtbd-prod-v2.upstream_rq_tx_reset: 0
cluster.rtbd-prod-v2.version: 0
cluster_manager.active_clusters: 3
cluster_manager.cluster_added: 3
cluster_manager.cluster_modified: 0
cluster_manager.cluster_removed: 0
cluster_manager.cluster_updated: 8
cluster_manager.cluster_updated_via_merge: 0
cluster_manager.update_merge_cancelled: 0
cluster_manager.update_out_of_merge_window: 0
cluster_manager.warming_clusters: 0
filesystem.flushed_by_timer: 32907
filesystem.reopen_failed: 0
filesystem.write_buffered: 436794
filesystem.write_completed: 32910
filesystem.write_failed: 0
filesystem.write_total_buffered: 137
http.admin.downstream_cx_active: 4
http.admin.downstream_cx_delayed_close_timeout: 0
http.admin.downstream_cx_destroy: 370985
http.admin.downstream_cx_destroy_active_rq: 0
http.admin.downstream_cx_destroy_local: 370980
http.admin.downstream_cx_destroy_local_active_rq: 0
http.admin.downstream_cx_destroy_remote: 5
http.admin.downstream_cx_destroy_remote_active_rq: 0
http.admin.downstream_cx_drain_close: 0
http.admin.downstream_cx_http1_active: 4
http.admin.downstream_cx_http1_total: 370989
http.admin.downstream_cx_http2_active: 0
http.admin.downstream_cx_http2_total: 0
http.admin.downstream_cx_http3_active: 0
http.admin.downstream_cx_http3_total: 0
http.admin.downstream_cx_idle_timeout: 0
http.admin.downstream_cx_max_duration_reached: 0
http.admin.downstream_cx_overload_disable_keepalive: 0
http.admin.downstream_cx_protocol_error: 0
http.admin.downstream_cx_rx_bytes_buffered: 4520
http.admin.downstream_cx_rx_bytes_total: 66989306
http.admin.downstream_cx_ssl_active: 0
http.admin.downstream_cx_ssl_total: 0
http.admin.downstream_cx_total: 370989
http.admin.downstream_cx_tx_bytes_buffered: 0
http.admin.downstream_cx_tx_bytes_total: 6492139418
http.admin.downstream_cx_upgrades_active: 0
http.admin.downstream_cx_upgrades_total: 0
http.admin.downstream_flow_control_paused_reading_total: 0
http.admin.downstream_flow_control_resumed_reading_total: 0
http.admin.downstream_rq_1xx: 0
http.admin.downstream_rq_2xx: 436794
http.admin.downstream_rq_3xx: 0
http.admin.downstream_rq_4xx: 0
http.admin.downstream_rq_5xx: 0
http.admin.downstream_rq_active: 1
http.admin.downstream_rq_completed: 436794
http.admin.downstream_rq_http1_total: 436795
http.admin.downstream_rq_http2_total: 0
http.admin.downstream_rq_http3_total: 0
http.admin.downstream_rq_idle_timeout: 0
http.admin.downstream_rq_max_duration_reached: 0
http.admin.downstream_rq_non_relative_path: 0
http.admin.downstream_rq_overload_close: 0
http.admin.downstream_rq_response_before_rq_complete: 0
http.admin.downstream_rq_rx_reset: 0
http.admin.downstream_rq_timeout: 0
http.admin.downstream_rq_too_large: 0
http.admin.downstream_rq_total: 436795
http.admin.downstream_rq_tx_reset: 0
http.admin.downstream_rq_ws_on_non_ws_route: 0
http.admin.rs_too_large: 0
http.async-client.no_cluster: 0
http.async-client.no_route: 0
http.async-client.rq_direct_response: 0
http.async-client.rq_redirect: 0
http.async-client.rq_reset_after_downstream_response_started: 0
http.async-client.rq_retry_skipped_request_not_complete: 0
http.async-client.rq_total: 268897202
http.ingress_http.downstream_cx_active: 208
http.ingress_http.downstream_cx_delayed_close_timeout: 0
http.ingress_http.downstream_cx_destroy: 1319048
http.ingress_http.downstream_cx_destroy_active_rq: 1021788
http.ingress_http.downstream_cx_destroy_local: 0
http.ingress_http.downstream_cx_destroy_local_active_rq: 0
http.ingress_http.downstream_cx_destroy_remote: 1319048
http.ingress_http.downstream_cx_destroy_remote_active_rq: 1021788
http.ingress_http.downstream_cx_drain_close: 0
http.ingress_http.downstream_cx_http1_active: 208
http.ingress_http.downstream_cx_http1_total: 1319256
http.ingress_http.downstream_cx_http2_active: 0
http.ingress_http.downstream_cx_http2_total: 0
http.ingress_http.downstream_cx_http3_active: 0
http.ingress_http.downstream_cx_http3_total: 0
http.ingress_http.downstream_cx_idle_timeout: 0
http.ingress_http.downstream_cx_max_duration_reached: 0
http.ingress_http.downstream_cx_overload_disable_keepalive: 0
http.ingress_http.downstream_cx_protocol_error: 0
http.ingress_http.downstream_cx_rx_bytes_buffered: 311614
http.ingress_http.downstream_cx_rx_bytes_total: 437465468332
http.ingress_http.downstream_cx_ssl_active: 0
http.ingress_http.downstream_cx_ssl_total: 0
http.ingress_http.downstream_cx_total: 1319256
http.ingress_http.downstream_cx_tx_bytes_buffered: 0
http.ingress_http.downstream_cx_tx_bytes_total: 33605216043
http.ingress_http.downstream_cx_upgrades_active: 0
http.ingress_http.downstream_cx_upgrades_total: 0
http.ingress_http.downstream_flow_control_paused_reading_total: 0
http.ingress_http.downstream_flow_control_resumed_reading_total: 0
http.ingress_http.downstream_rq_1xx: 0
http.ingress_http.downstream_rq_2xx: 268631702
http.ingress_http.downstream_rq_3xx: 0
http.ingress_http.downstream_rq_4xx: 4
http.ingress_http.downstream_rq_5xx: 3
http.ingress_http.downstream_rq_active: 1
http.ingress_http.downstream_rq_completed: 268631709
http.ingress_http.downstream_rq_http1_total: 269653498
http.ingress_http.downstream_rq_http2_total: 0
http.ingress_http.downstream_rq_http3_total: 0
http.ingress_http.downstream_rq_idle_timeout: 0
http.ingress_http.downstream_rq_max_duration_reached: 0
http.ingress_http.downstream_rq_non_relative_path: 0
http.ingress_http.downstream_rq_overload_close: 0
http.ingress_http.downstream_rq_response_before_rq_complete: 8
http.ingress_http.downstream_rq_rx_reset: 1021796
http.ingress_http.downstream_rq_timeout: 0
http.ingress_http.downstream_rq_too_large: 0
http.ingress_http.downstream_rq_total: 269653498
http.ingress_http.downstream_rq_tx_reset: 0
http.ingress_http.downstream_rq_ws_on_non_ws_route: 0
http.ingress_http.no_cluster: 0
http.ingress_http.no_route: 0
http.ingress_http.rq_direct_response: 0
http.ingress_http.rq_redirect: 0
http.ingress_http.rq_reset_after_downstream_response_started: 0
http.ingress_http.rq_retry_skipped_request_not_complete: 0
http.ingress_http.rq_total: 269653498
http.ingress_http.rs_too_large: 0
http.ingress_http.tracing.client_enabled: 0
http.ingress_http.tracing.health_check: 0
http.ingress_http.tracing.not_traceable: 0
http.ingress_http.tracing.random_sampling: 0
http.ingress_http.tracing.service_forced: 0
http.ingress_http.user_agent.ios.downstream_cx_destroy_remote_active_rq: 4
http.ingress_http.user_agent.ios.downstream_cx_total: 4
http.ingress_http.user_agent.ios.downstream_rq_total: 4
http1.dropped_headers_with_underscores: 0
http1.metadata_not_supported_error: 0
http1.requests_rejected_with_underscores_in_headers: 0
http1.response_flood: 0
listener.0.0.0.0_80.downstream_cx_active: 208
listener.0.0.0.0_80.downstream_cx_destroy: 1319048
listener.0.0.0.0_80.downstream_cx_total: 1319256
listener.0.0.0.0_80.downstream_pre_cx_active: 0
listener.0.0.0.0_80.downstream_pre_cx_timeout: 0
listener.0.0.0.0_80.http.ingress_http.downstream_rq_1xx: 0
listener.0.0.0.0_80.http.ingress_http.downstream_rq_2xx: 268631702
listener.0.0.0.0_80.http.ingress_http.downstream_rq_3xx: 0
listener.0.0.0.0_80.http.ingress_http.downstream_rq_4xx: 4
listener.0.0.0.0_80.http.ingress_http.downstream_rq_5xx: 3
listener.0.0.0.0_80.http.ingress_http.downstream_rq_completed: 268631709
listener.0.0.0.0_80.no_filter_chain_match: 0
listener.0.0.0.0_80.worker_0.downstream_cx_active: 92
listener.0.0.0.0_80.worker_0.downstream_cx_total: 658828
listener.0.0.0.0_80.worker_1.downstream_cx_active: 116
listener.0.0.0.0_80.worker_1.downstream_cx_total: 660428
listener.admin.downstream_cx_active: 4
listener.admin.downstream_cx_destroy: 370985
listener.admin.downstream_cx_total: 370989
listener.admin.downstream_pre_cx_active: 0
listener.admin.downstream_pre_cx_timeout: 0
listener.admin.http.admin.downstream_rq_1xx: 0
listener.admin.http.admin.downstream_rq_2xx: 436794
listener.admin.http.admin.downstream_rq_3xx: 0
listener.admin.http.admin.downstream_rq_4xx: 0
listener.admin.http.admin.downstream_rq_5xx: 0
listener.admin.http.admin.downstream_rq_completed: 436794
listener.admin.main_thread.downstream_cx_active: 4
listener.admin.main_thread.downstream_cx_total: 370989
listener.admin.no_filter_chain_match: 0
listener_manager.listener_added: 1
listener_manager.listener_create_failure: 0
listener_manager.listener_create_success: 2
listener_manager.listener_modified: 0
listener_manager.listener_removed: 0
listener_manager.listener_stopped: 0
listener_manager.total_listeners_active: 1
listener_manager.total_listeners_draining: 0
listener_manager.total_listeners_warming: 0
listener_manager.workers_started: 1
runtime.admin_overrides_active: 0
runtime.deprecated_feature_use: 0
runtime.load_error: 0
runtime.load_success: 1
runtime.num_keys: 0
runtime.num_layers: 2
runtime.override_dir_exists: 0
runtime.override_dir_not_exists: 1
server.concurrency: 2
server.days_until_first_cert_expiring: 2147483647
server.debug_assertion_failures: 0
server.dynamic_unknown_fields: 0
server.hot_restart_epoch: 0
server.hot_restart_generation: 1
server.live: 1
server.main_thread.watchdog_mega_miss: 0
server.main_thread.watchdog_miss: 3378
server.memory_allocated: 11590704
server.memory_heap_size: 63963136
server.memory_physical_size: 66846720
server.parent_connections: 0
server.state: 0
server.static_unknown_fields: 0
server.stats_recent_lookups: 0
server.total_connections: 208
server.uptime: 329090
server.version: 3474644
server.watchdog_mega_miss: 1
server.watchdog_miss: 16783
server.worker_0.watchdog_mega_miss: 1
server.worker_0.watchdog_miss: 6729
server.worker_1.watchdog_mega_miss: 0
server.worker_1.watchdog_miss: 6676
vhost.entrypoint.vcluster.other.upstream_rq_retry: 0
vhost.entrypoint.vcluster.other.upstream_rq_retry_limit_exceeded: 0
vhost.entrypoint.vcluster.other.upstream_rq_retry_overflow: 0
vhost.entrypoint.vcluster.other.upstream_rq_retry_success: 0
vhost.entrypoint.vcluster.other.upstream_rq_timeout: 0
vhost.entrypoint.vcluster.other.upstream_rq_total: 0
cluster.rtbd-prod-v2.external.upstream_rq_time: P0(0.0,0.0) P25(0.0,0.0) P50(0.0,0.0) P75(0.0,0.0) P90(0.0,1.07174) P95(0.0,3.0639) P99(1.09034,34.9734) P99.5(2.06162,68.9649) P99.9(7.08505,178.449) P100(14.0,2300.0)
cluster.rtbd-prod-v2.upstream_cx_connect_ms: P0(nan,0.0) P25(nan,19.3388) P50(nan,34.4693) P75(nan,60.2587) P90(nan,99.4070) P95(nan,142.931) P99(nan,271.668) P99.5(nan,327.472) P99.9(nan,537.135) P100(nan,1700.0)
cluster.rtbd-prod-v2.upstream_cx_length_ms: P0(nan,0.0) P25(nan,36871.2) P50(nan,54393.5) P75(nan,109632.) P90(nan,169333.) P95(nan,214317.) P99(nan,327479.) P99.5(nan,374922.) P99.9(nan,489505.) P100(nan,3.6e+06)
cluster.rtbd-prod-v2.upstream_rq_time: P0(0.0,0.0) P25(0.0,0.0) P50(0.0,0.0) P75(0.0,0.0) P90(0.0,1.07174) P95(0.0,3.0639) P99(1.09034,34.9734) P99.5(2.06162,68.9649) P99.9(7.08505,178.449) P100(14.0,2300.0)
cluster.rtbd-staging.internal.upstream_rq_time: P0(21.0,21.0) P25(24.4953,25.2731) P50(25.5169,26.2164) P75(26.5476,27.1985) P90(27.3783,28.4119) P95(27.786,30.424) P99(28.9561,69.5672) P99.5(30.1313,113.016) P99.9(35.1367,334.756) P100(39.0,15000.0)
cluster.rtbd-staging.upstream_cx_connect_ms: P0(nan,0.0) P25(nan,31.7553) P50(nan,43.2167) P75(nan,63.5006) P90(nan,100.979) P95(nan,141.379) P99(nan,247.293) P99.5(nan,294.011) P99.9(nan,490.818) P100(nan,1800.0)
cluster.rtbd-staging.upstream_cx_length_ms: P0(nan,760.0) P25(nan,93634.2) P50(nan,184512.) P75(nan,366540.) P90(nan,620720.) P95(nan,821345.) P99(nan,1.28778e+06) P99.5(nan,1.48357e+06) P99.9(nan,1.94783e+06) P100(nan,3.4e+06)
cluster.rtbd-staging.upstream_rq_time: P0(21.0,21.0) P25(24.4953,25.2731) P50(25.5169,26.2164) P75(26.5476,27.1985) P90(27.3783,28.4119) P95(27.786,30.424) P99(28.9561,69.5672) P99.5(30.1313,113.016) P99.9(35.1367,334.756) P100(39.0,15000.0)
cluster.wind-prod-v2.external.upstream_rq_time: P0(0.0,0.0) P25(0.0,0.0) P50(0.0,0.0) P75(0.0,0.0) P90(0.0,1.0386) P95(0.0,3.03342) P99(0.0,32.8201) P99.5(0.0,65.3433) P99.9(0.0,199.804) P100(0.0,1600.0)
cluster.wind-prod-v2.upstream_cx_connect_ms: P0(nan,0.0) P25(nan,0.0) P50(nan,16.2857) P75(nan,42.5) P90(nan,74.1) P95(nan,101.0) P99(nan,263.2) P99.5(nan,277.2) P99.9(nan,734.32) P100(nan,740.0)
cluster.wind-prod-v2.upstream_cx_length_ms: P0(nan,300000.) P25(nan,858750.) P50(nan,1.65909e+06) P75(nan,2.84062e+06) P90(nan,4.39e+06) P95(nan,5.985e+06) P99(nan,1.05617e+07) P99.5(nan,1.10617e+07) P99.9(nan,1.18123e+07) P100(nan,1.2e+07)
cluster.wind-prod-v2.upstream_rq_time: P0(0.0,0.0) P25(0.0,0.0) P50(0.0,0.0) P75(0.0,0.0) P90(0.0,1.0386) P95(0.0,3.03342) P99(0.0,32.8201) P99.5(0.0,65.3433) P99.9(0.0,199.804) P100(0.0,1600.0)
http.admin.downstream_cx_length_ms: P0(0.0,0.0) P25(0.0,0.0) P50(0.0,0.0) P75(0.0,0.0) P90(0.0,0.0) P95(0.0,0.0) P99(0.0,1.00791) P99.5(0.0,2.05736) P99.9(0.0,27.224) P100(0.0,2.5e+08)
http.admin.downstream_rq_time: P0(0.0,0.0) P25(0.0,0.0) P50(0.0,0.0) P75(0.0,0.0) P90(4.05,3.09519) P95(4.075,4.07041) P99(4.095,13.5023) P99.5(4.0975,25.0353) P99.9(4.0995,205.751) P100(4.1,490.0)
http.ingress_http.downstream_cx_length_ms: P0(nan,0.0) P25(nan,3503.85) P50(nan,42373.6) P75(nan,86809.1) P90(nan,148048.) P95(nan,201285.) P99(nan,317895.) P99.5(nan,369358.) P99.9(nan,491335.) P100(nan,1.4e+06)
http.ingress_http.downstream_rq_time: P0(0.0,0.0) P25(0.0,0.0) P50(0.0,0.0) P75(0.0,0.0) P90(1.00138,1.08627) P95(1.05437,3.07838) P99(1.09676,35.2558) P99.5(2.06885,69.2689) P99.9(7.08475,179.255) P100(14.0,2300.0)
http.ingress_http.user_agent.ios.downstream_cx_length_ms: P0(nan,70000.0) P25(nan,71000.0) P50(nan,74000.0) P75(nan,85000.0) P90(nan,166000.) P95(nan,168000.) P99(nan,169600.) P99.5(nan,169800.) P99.9(nan,169960.) P100(nan,170000.)
listener.0.0.0.0_80.downstream_cx_length_ms: P0(nan,0.0) P25(nan,3503.85) P50(nan,42373.5) P75(nan,86809.1) P90(nan,148048.) P95(nan,201285.) P99(nan,317895.) P99.5(nan,369358.) P99.9(nan,491335.) P100(nan,1.4e+06)
listener.admin.downstream_cx_length_ms: P0(0.0,0.0) P25(0.0,0.0) P50(0.0,0.0) P75(0.0,0.0) P90(0.0,0.0) P95(0.0,0.0) P99(0.0,1.01237) P99.5(0.0,2.06713) P99.9(0.0,28.1129) P100(0.0,2.5e+08)
server.initialization_time_ms: P0(nan,14.0) P25(nan,14.25) P50(nan,14.5) P75(nan,14.75) P90(nan,14.9) P95(nan,14.95) P99(nan,14.99) P99.5(nan,14.995) P99.9(nan,14.999) P100(nan,15.0)
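Reading the counters above together: roughly 99.7% of the 780,144 upstream connection destroys are local with an active request in flight, and the sum of cancelled upstream requests (243,731) and those local destroys (778,060) lands within a few counts of the downstream rx reset total (1,021,796). That arithmetic suggests downstream clients resetting in-flight requests, which then tears down the associated upstream connections, rather than a problem originating upstream. A small awk sketch of the arithmetic (all values copied from the dump above):

```shell
# Relate the destroy/reset counters from the stats dump above.
awk 'BEGIN {
  downstream_rx_reset  = 1021796   # http.ingress_http.downstream_rq_rx_reset
  upstream_cancelled   = 243731    # upstream_rq_cancelled
  local_destroy_active = 778060    # upstream_cx_destroy_local_with_active_rq
  destroy_total        = 780144    # upstream_cx_destroy
  printf "cancelled + local destroys w/ active rq = %d\n", upstream_cancelled + local_destroy_active
  printf "downstream rx resets                    = %d\n", downstream_rx_reset
  printf "local destroys w/ active rq = %.1f%% of all destroys\n", 100 * local_destroy_active / destroy_total
}'
```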
Unfortunately, I can only reproduce this behavior in production, and enabling the trace log level produces a lot of log errors. I was still able to gather some logs that might prove useful:
[2020-07-10 08:36:47.374][11][trace][main] [source/common/event/dispatcher_impl.cc:163] item added to deferred deletion list (size=1094)
[2020-07-10 08:36:47.374][11][trace][main] [source/common/event/dispatcher_impl.cc:163] item added to deferred deletion list (size=1093)
[2020-07-10 08:36:47.374][11][debug][pool] [source/common/http/conn_pool_base.cc:265] [C61207] client disconnected, failure reason:
[2020-07-10 08:36:47.374][11][debug][client] [source/common/http/codec_client.cc:91] [C61207] disconnect. resetting 0 pending requests
[2020-07-10 08:36:47.374][11][debug][connection] [source/common/network/connection_impl.cc:200] [C61207] closing socket: 1
[2020-07-10 08:36:47.374][11][debug][connection] [source/common/network/connection_impl.cc:109] [C61207] closing data_to_write=0 type=1
[2020-07-10 08:36:47.374][11][trace][main] [source/common/event/dispatcher_impl.cc:163] item added to deferred deletion list (size=1092)
[2020-07-10 08:36:47.374][11][debug][router] [source/common/router/upstream_request.cc:263] [C60447][S3734311586874250871] resetting pool request
[2020-07-10 08:36:47.374][11][debug][http] [source/common/http/conn_manager_impl.cc:1936] [C60447][S3734311586874250871] stream reset
[2020-07-10 08:36:47.374][11][debug][connection] [source/common/network/connection_impl.cc:200] [C60447] closing socket: 0
[2020-07-10 08:36:47.374][11][debug][connection] [source/common/network/connection_impl.cc:514] [C60447] remote early close
[2020-07-10 08:36:47.374][11][trace][connection] [source/common/network/connection_impl.cc:493] [C60447] socket event: 6
[2020-07-10 08:36:47.374][11][trace][connection] [source/common/network/raw_buffer_socket.cc:68] [C61261] write returns: 1365
[2020-07-10 08:36:47.374][11][trace][connection] [source/common/network/connection_impl.cc:581] [C61261] write ready
[2020-07-10 08:36:47.436][12][trace][connection] [source/common/network/raw_buffer_socket.cc:39] [C61266] read error: Resource temporarily unavailable
[2020-07-10 08:36:47.437][12][debug][pool] [source/common/http/conn_pool_base.cc:265] [C61256] client disconnected, failure reason:
[2020-07-10 08:36:47.437][12][debug][client] [source/common/http/codec_client.cc:91] [C61256] disconnect. resetting 0 pending requests
[2020-07-10 08:36:47.437][12][debug][connection] [source/common/network/connection_impl.cc:200] [C61256] closing socket: 1
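When sifting through trace logs like these, it helps to follow a single connection ID end to end. A small sketch — the log lines here are abbreviated copies of the excerpt above, and the file name is arbitrary:

```shell
# Abbreviated copy of the excerpt above, so the filter can be shown;
# in practice this would be the real Envoy log file.
cat > envoy.log <<'EOF'
[2020-07-10 08:36:47.374][11][debug][router] [C60447][S3734311586874250871] resetting pool request
[2020-07-10 08:36:47.374][11][debug][http] [C60447][S3734311586874250871] stream reset
[2020-07-10 08:36:47.374][11][debug][connection] [C60447] closing socket: 0
[2020-07-10 08:36:47.374][11][debug][connection] [C60447] remote early close
[2020-07-10 08:36:47.374][11][trace][connection] [C61261] write ready
EOF

# Follow one connection: every event for C60447, in log order.
grep -F '[C60447]' envoy.log
```

Here the [C60447] lifecycle shows a downstream "remote early close" followed by a stream reset and a pool-request reset, consistent with the client dropping the connection mid-request.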
If you need more logs, I will try enabling only the debug level again.
Hi. Any chance someone would have some time to take a look? Thanks a lot.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
/remove-lifecycle stale
Hello, we are still seeing a high number of dropped connections. I have tried many different configuration scenarios, but no luck so far.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
/remove-lifecycle stale
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Hi,
I have been experiencing an issue with destroyed upstream connections for a long time now, producing 503 errors and higher response times. After digging through Envoy, application, and Kubernetes node system logs, and tweaking some kernel network settings for quite a while, I could not find any explanation for this issue.
Envoy 1.14.1 is deployed with hostPort, 3 replicas, in an EKS 1.16 cluster behind an ALB. The upstream cluster is a headless service (a Real-Time Bidding fasthttp Go application) with 6 pods running. On average, 650 keep-alive connections are open from clients (service providers), sending approximately 4,000 requests/second.
Below is an example of Envoy dashboard for this upstream service:
Here is an example of the cluster stats showing destroyed connections:
Both Envoy and the application deployment have the following sysctls configured:
Example Envoy configuration:
I have to point out that none of these issues appear under the same traffic load when routing directly to the application. I have also tried disabling mirroring and scaling up the application pods, without any satisfactory results.
Would you have any advice on how to solve this issue?
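For readers hitting a similar pattern with long-lived keep-alive connections, the HTTP idle timeouts on both sides are worth reviewing. The fragment below is a hypothetical sketch only, not the reporter's configuration: `common_http_protocol_options.idle_timeout` (on the cluster) and `stream_idle_timeout` (on the HTTP connection manager) are standard Envoy fields, but the values and names here are illustrative.

```yaml
static_resources:
  clusters:
  - name: rtbd-prod-v2            # cluster name taken from the stats above
    connect_timeout: 1s
    # Upstream side: how long an idle keep-alive connection stays pooled.
    common_http_protocol_options:
      idle_timeout: 300s
  listeners:
  - name: ingress
    # Inside the http_connection_manager filter (abridged to comments):
    #   stream_idle_timeout: 300s          # per-stream inactivity timeout
    #   common_http_protocol_options:
    #     idle_timeout: 900s               # downstream keep-alive idle timeout
```

Aligning these with the client-side and ALB idle timeouts can reduce races where one side closes a connection the other still considers usable.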