stevenzzzz commented 2 years ago

Title: stats: lazy init stats to save RAM (and CPU)

Description:

With lots of clusters and route tables in a cloud proxy, we are seeing a lot of RAM being spent on stats, while most of those stats are never incremented because of the traffic pattern (a long tail). We are thinking of lazily initializing cluster stats() so that the RAM is only allocated when it is required.

To achieve that, we need finer-grained stats groups. For example, configUpdateStats() are frequently updated by the config management server, while the upstream_xxx stats are only required when there is traffic for the cluster; for this sub-group we can save RAM by lazily initializing it.
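A minimal sketch of the idea, not Envoy's actual implementation: the config-update group is allocated eagerly, while the traffic group is allocated on first access. `LazyStats`, `TrafficStats`, `ConfigUpdateStats`, and `Cluster` are hypothetical names invented for illustration, and plain atomics stand in for real stat objects.

```cpp
#include <atomic>
#include <cstdint>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the traffic-driven sub-group (upstream_xxx).
struct TrafficStats {
  std::atomic<uint64_t> upstream_rq_total{0};
  std::atomic<uint64_t> upstream_cx_total{0};
};

// Hypothetical stand-in for the always-needed config update sub-group.
struct ConfigUpdateStats {
  std::atomic<uint64_t> update_success{0};
};

// Hypothetical wrapper that allocates a stats group only on first use.
template <typename StatsStruct> class LazyStats {
public:
  // Returns the group, allocating it on the first call (thread-safe,
  // double-checked locking on an atomic pointer).
  StatsStruct& get() {
    StatsStruct* p = ptr_.load(std::memory_order_acquire);
    if (p == nullptr) {
      std::lock_guard<std::mutex> lock(mu_);
      p = ptr_.load(std::memory_order_relaxed);
      if (p == nullptr) {
        storage_ = std::make_unique<StatsStruct>();
        p = storage_.get();
        ptr_.store(p, std::memory_order_release);
      }
    }
    return *p;
  }

  // True once the group has been allocated.
  bool initialized() const {
    return ptr_.load(std::memory_order_acquire) != nullptr;
  }

private:
  std::mutex mu_;
  std::atomic<StatsStruct*> ptr_{nullptr};
  std::unique_ptr<StatsStruct> storage_;
};

// A cluster pays for config update stats up front, but for traffic stats
// only once the first request or connection touches them.
struct Cluster {
  ConfigUpdateStats config_update_stats;
  LazyStats<TrafficStats> traffic_stats;
};
```

With this shape, a cluster that never receives traffic never allocates `TrafficStats`, and a stats sink could skip groups for which `initialized()` is false.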

Here is an example of cluster stats: when there is no traffic, Envoy still pays the RAM cost of holding all the zero-valued stats, as well as the CPU burnt on collecting them.

[image: cluster stats example]


stevenzzzz commented 2 years ago

cc @jmarantz

stevenzzzz commented 2 years ago

These are the sub-groups I plan to have for clusters:

    /**
     * All cluster config update related stats.
     */
    #define ALL_CLUSTER_CONFIG_UPDATE_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME) \
      COUNTER(assignment_stale)                                                                \
      COUNTER(assignment_timeout_received)                                                     \
      COUNTER(update_attempt)                                                                  \
      COUNTER(update_empty)                                                                    \
      COUNTER(update_failure)                                                                  \
      COUNTER(update_no_rebuild)                                                               \
      COUNTER(update_success)                                                                  \
      GAUGE(version, NeverImport)

    /**
     * All cluster endpoints related stats.
     */
    #define ALL_CLUSTER_ENDPOINT_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME) \
      GAUGE(max_host_weight, NeverImport)                                                 \
      COUNTER(membership_change)                                                          \
      GAUGE(membership_degraded, NeverImport)                                             \
      GAUGE(membership_excluded, NeverImport)                                             \
      GAUGE(membership_healthy, NeverImport)                                              \
      GAUGE(membership_total, NeverImport)

    /**
     * All cluster loadbalancing related stats.
     */
    #define ALL_CLUSTER_LB_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME) \
      COUNTER(lb_healthy_panic)                                                     \
      COUNTER(lb_local_cluster_not_ok)                                              \
      COUNTER(lb_recalculate_zone_structures)                                       \
      COUNTER(lb_subsets_created)                                                   \
      COUNTER(lb_subsets_fallback)                                                  \
      COUNTER(lb_subsets_fallback_panic)                                            \
      COUNTER(lb_subsets_removed)                                                   \
      COUNTER(lb_subsets_selected)                                                  \
      COUNTER(lb_zone_cluster_too_small)                                            \
      COUNTER(lb_zone_no_capacity_left)                                             \
      COUNTER(lb_zone_number_differs)                                               \
      COUNTER(lb_zone_routing_all_directly)                                         \
      COUNTER(lb_zone_routing_cross_zone)                                           \
      COUNTER(lb_zone_routing_sampled)                                              \
      GAUGE(lb_subsets_active, Accumulate)

    /**
     * All cluster stats. @see stats_macros.h
     */
    #define ALL_CLUSTER_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME) \
      COUNTER(bind_errors)                                                       \
      COUNTER(original_dst_host_invalid)                                         \
      COUNTER(retry_or_shadow_abandoned)                                         \
      COUNTER(upstream_cx_close_notify)                                          \
      COUNTER(upstream_cx_connect_attempts_exceeded)                             \
      COUNTER(upstream_cx_connect_fail)                                          \
      COUNTER(upstream_cx_connect_timeout)                                       \
      COUNTER(upstream_cx_connect_with_0_rtt)                                    \
      COUNTER(upstream_cx_destroy)                                               \
      COUNTER(upstream_cx_destroy_local)                                         \
      COUNTER(upstream_cx_destroy_local_with_active_rq)                          \
      COUNTER(upstream_cx_destroy_remote)                                        \
      COUNTER(upstream_cx_destroy_remote_with_active_rq)                         \
      COUNTER(upstream_cx_destroy_with_active_rq)                                \
      COUNTER(upstream_cx_http1_total)                                           \
      COUNTER(upstream_cx_http2_total)                                           \
      COUNTER(upstream_cx_http3_total)                                           \
      COUNTER(upstream_cx_idle_timeout)                                          \
      COUNTER(upstream_cx_max_duration_reached)                                  \
      COUNTER(upstream_cx_max_requests)                                          \
      COUNTER(upstream_cx_none_healthy)                                          \
      COUNTER(upstream_cx_overflow)                                              \
      COUNTER(upstream_cx_pool_overflow)                                         \
      COUNTER(upstream_cx_protocol_error)                                        \
      COUNTER(upstream_cx_rx_bytes_total)                                        \
      COUNTER(upstream_cx_total)                                                 \
      COUNTER(upstream_cx_tx_bytes_total)                                        \
      COUNTER(upstream_flow_control_backed_up_total)                             \
      COUNTER(upstream_flow_control_drained_total)                               \
      COUNTER(upstream_flow_control_paused_reading_total)                        \
      COUNTER(upstream_flow_control_resumed_reading_total)                       \
      COUNTER(upstream_internal_redirect_failed_total)                           \
      COUNTER(upstream_internal_redirect_succeeded_total)                        \
      COUNTER(upstream_rq_cancelled)                                             \
      COUNTER(upstream_rq_completed)                                             \
      COUNTER(upstream_rq_maintenance_mode)                                      \
      COUNTER(upstream_rq_max_duration_reached)                                  \
      COUNTER(upstream_rq_pending_failure_eject)                                 \
      COUNTER(upstream_rq_pending_overflow)                                      \
      COUNTER(upstream_rq_pending_total)                                         \
      COUNTER(upstream_rq_0rtt)                                                  \
      COUNTER(upstream_rq_per_try_timeout)                                       \
      COUNTER(upstream_rq_per_try_idle_timeout)                                  \
      COUNTER(upstream_rq_retry)                                                 \
      COUNTER(upstream_rq_retry_backoff_exponential)                             \
      COUNTER(upstream_rq_retry_backoff_ratelimited)                             \
      COUNTER(upstream_rq_retry_limit_exceeded)                                  \
      COUNTER(upstream_rq_retry_overflow)                                        \
      COUNTER(upstream_rq_retry_success)                                         \
      COUNTER(upstream_rq_rx_reset)                                              \
      COUNTER(upstream_rq_timeout)                                               \
      COUNTER(upstream_rq_total)                                                 \
      COUNTER(upstream_rq_tx_reset)                                              \
      COUNTER(upstream_http3_broken)                                             \
      GAUGE(upstream_cx_active, Accumulate)                                      \
      GAUGE(upstream_cx_rx_bytes_buffered, Accumulate)                           \
      GAUGE(upstream_cx_tx_bytes_buffered, Accumulate)                           \
      GAUGE(upstream_rq_active, Accumulate)                                      \
      GAUGE(upstream_rq_pending_active, Accumulate)                              \
      HISTOGRAM(upstream_cx_connect_ms, Milliseconds)                            \
      HISTOGRAM(upstream_cx_length_ms, Milliseconds)
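For readers unfamiliar with this macro style, here is a simplified sketch of how one such sub-group list can be expanded into its own struct, which is what makes per-group allocation possible. It is an illustration only: every `DEMO_`-prefixed macro, `DemoEndpointStats`, and `demoEndpointStatNames` are hypothetical, the macro takes two generator arguments instead of five, and plain integers replace Envoy's real stat types.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, trimmed-down sub-group list in the same X-macro style.
#define DEMO_CLUSTER_ENDPOINT_STATS(COUNTER, GAUGE) \
  COUNTER(membership_change)                        \
  GAUGE(membership_healthy)                         \
  GAUGE(membership_total)

// Generators that turn each list entry into a struct member.
#define DEMO_DECLARE_COUNTER(name) uint64_t name{0};
#define DEMO_DECLARE_GAUGE(name) uint64_t name{0};

// One struct per sub-group, so each group can be allocated separately.
struct DemoEndpointStats {
  DEMO_CLUSTER_ENDPOINT_STATS(DEMO_DECLARE_COUNTER, DEMO_DECLARE_GAUGE)
};

// Generator that turns each list entry into its stat name.
#define DEMO_APPEND_NAME(name) names.push_back(#name);

// Returns the stat names of the sub-group in declaration order.
inline std::vector<std::string> demoEndpointStatNames() {
  std::vector<std::string> names;
  DEMO_CLUSTER_ENDPOINT_STATS(DEMO_APPEND_NAME, DEMO_APPEND_NAME)
  return names;
}
```

Because each list expands into an independent struct, splitting one large list into several smaller ones lets every resulting group be created on its own schedule.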

wbpcode commented 2 years ago

cc @jmarantz

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

envoyproxy / envoy

[stats]: lazy init stats to save RAM (and CPU) #23575
