Open stevenzzzz opened 2 years ago
cc @jmarantz
These are the subgroups I plan to have for clusters.
/**
 * All cluster config update related stats.
 */
TEXT_READOUT, STATNAME)
  COUNTER(assignment_stale)
  COUNTER(assignment_timeout_received)
  COUNTER(update_attempt)
  COUNTER(update_empty)
  COUNTER(update_failure)
  COUNTER(update_no_rebuild)
  COUNTER(update_success)
  GAUGE(version, NeverImport)

/**
 * All cluster endpoints related stats.
 */
STATNAME)
  GAUGE(max_host_weight, NeverImport)
  COUNTER(membership_change)
  GAUGE(membership_degraded, NeverImport)
  GAUGE(membership_excluded, NeverImport)
  GAUGE(membership_healthy, NeverImport)
  GAUGE(membership_total, NeverImport)

/**
 * All cluster loadbalancing related stats.
 */
STATNAME)
  COUNTER(lb_healthy_panic)
  COUNTER(lb_local_cluster_not_ok)
  COUNTER(lb_recalculate_zone_structures)
  COUNTER(lb_subsets_created)
  COUNTER(lb_subsets_fallback)
  COUNTER(lb_subsets_fallback_panic)
  COUNTER(lb_subsets_removed)
  COUNTER(lb_subsets_selected)
  COUNTER(lb_zone_cluster_too_small)
  COUNTER(lb_zone_no_capacity_left)
  COUNTER(lb_zone_number_differs)
  COUNTER(lb_zone_routing_all_directly)
  COUNTER(lb_zone_routing_cross_zone)
  COUNTER(lb_zone_routing_sampled)
  GAUGE(lb_subsets_active, Accumulate)

/**
 * All cluster stats. @see stats_macros.h
 */
  COUNTER(bind_errors)
  COUNTER(original_dst_host_invalid)
  COUNTER(retry_or_shadow_abandoned)
  COUNTER(upstream_cx_close_notify)
  COUNTER(upstream_cx_connect_attempts_exceeded)
  COUNTER(upstream_cx_connect_fail)
  COUNTER(upstream_cx_connect_timeout)
  COUNTER(upstream_cx_connect_with_0_rtt)
  COUNTER(upstream_cx_destroy)
  COUNTER(upstream_cx_destroy_local)
  COUNTER(upstream_cx_destroy_local_with_active_rq)
  COUNTER(upstream_cx_destroy_remote)
  COUNTER(upstream_cx_destroy_remote_with_active_rq)
  COUNTER(upstream_cx_destroy_with_active_rq)
  COUNTER(upstream_cx_http1_total)
  COUNTER(upstream_cx_http2_total)
  COUNTER(upstream_cx_http3_total)
  COUNTER(upstream_cx_idle_timeout)
  COUNTER(upstream_cx_max_duration_reached)
  COUNTER(upstream_cx_max_requests)
  COUNTER(upstream_cx_none_healthy)
  COUNTER(upstream_cx_overflow)
  COUNTER(upstream_cx_pool_overflow)
  COUNTER(upstream_cx_protocol_error)
  COUNTER(upstream_cx_rx_bytes_total)
  COUNTER(upstream_cx_total)
  COUNTER(upstream_cx_tx_bytes_total)
  COUNTER(upstream_flow_control_backed_up_total)
  COUNTER(upstream_flow_control_drained_total)
  COUNTER(upstream_flow_control_paused_reading_total)
  COUNTER(upstream_flow_control_resumed_reading_total)
  COUNTER(upstream_internal_redirect_failed_total)
  COUNTER(upstream_internal_redirect_succeeded_total)
  COUNTER(upstream_rq_cancelled)
  COUNTER(upstream_rq_completed)
  COUNTER(upstream_rq_maintenance_mode)
  COUNTER(upstream_rq_max_duration_reached)
  COUNTER(upstream_rq_pending_failure_eject)
  COUNTER(upstream_rq_pending_overflow)
  COUNTER(upstream_rq_pending_total)
  COUNTER(upstream_rq_0rtt)
  COUNTER(upstream_rq_per_try_timeout)
  COUNTER(upstream_rq_per_try_idle_timeout)
  COUNTER(upstream_rq_retry)
  COUNTER(upstream_rq_retry_backoff_exponential)
  COUNTER(upstream_rq_retry_backoff_ratelimited)
  COUNTER(upstream_rq_retry_limit_exceeded)
  COUNTER(upstream_rq_retry_overflow)
  COUNTER(upstream_rq_retry_success)
  COUNTER(upstream_rq_rx_reset)
  COUNTER(upstream_rq_timeout)
  COUNTER(upstream_rq_total)
  COUNTER(upstream_rq_tx_reset)
  COUNTER(upstream_http3_broken)
  GAUGE(upstream_cx_active, Accumulate)
  GAUGE(upstream_cx_rx_bytes_buffered, Accumulate)
  GAUGE(upstream_cx_tx_bytes_buffered, Accumulate)
  GAUGE(upstream_rq_active, Accumulate)
  GAUGE(upstream_rq_pending_active, Accumulate)
  HISTOGRAM(upstream_cx_connect_ms, Milliseconds)
  HISTOGRAM(upstream_cx_length_ms, Milliseconds)
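For anyone not familiar with the pattern, lists like these are usually consumed as X-macros: the list is passed "expander" macros that turn each entry into a struct member. The following is only a rough sketch of that idea, not the contents of stats_macros.h. The `Counter` and `Gauge` types, the `EXAMPLE_ENDPOINT_STATS` list name, and the `GENERATE_*` expanders are all made up for illustration.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-ins for real stat types (assumption, not Envoy classes).
struct Counter {
  std::atomic<uint64_t> value{0};
  void inc() { value.fetch_add(1, std::memory_order_relaxed); }
};
struct Gauge {
  std::atomic<uint64_t> value{0};
  void set(uint64_t v) { value.store(v, std::memory_order_relaxed); }
};

// A small subgroup written in the same style as the lists above.
// The macro name is invented for this sketch.
#define EXAMPLE_ENDPOINT_STATS(COUNTER, GAUGE)                                 \
  COUNTER(membership_change)                                                   \
  GAUGE(membership_healthy, NeverImport)                                       \
  GAUGE(membership_total, NeverImport)

// Expanders: each list entry becomes one struct member.
// The gauge import mode is accepted but ignored in this simplified version.
#define GENERATE_COUNTER_MEMBER(NAME) Counter NAME##_;
#define GENERATE_GAUGE_MEMBER(NAME, MODE) Gauge NAME##_;

// Every entry in the list costs memory as soon as this struct is created,
// whether or not the stat is ever touched.
struct ExampleEndpointStats {
  EXAMPLE_ENDPOINT_STATS(GENERATE_COUNTER_MEMBER, GENERATE_GAUGE_MEMBER)
};
```

The point for this issue is that the whole struct is paid for up front, so splitting one big list into several smaller ones is what makes it possible to allocate each group independently.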
Title: stats: lazy init stats to save RAM (and CPU)
Description:
With lots of clusters and route tables in a cloud proxy, we are seeing a lot of RAM being spent on stats, while most of those stats are never incremented because of the traffic pattern (long tail). We are thinking that we can lazy-init cluster stats() so that the RAM is only allocated when it is required.
To achieve that we need finer-grained stats groups. For example, configUpdateStats() are frequently updated by the config management server, while the upstream_xxx stats are only required when there is traffic for the cluster. For that second subgroup we can save RAM by lazy-initializing it.
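A minimal sketch of what the split could look like, assuming a config-update group that is always allocated and a traffic group that is allocated on first access. This is not a proposed implementation: `LazyInit`, `TrafficStats`, `trafficStats()`, and `trafficStatsAllocated()` are hypothetical names, and the `Counter` type is a stand-in for a real stats counter.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical counter type (assumption, not a real stats class).
struct Counter {
  std::atomic<uint64_t> value{0};
  void inc() { value.fetch_add(1, std::memory_order_relaxed); }
};

// Updated by the config management server, so always allocated.
struct ConfigUpdateStats {
  Counter update_attempt;
  Counter update_success;
};

// Only needed once the cluster sees traffic, so allocated lazily.
struct TrafficStats {
  Counter upstream_cx_total;
  Counter upstream_rq_total;
};

// Allocates T on first call to get(). If two threads race, one allocation
// wins via compare-exchange and the loser deletes its copy.
template <typename T> class LazyInit {
public:
  LazyInit() = default;
  LazyInit(const LazyInit&) = delete;
  LazyInit& operator=(const LazyInit&) = delete;
  ~LazyInit() { delete ptr_.load(std::memory_order_acquire); }

  T& get() {
    T* existing = ptr_.load(std::memory_order_acquire);
    if (existing != nullptr) {
      return *existing;
    }
    T* fresh = new T();
    if (ptr_.compare_exchange_strong(existing, fresh, std::memory_order_acq_rel,
                                     std::memory_order_acquire)) {
      return *fresh;
    }
    delete fresh;     // Another thread published its instance first.
    return *existing; // compare_exchange_strong stored the winner here.
  }

  bool initialized() const {
    return ptr_.load(std::memory_order_acquire) != nullptr;
  }

private:
  std::atomic<T*> ptr_{nullptr};
};

class ClusterStats {
public:
  ConfigUpdateStats& configUpdateStats() { return config_update_; }
  TrafficStats& trafficStats() { return traffic_.get(); }
  bool trafficStatsAllocated() const { return traffic_.initialized(); }

private:
  ConfigUpdateStats config_update_; // eager
  LazyInit<TrafficStats> traffic_;  // lazy
};
```

With this shape, a cluster that only receives config updates holds one null pointer for the traffic group instead of the full set of upstream_xxx stats. A stats collector could also check `trafficStatsAllocated()` and skip groups that were never touched, which is where the CPU saving would come from.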
Here is an example of cluster stats: when there is no traffic, Envoy still pays the RAM cost of holding all the zero-valued stats, as well as the CPU burnt on collecting them.