cloudflare / cloudflared

Cloudflare Tunnel client (formerly Argo Tunnel)
https://developers.cloudflare.com/cloudflare-one/connections/connect-apps/install-and-setup/tunnel-guide
Apache License 2.0

🐛 Release 2023.5.1 broke some metrics - quic_client_receive_bytes is missing. #1098

Open darron opened 7 months ago

darron commented 7 months ago

Describe the bug

Since the 2023.5.1 release, quic_client_receive_bytes is no longer exposed on the Prometheus /metrics page.

To Reproduce

Steps to reproduce the behavior:

  1. Run 2023.5.0 - the metrics appear as expected.
  2. Upgrade to 2023.5.1 - the metrics disappear.
  3. Revert to the old version - the metrics reappear.
  4. Update to the newest version, 2023.10.0 - the metrics are still missing.
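
To make each step above checkable without reading the page by eye, here is a minimal Python sketch that scans Prometheus text output for required metric names. The function names (`metric_names`, `missing_metrics`, `report_missing`) are made up for illustration and are not part of cloudflared; the parsing only looks at the name at the start of each sample line.

```python
import re
import urllib.request

# Captures the metric name at the start of a sample line, e.g.
# 'quic_client_latest_rtt{conn_index="0"} 7' -> 'quic_client_latest_rtt'
_SAMPLE_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def metric_names(exposition_text):
    """Return the set of metric names that have at least one sample line."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition_text, required):
    """Return the required metric names that are absent, sorted."""
    present = metric_names(exposition_text)
    return sorted(name for name in required if name not in present)


def report_missing(url, required):
    """Fetch a /metrics page over HTTP and return the missing names.

    The URL is supplied by the caller; this sketch does not assume
    any particular address for the metrics endpoint.
    """
    with urllib.request.urlopen(url, timeout=5) as response:
        text = response.read().decode("utf-8")
    return missing_metrics(text, required)
```

Calling `report_missing("http://127.0.0.1:2000/metrics", ["quic_client_receive_bytes"])` after each step would return an empty list while the metric is present and `["quic_client_receive_bytes"]` once it is gone. The address here is an assumption; substitute whatever address your metrics endpoint listens on.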

Screenshot 2023-11-03 at 1 30 45 PM

Expected behavior

I expect the metrics to still be present - we use them for our Grafana dashboards.

NOTE: Other metrics might be missing too - this is just the one I noticed because we use it.
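
One way to find out whether other metrics went missing is to save the /metrics output from both versions and compare the metric families declared on the `# TYPE` lines. The sketch below is a rough illustration under that assumption; `metric_families` and `diff_families` are hypothetical helper names, and the file names in the usage note are placeholders.

```python
def metric_families(exposition_text):
    """Map each metric family name to its type, read from '# TYPE' lines."""
    families = {}
    for line in exposition_text.splitlines():
        parts = line.split()
        # A type line has the shape: '# TYPE <name> <type>'
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            families[parts[2]] = parts[3]
    return families


def diff_families(old_text, new_text):
    """Return (removed, added) family names between two /metrics dumps."""
    old = set(metric_families(old_text))
    new = set(metric_families(new_text))
    return sorted(old - new), sorted(new - old)
```

For example, reading two saved dumps (say `metrics-2023.5.0.txt` and `metrics-2023.10.0.txt`, both placeholder names) and passing their contents to `diff_families` gives the families that exist only in the old release as the first element of the result. A family that is registered but has not produced any samples may not be printed at all, so an entry in the removed list is a hint to investigate rather than proof of removal.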

Environment and versions

Logs and errors

The /metrics page doesn't have those values anymore:

# HELP build_info Build and version information
# TYPE build_info gauge
build_info{goversion="go1.20.6",revision="2023-10-31-1231 UTC",type="",version="2023.10.0"} 1
# HELP cloudflared_config_local_config_pushes Number of local configuration pushes to the edge
# TYPE cloudflared_config_local_config_pushes counter
cloudflared_config_local_config_pushes 1
# HELP cloudflared_config_local_config_pushes_errors Number of errors occurred during local configuration pushes
# TYPE cloudflared_config_local_config_pushes_errors counter
cloudflared_config_local_config_pushes_errors 0
# HELP cloudflared_orchestration_config_version Configuration Version
# TYPE cloudflared_orchestration_config_version gauge
cloudflared_orchestration_config_version 0
# HELP cloudflared_tcp_active_sessions Concurrent count of TCP sessions that are being proxied to any origin
# TYPE cloudflared_tcp_active_sessions gauge
cloudflared_tcp_active_sessions 0
# HELP cloudflared_tcp_total_sessions Total count of TCP sessions that have been proxied to any origin
# TYPE cloudflared_tcp_total_sessions counter
cloudflared_tcp_total_sessions 0
# HELP cloudflared_tunnel_active_streams Number of active streams created by all muxers.
# TYPE cloudflared_tunnel_active_streams gauge
cloudflared_tunnel_active_streams 0
# HELP cloudflared_tunnel_concurrent_requests_per_tunnel Concurrent requests proxied through each tunnel
# TYPE cloudflared_tunnel_concurrent_requests_per_tunnel gauge
cloudflared_tunnel_concurrent_requests_per_tunnel 0
# HELP cloudflared_tunnel_ha_connections Number of active ha connections
# TYPE cloudflared_tunnel_ha_connections gauge
cloudflared_tunnel_ha_connections 4
# HELP cloudflared_tunnel_request_errors Count of error proxying to origin
# TYPE cloudflared_tunnel_request_errors counter
cloudflared_tunnel_request_errors 0
# HELP cloudflared_tunnel_response_by_code Count of responses by HTTP status code
# TYPE cloudflared_tunnel_response_by_code counter
cloudflared_tunnel_response_by_code{status_code="200"} 1782
cloudflared_tunnel_response_by_code{status_code="302"} 4
cloudflared_tunnel_response_by_code{status_code="304"} 249
cloudflared_tunnel_response_by_code{status_code="500"} 1
# HELP cloudflared_tunnel_server_locations Where each tunnel is connected to. 1 means current location, 0 means previous locations.
# TYPE cloudflared_tunnel_server_locations gauge
cloudflared_tunnel_server_locations{connection_id="0",edge_location="sea06"} 1
cloudflared_tunnel_server_locations{connection_id="1",edge_location="lax08"} 1
cloudflared_tunnel_server_locations{connection_id="2",edge_location="lax06"} 1
cloudflared_tunnel_server_locations{connection_id="3",edge_location="sea05"} 1
# HELP cloudflared_tunnel_timer_retries Unacknowledged heart beats count
# TYPE cloudflared_tunnel_timer_retries gauge
cloudflared_tunnel_timer_retries 0
# HELP cloudflared_tunnel_total_requests Amount of requests proxied through all the tunnels
# TYPE cloudflared_tunnel_total_requests counter
cloudflared_tunnel_total_requests 2061
# HELP cloudflared_tunnel_tunnel_authenticate_success Count of successful tunnel authenticate
# TYPE cloudflared_tunnel_tunnel_authenticate_success counter
cloudflared_tunnel_tunnel_authenticate_success 0
# HELP cloudflared_tunnel_tunnel_register_success Count of successful tunnel registrations
# TYPE cloudflared_tunnel_tunnel_register_success counter
cloudflared_tunnel_tunnel_register_success{rpcName="registerConnection"} 4
# HELP cloudflared_udp_active_sessions Concurrent count of UDP sessions that are being proxied to any origin
# TYPE cloudflared_udp_active_sessions gauge
cloudflared_udp_active_sessions 0
# HELP cloudflared_udp_total_sessions Total count of UDP sessions that have been proxied to any origin
# TYPE cloudflared_udp_total_sessions gauge
cloudflared_udp_total_sessions 0
# HELP coredns_panics_total A metrics that counts the number of panics.
# TYPE coredns_panics_total counter
coredns_panics_total 0
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 7.8504e-05
go_gc_duration_seconds{quantile="0.25"} 0.000182535
go_gc_duration_seconds{quantile="0.5"} 0.000198583
go_gc_duration_seconds{quantile="0.75"} 0.000250047
go_gc_duration_seconds{quantile="1"} 0.000356444
go_gc_duration_seconds_sum 0.007677076
go_gc_duration_seconds_count 35
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 84
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.20.6"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 7.381856e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 1.67850952e+08
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.57094e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 2.022908e+06
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 8.879032e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 7.381856e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 5.9392e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 9.95328e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 51589
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 4.947968e+06
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 1.589248e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.699034395502704e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 2.074497e+06
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 2400
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 31200
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 146400
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 212160
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 1.546172e+07
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 795446
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 884736
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 884736
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 2.8265994e+07
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 8
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 18.33
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 18
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 4.3446272e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.69903043738e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 7.63011072e+08
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 134
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
# HELP quic_client_closed_connections Number of connections that has been closed
# TYPE quic_client_closed_connections counter
quic_client_closed_connections 0
# HELP quic_client_latest_rtt Latest RTT measured on a connection
# TYPE quic_client_latest_rtt gauge
quic_client_latest_rtt{conn_index="0"} 7
quic_client_latest_rtt{conn_index="1"} 25
quic_client_latest_rtt{conn_index="2"} 25
quic_client_latest_rtt{conn_index="3"} 8
# HELP quic_client_lost_packets Number of packets that have been lost from a connection
# TYPE quic_client_lost_packets counter
quic_client_lost_packets{conn_index="0",reason="timeout"} 1
quic_client_lost_packets{conn_index="1",reason="timeout"} 1
quic_client_lost_packets{conn_index="2",reason="timeout"} 1
quic_client_lost_packets{conn_index="3",reason="reordering"} 15
# HELP quic_client_min_rtt Lowest RTT measured on a connection in millisec
# TYPE quic_client_min_rtt gauge
quic_client_min_rtt{conn_index="0"} 6
quic_client_min_rtt{conn_index="1"} 24
quic_client_min_rtt{conn_index="2"} 24
quic_client_min_rtt{conn_index="3"} 6
# HELP quic_client_packet_too_big_dropped Count of packets received from origin that are too big to send to the edge and are dropped as a result
# TYPE quic_client_packet_too_big_dropped counter
quic_client_packet_too_big_dropped 0
# HELP quic_client_smoothed_rtt Calculated smoothed RTT measured on a connection in millisec
# TYPE quic_client_smoothed_rtt gauge
quic_client_smoothed_rtt{conn_index="0"} 8
quic_client_smoothed_rtt{conn_index="1"} 25
quic_client_smoothed_rtt{conn_index="2"} 25
quic_client_smoothed_rtt{conn_index="3"} 7
# HELP quic_client_total_connections Number of connections initiated. For all quic metrics, client means the side initiating the connection
# TYPE quic_client_total_connections counter
quic_client_total_connections 4

Additional context

It sort of feels like it happened due to this commit - but there's a ton of changes and I haven't dug any deeper:

https://github.com/cloudflare/cloudflared/commit/9426b603082905d0af8a07bdac866bc1d9c37cba

darron commented 5 months ago

Tested with the new 2024.1.3 release - still the same problem - the metrics disappear.

Screenshot 2024-01-17 at 1 44 51 PM

Looking at metrics.go - I don't see an obvious culprit - it hasn't changed much in years.

darron commented 5 months ago

Here's a copy of the /metrics output, which is still missing the metrics we used to have:

https://gist.github.com/darron/791910fe1998ddca3ede7c9d4a5183bc

darron commented 5 months ago

Just found this:

https://github.com/quic-go/quic-go/issues/4077

Looks like the new package may not have metrics support - will keep watching.