Azure / prometheus-collector

[Draft PR][Don't merge] Fix telemetry spike issue in telegraf removal #920

Closed Sohamdg081992 closed 3 months ago

Sohamdg081992 commented 3 months ago

PR Description

This PR fixes the telemetry spike that appeared after telegraf was removed. It fixes a bug in the ticker used for telemetry aggregation and adds mutex locks so that metric values are read correctly.
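For readers unfamiliar with the pattern, here is a minimal sketch of ticker-driven aggregation with a mutex-guarded sample store. It is not the code from this PR: `sampleStore`, `percentile`, and `aggregate` are hypothetical names, and treating the `_050`/`_095` metric suffixes as 50th/95th percentiles is an assumption. The sketch shows only the general shape: one `time.Ticker` created once outside the loop, and all reads and writes of the shared samples done under a lock.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"sync"
	"time"
)

// sampleStore is a hypothetical container for raw usage samples
// (for example CPU or memory readings). Not taken from the PR.
type sampleStore struct {
	mu      sync.Mutex
	samples []float64
}

// Add records one sample while holding the lock.
func (s *sampleStore) Add(v float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.samples = append(s.samples, v)
}

// Drain returns the samples collected so far and resets the store,
// so each aggregation window is reported at most once.
func (s *sampleStore) Drain() []float64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := s.samples
	s.samples = nil
	return out
}

// percentile returns the nearest-rank percentile p (0 < p <= 100).
// It sorts a copy, so the caller's slice is left untouched.
func percentile(vals []float64, p float64) float64 {
	if len(vals) == 0 {
		return 0
	}
	sorted := append([]float64(nil), vals...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// aggregate calls emit once per tick with the p50/p95 of the samples
// gathered since the previous tick, until stop is closed.
func aggregate(s *sampleStore, interval time.Duration, stop <-chan struct{}, emit func(p50, p95 float64)) {
	ticker := time.NewTicker(interval) // created once, outside the loop
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			vals := s.Drain()
			if len(vals) == 0 {
				continue // nothing collected in this window
			}
			emit(percentile(vals, 50), percentile(vals, 95))
		case <-stop:
			return
		}
	}
}

func main() {
	store := &sampleStore{}
	for i := 1; i <= 100; i++ {
		store.Add(float64(i))
	}
	stop := make(chan struct{})
	done := make(chan struct{})
	go func() {
		aggregate(store, 50*time.Millisecond, stop, func(p50, p95 float64) {
			fmt.Printf("p50=%.0f p95=%.0f\n", p50, p95)
		})
		close(done)
	}()
	time.Sleep(120 * time.Millisecond)
	close(stop)
	<-done
}
```

Draining under the lock keeps a concurrent `Add` from racing with the read, and reusing a single ticker avoids emitting more than one aggregate per interval. Whether either matches the actual root cause fixed here is not stated in this description.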

test cluster

test image: 6.8.14-fixhispiketelemetry-06-11-2024-b477f427

AI resource

AI query:

customMetrics
| where customDimensions contains "testrecalertssohamksm"
| extend agentversion = tostring(customDimensions.agentversion)
| where agentversion !contains "win"
| where customDimensions.agentversion contains "6.8.14-fixhispiketelemetry-06-11-2024-b477f427"
| extend agentversion = strcat(agentversion, "/", name)
| summarize count() by bin(timestamp, 5m), agentversion
| render timechart

The screenshot below shows that telemetry volume has dropped with the ticker fix. The spike occurred on the following metrics, which use the ticker: otelcollector_cpu_usage_050, otelcollector_cpu_usage_095, metricsextension_cpu_usage_050, metricsextension_cpu_usage_095, metricsextension_memory_rss_050, metricsextension_memory_rss_095, otelcollector_memory_rss_050, otelcollector_memory_rss_095.

image

Memory usage of the pods is no longer high.

image

New Feature Checklist

Tests Checklist

Sohamdg081992 commented 3 months ago

Closing in favor of https://github.com/Azure/prometheus-collector/pull/926