Azure / prometheus-collector


[Draft PR][Don't merge until upgrade release rolls out] Fix telemetry spike issue in telegraf removal #926

Open Sohamdg081992 opened 1 week ago

Sohamdg081992 commented 1 week ago

PR Description

This PR fixes the telemetry spike that appeared after telegraf was removed. It fixes a bug in the ticker used for telemetry aggregation and adds mutex locks so that metric values are read correctly.
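For readers unfamiliar with the pattern, here is a minimal Go sketch of ticker-driven aggregation guarded by a mutex. It is not the code from this PR: the type `percentileAggregator`, its methods, and the `emit` callback are hypothetical names, and the nearest-rank percentile calculation is an assumption based on the `_050` / `_095` suffixes of the affected metrics. The points it illustrates are that the ticker is created once outside the loop and stopped on exit, and that samples are read and reset under the same lock that writers use, so each interval should produce at most one emission.

```go
package main

import (
	"sort"
	"sync"
	"time"
)

// percentileAggregator is a hypothetical type for illustration only;
// it is not taken from the prometheus-collector source.
type percentileAggregator struct {
	mu      sync.Mutex
	samples []float64
}

// Add records one sample. It is safe to call from multiple goroutines.
func (a *percentileAggregator) Add(v float64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.samples = append(a.samples, v)
}

// Flush returns the 50th and 95th percentiles (nearest-rank) of the samples
// collected since the previous Flush and resets the window.
// ok is false when no samples were collected.
func (a *percentileAggregator) Flush() (p50, p95 float64, ok bool) {
	// Swap the slice out under the lock so writers never see a partial reset.
	a.mu.Lock()
	s := a.samples
	a.samples = nil
	a.mu.Unlock()

	if len(s) == 0 {
		return 0, 0, false
	}
	sort.Float64s(s)
	return nearestRank(s, 50), nearestRank(s, 95), true
}

// nearestRank returns the p-th percentile of a sorted, non-empty slice
// using integer arithmetic for the rank: ceil(p * n / 100).
func nearestRank(sorted []float64, p int) float64 {
	rank := (p*len(sorted) + 99) / 100
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// Run emits one aggregate per interval until stop is closed.
func (a *percentileAggregator) Run(interval time.Duration, stop <-chan struct{}, emit func(p50, p95 float64)) {
	// The ticker is created once, outside the loop, and stopped on return.
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			if p50, p95, ok := a.Flush(); ok {
				emit(p50, p95)
			}
		case <-stop:
			return
		}
	}
}
```

A caller would typically start `Run` in its own goroutine, call `Add` from the code that samples CPU or memory, and close `stop` on shutdown.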

test cluster

test image: 6.8.14-fixhispiketelemetryNew-06-24-2024-c6cbed86

AI resource

AI query:

customMetrics
| where customDimensions contains "testrecalertssohamksm"
| extend agentversion = tostring(customDimensions.agentversion)
| where agentversion !contains "win"
| where customDimensions.agentversion contains "6.8.14-fixhispiketelemetryNew-06-24-2024-c6cbed86"
| extend agentversion = strcat(agentversion, "/", name)
| summarize count() by bin(timestamp, 5m), agentversion
| render timechart

The screenshot below shows that telemetry volume has dropped with the ticker fix. The spike occurred on the following metrics, all of which use the ticker: otelcollector_cpu_usage_050, otelcollector_cpu_usage_095, metricsextension_cpu_usage_050, metricsextension_cpu_usage_095, metricsextension_memory_rss_050, metricsextension_memory_rss_095, otelcollector_memory_rss_050, otelcollector_memory_rss_095.

image

Memory usage of the pods is no longer high.

image

New Feature Checklist

Tests Checklist

github-actions[bot] commented 2 days ago

This PR is stale because it has been open 7 days with no activity. Remove stale label or comment or this will be closed in 5 days.