PR Description

This PR fixes the telemetry spike seen after removing telegraf. It fixes a bug in the ticker used for telemetry aggregation and adds mutex locks so that metric values are read correctly.
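For illustration, here is a minimal Go sketch of the pattern being fixed; all names here (`telemetrySampler`, `Record`, `flush`, `run`) are hypothetical, not the actual collector code. A mutex guards the sample window that the collection path writes to, and a single long-lived ticker drives the periodic p50/p95 flush:

```go
package telemetry

import (
	"sort"
	"sync"
	"time"
)

// telemetrySampler aggregates raw CPU/memory readings between flushes.
type telemetrySampler struct {
	mu      sync.Mutex // guards samples so a flush never reads a half-updated window
	samples []float64
}

// Record is called from the collection path whenever a new reading arrives.
func (s *telemetrySampler) Record(v float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.samples = append(s.samples, v)
}

// flush computes p50/p95 under the lock and resets the window.
func (s *telemetrySampler) flush() (p50, p95 float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.samples) == 0 {
		return 0, 0
	}
	sorted := append([]float64(nil), s.samples...)
	sort.Float64s(sorted)
	p50 = sorted[len(sorted)/2]
	p95 = sorted[len(sorted)*95/100]
	s.samples = s.samples[:0]
	return p50, p95
}

// run creates the ticker once and reuses it. Recreating the ticker on
// every iteration (or leaking extra emitting goroutines) is the kind of
// ticker bug that multiplies the emission rate into a telemetry spike.
func (s *telemetrySampler) run(interval time.Duration, emit func(p50, p95 float64)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		emit(s.flush())
	}
}
```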
Test cluster: testrecalertssohamksm
Test image: 6.8.14-fixhispiketelemetry-06-11-2024-b477f427
AI query (against the test Application Insights resource):

```kusto
customMetrics
| where customDimensions contains "testrecalertssohamksm"
| extend agentversion = tostring(customDimensions.agentversion)
| where agentversion !contains "win"
| where customDimensions.agentversion contains "6.8.14-fixhispiketelemetry-06-11-2024-b477f427"
| extend agentversion = strcat(agentversion, "/", name)
| summarize count() by bin(timestamp, 5m), agentversion
| render timechart
```
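The query counts telemetry records in 5-minute bins, split by agent version and metric name, so the drop in volume after the fix is directly visible in the timechart.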
The screenshot below shows that the telemetry volume has gone down with the ticker fix. The spike was occurring on the following metrics, which use the ticker: otelcollector_cpu_usage_050, otelcollector_cpu_usage_095, metricsextension_cpu_usage_050, metricsextension_cpu_usage_095, metricsextension_memory_rss_050, metricsextension_memory_rss_095, otelcollector_memory_rss_050, otelcollector_memory_rss_095. The memory usage of the pods is no longer high.
New Feature Checklist
[ ] List telemetry added about the feature.
[ ] Link to the one-pager about the feature.
[ ] List any tasks necessary for release (3P docs, AKS RP chart changes, etc.) after merging the PR.
[ ] Attach results of scale and perf testing.
Tests Checklist
[ ] Have end-to-end Ginkgo tests been run on your cluster and passed? To bootstrap your cluster to run the tests, follow these instructions.
Labels used when running the tests on your cluster:
[ ] operator
[ ] windows
[ ] arm64
[ ] arc-extension
[ ] fips
[ ] Have new tests been added? For features, have tests been added for this feature? For fixes, is there a test that could have caught this issue and could validate that the fix works?
[ ] Is a new scrape job needed?
[ ] The scrape job was added to the folder test-cluster-yamls in the correct configmap or as a CR.
[ ] Was a new test label added?
[ ] A string constant for the label was added to constants.go.
[ ] The label and description were added to the test README.