PR Description

This PR fixes the telemetry spike seen after removing telegraf. It fixes a bug in the ticker used for telemetry aggregation and adds mutex locks so that metric values are read correctly.
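For illustration, here is a minimal Go sketch of the pattern being fixed; all names here (`telemetrySampler`, `Record`, `flush`, `run`) are hypothetical, not the actual collector code. A mutex guards the sample window that the collection path writes to, and a single long-lived ticker drives the periodic p50/p95 flush:

```go
package telemetry

import (
	"sort"
	"sync"
	"time"
)

// telemetrySampler aggregates raw CPU/memory readings between flushes.
type telemetrySampler struct {
	mu      sync.Mutex // guards samples so a flush never reads a half-updated window
	samples []float64
}

// Record is called from the collection path whenever a new reading arrives.
func (s *telemetrySampler) Record(v float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.samples = append(s.samples, v)
}

// flush computes p50/p95 under the lock and resets the window.
func (s *telemetrySampler) flush() (p50, p95 float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.samples) == 0 {
		return 0, 0
	}
	sorted := append([]float64(nil), s.samples...)
	sort.Float64s(sorted)
	p50 = sorted[len(sorted)/2]
	p95 = sorted[len(sorted)*95/100]
	s.samples = s.samples[:0]
	return p50, p95
}

// run creates the ticker once and reuses it. Recreating the ticker on
// every iteration (or leaking extra emitting goroutines) is the kind of
// ticker bug that multiplies the emission rate into a telemetry spike.
func (s *telemetrySampler) run(interval time.Duration, emit func(p50, p95 float64)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		emit(s.flush())
	}
}
```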
Test cluster: testrecalertssohamksm
Test image: 6.8.14-fixhispiketelemetry-06-11-2024-b477f427
AI query (against the test Application Insights resource):

```kusto
customMetrics
| where customDimensions contains "testrecalertssohamksm"
| extend agentversion = tostring(customDimensions.agentversion)
| where agentversion !contains "win"
| where customDimensions.agentversion contains "6.8.14-fixhispiketelemetry-06-11-2024-b477f427"
| extend agentversion = strcat(agentversion, "/", name)
| summarize count() by bin(timestamp, 5m), agentversion
| render timechart
```
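The query counts telemetry records in 5-minute bins, split by agent version and metric name, so the drop in volume after the fix is directly visible in the timechart.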
The screenshot below shows that the telemetry volume has gone down with the ticker fix. The spike was occurring on the following metrics, which use the ticker: otelcollector_cpu_usage_050, otelcollector_cpu_usage_095, metricsextension_cpu_usage_050, metricsextension_cpu_usage_095, metricsextension_memory_rss_050, metricsextension_memory_rss_095, otelcollector_memory_rss_050, otelcollector_memory_rss_095. The memory usage of the pods is no longer high.
New Feature Checklist
[ ] List telemetry added about the feature.
[ ] Link to the one-pager about the feature.
[ ] List any tasks necessary for release (3P docs, AKS RP chart changes, etc.) after merging the PR.
[ ] Attach results of scale and perf testing.
Tests Checklist
[ ] Have end-to-end Ginkgo tests been run on your cluster and passed? To bootstrap your cluster to run the tests, follow these instructions.
Labels used when running the tests on your cluster:
[ ] operator
[ ] windows
[ ] arm64
[ ] arc-extension
[ ] fips
[ ] Have new tests been added? For features, have tests been added for this feature? For fixes, is there a test that could have caught this issue and could validate that the fix works?
[ ] Is a new scrape job needed?
[ ] The scrape job was added to the folder test-cluster-yamls in the correct configmap or as a CR.
[ ] Was a new test label added?
[ ] A string constant for the label was added to constants.go.
[ ] The label and description were added to the test README.