elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Polling behavior with Cloudwatch metrics not matching expectations #38555

Open jkomara opened 8 months ago

jkomara commented 8 months ago

I have been working with @kaiyan-sheng on importing AWS MSK metrics with elastic-agent. I am using the elastic-agent standalone deployment template for Kubernetes to deploy elastic-agent with a standalone configuration generated by the AWS MSK integration. This was initially set up to poll every minute by setting period to 1m. The cost incurred at that frequency was high, so period was set to 5m and data_granularity was set to 1m. When I checked back in on the usage report, the number of metrics being requested and the frequency of the GetMetricData (GMD) API calls had not changed; GMD was still being called every 1m.

I ran through a few iterations to see if I could get it to behave as I would expect. I set the polling period to 10m and the data_granularity to 1m.
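
The only stream-level keys I am changing between these iterations are `period` and `data_granularity`. A trimmed excerpt of the full config (included at the end of this comment), with the values from this test:

```yaml
streams:
  - id: aws/metrics-aws.kafka_metrics-UUID
    metricsets:
      - cloudwatch
    period: 10m           # how often I expect GetMetricData to be called
    data_granularity: 1m  # resolution of the datapoints requested from CloudWatch
```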

[image: graph of GetMetricData and ListMetrics API call frequency before and after changing data_granularity]

In the image above you can see that, up until 16:44, GMD and the ListMetrics API were being hit every minute. My expectation was that they would be hit every 10 minutes. You can see a break in the graph; that is where I updated the data_granularity to 2m. When polling resumes at 16:46 you can see that the APIs are still being hit every minute.

I then updated the data_granularity to 5m and saw the behavior that I was expecting.
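
For comparison, the same excerpt with the combination that finally behaved as expected (this matches the full config below):

```yaml
streams:
  - id: aws/metrics-aws.kafka_metrics-UUID
    metricsets:
      - cloudwatch
    period: 10m
    data_granularity: 5m  # with this value GMD is called every 10m and datapoints arrive every 5m
```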

[image: graph showing GetMetricData being called every 10 minutes]

In this image you can see that GMD is being hit every 10 minutes as expected, and I verified that I have datapoints every 5 minutes in my ES cluster. I am waiting on the usage and cost data to populate in AWS Cost Explorer so that I can see how many metrics were fetched and the associated cost; I can add that here when it is ready if that helps. I have included the configuration that I am using below. The only change I am making between runs is updating the period or the data_granularity.

Input Config

```yaml
inputs:
  - id: aws/metrics-kafka-UUID
    name: aws-1
    revision: 1
    type: aws/metrics
    use_output: default
    meta:
      package:
        name: aws
        version: 2.13.0
    data_stream:
      namespace: default
    package_policy_id: UUID
    streams:
      - id: aws/metrics-aws.kafka_metrics-UUID
        data_stream:
          dataset: aws.kafka_metrics
          type: metrics
        metricsets:
          - cloudwatch
        period: 10m
        data_granularity: 5m
        access_key_id: ***
        secret_access_key: ***
        regions:
          - eu-west-1
        latency: 5m
        tags_filter: null
        metrics:
          - name:
              - BytesInPerSec
              - BytesOutPerSec
              - EstimatedMaxTimeLag
              - FetchMessageConversionsPerSec
              - MaxOffsetLag
              - MessagesInPerSec
              - ProduceMessageConversionsPerSec
              - SumOffsetLag
            namespace: AWS/Kafka
            resource_type: kafka
            statistic:
              - Sum
```
elasticmachine commented 8 months ago

Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services)

jkomara commented 7 months ago

I have an update. I believe that this was an issue with my setup. I am running the official elastic-agent container. During recent testing I noticed a lot of these error messages:

```
Unit state changed aws/metrics-default-aws/metrics-kafka-7f32abc7-0094-4916-8e6b-34f22bf9c1bb (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '843' exited with code '-1'
```

In addition, I was no longer seeing metrics show up in my ES cluster. After some troubleshooting, @agunnerson-elastic realized that the process for aws/metrics-default-aws/metrics-kafka-7f32abc7-0094-4916-8e6b-34f22bf9c1bb was being OOM-killed. This was not easy to figure out since the pod was never restarted and the memory usage graphs showed memory use in an acceptable range. Since elastic-agent is the entrypoint process for the container and the Kafka component is managed by elastic-agent, the pod continued to run and endlessly restarted the aws/metrics-default-aws/metrics-kafka-7f32abc7-0094-4916-8e6b-34f22bf9c1bb process.

My theory for why I was seeing data (and the increased polling cycles) before, but nothing this time, is the volume of metrics. When I first ran this we were collecting ~20k metrics; this time it was ~40k. I am wondering if 20k was small enough that metrics started being indexed before the process crashed, while 40k was simply too much and the process crashed before it could index any data. There were a few instances with 40k metrics where some data did show up.

After giving the pod more memory, I tested polling at a 5m interval with 1m granularity and things are looking good.
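
For anyone hitting the same thing, the change on my side was just raising the memory limit on the agent container in the standalone Kubernetes manifest. A sketch of that change; the container name and the 2Gi value are illustrative for my metric volume, not a general recommendation:

```yaml
# Excerpt of the elastic-agent standalone DaemonSet manifest; values are illustrative
containers:
  - name: elastic-agent-standalone
    resources:
      limits:
        memory: 2Gi   # bumped the memory limit so the aws/metrics component is no longer OOM-killed
```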

[image: graph of polling at a 5m period with 1m granularity after the memory increase]

I am going to continue to monitor and try some different polling and granularity intervals to see if the issue shows up again.

jkomara commented 7 months ago

The data from yesterday was just made available in Cost Explorer and I can confirm that our usage and billing rates are in line with what we would expect them to be.

[image: AWS Cost Explorer usage and billing for the test day]

(Ignore the first two hours; that was me testing.)

@kaiyan-sheng do you want me to leave this open to investigate the OOMKilled issue or close it out?

kaiyan-sheng commented 6 months ago

Thank you so much @jkomara for the testing!! I will probably open a separate issue for OOMKilled and link it to this one. But let's keep this one open till I have a new issue created. Thanks again!!