elastic / integrations


AWS Billing Integration causes large spike in CostExplorer charges #7350

Open AndyDevman opened 1 year ago

AndyDevman commented 1 year ago

Summary

The Elastic Agent AWS Billing integration appears to repeatedly stop and start, which causes a large number of GetCostAndUsage API calls. This in turn incurs significant charges.

Description

We have recently been retesting the Billing part of the AWS Integration for Elastic Agent in our K8s environment.

We have the Elastic Agents running as a daemonset in the k8s cluster and they are all managed by Fleet.

The versions we are running are as follows:

Elastic Agent/Cluster version: 8.7.1
AWS integration version: 1.42.0
EKS version: 1.25

We have configured the Billing settings with a period of 24 hours.

Unfortunately, since re-enabling the Billing facet of the AWS integration, we have observed repeated stopping / starting of the AWS Billing process. Here is an excerpt from the Elastic Agent logs.

[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-.......... (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '138198' exited with code '-1'

[elastic_agent][info] Spawned new unit aws/metrics-default-aws/metrics-billing-......................: Starting: spawned pid '138687'

[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-..................... (STARTING->HEALTHY): Healthy

[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-........................... (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '138687' exited with code '-1'

[elastic_agent][info] Spawned new unit aws/metrics-default-aws/metrics-billing-.........................: Starting: spawned pid '139045'

[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-......... (STARTING->HEALTHY): Healthy

As a result of this, we are seeing repeated GetCostAndUsage API calls flagged in CloudTrail.

[screenshot: CloudTrail event history showing repeated GetCostAndUsage calls]
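For reference, a minimal boto3 sketch (not part of the integration) that tallies the GetCostAndUsage events CloudTrail has recorded over the last 24 hours; it assumes default AWS credentials and that Cost Explorer activity is logged in us-east-1:

    # Count GetCostAndUsage events logged by CloudTrail in the last 24 hours.
    # Assumptions: default credentials, Cost Explorer activity recorded in us-east-1.
    from datetime import datetime, timedelta, timezone
    import boto3

    ct = boto3.client("cloudtrail", region_name="us-east-1")
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=24)

    count = 0
    for page in ct.get_paginator("lookup_events").paginate(
        LookupAttributes=[{"AttributeKey": "EventName",
                           "AttributeValue": "GetCostAndUsage"}],
        StartTime=start,
        EndTime=end,
    ):
        count += len(page["Events"])

    print(f"GetCostAndUsage calls in the last 24h: {count}")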

This activity is subsequently reflected in the large Cost Explorer spikes we are seeing.

[screenshot: Cost Explorer showing the cost spike]

This $1,000+ cost spike is unfortunately entirely related to the AWS Billing Integration.

We initially thought that the issue could be related to the pod resource limits that are set up for our Elastic Agents - see below.

[screenshot: Elastic Agent pod resource limits]

But we monitored the Elastic Agent pods using watch during the period when we enabled the Billing integration, and this didn't seem to indicate that the resource limits were being hit.

[screenshot: Elastic Agent pod resource usage observed with watch]

We also watched for events in the namespace where the elastic-agent pods are running, and we didn't observe any tell-tale events such as OOM kills.
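For what it's worth, the same check can be scripted; below is a minimal sketch using the Python kubernetes client that scans namespace events for anything OOM-related (the namespace name is an assumption):

    # Scan events in the agent namespace for OOM-related reasons.
    # The namespace name "elastic-agent" is an assumption for illustration.
    from kubernetes import client, config

    config.load_kube_config()  # use config.load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()

    for ev in v1.list_namespaced_event(namespace="elastic-agent").items:
        reason = ev.reason or ""
        if "OOM" in reason or "Kill" in reason:
            print(ev.last_timestamp, ev.involved_object.name, reason, ev.message)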

In the snapshot from CloudTrail above, I have highlighted that we quickly hit the throttling threshold for GetCostAndUsage API calls. This is represented by the error code ThrottlingException.

Below is more detail on the ThrottlingException error

[screenshot: CloudTrail ThrottlingException event detail]

Note that the group-by dimension keys referenced in this event are not always the same.
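For context, the group-by keys travel in the body of each request. Below is a minimal boto3 sketch of what a single GetCostAndUsage request looks like; the group-by keys shown are illustrative only and may not match the ones the integration actually sends:

    # Illustrative single GetCostAndUsage request; group-by keys are assumptions.
    from datetime import date, timedelta
    import boto3

    ce = boto3.client("ce")  # Cost Explorer; AWS bills each request to this API
    end = date.today()
    start = end - timedelta(days=1)

    response = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[
            {"Type": "DIMENSION", "Key": "SERVICE"},      # example dimension key
            {"Type": "TAG", "Key": "aws:createdBy"},      # example tag key
        ],
    )
    print(response["ResultsByTime"])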

zmoog commented 1 year ago

@AndyDevman, IIRC, the AWS Billing integration should be executed by one Agent only.

If you deploy the Agent as a daemonset, k8s will probably spawn one Agent on each node, resulting in multiple Agents sending requests that potentially hit API limits and get throttled.

Recommended next actions:

AndyDevman commented 1 year ago

@zmoog , thanks for the reply

There was definitely one Elastic Agent per node since it's a daemonset, and looking at the source IP addresses associated with the GetCostAndUsage API calls in the CloudTrail events, I can see a match for each agent.

As of today, I have been testing running the Elastic Agent on only a single node in the cluster. For this testing, I've limited the scope of the daemonset using a nodeSelector referencing one of the nodes.

In this setup, when I enabled the AWS Billing integration, I noticed that the metricbeat / billing process on the agent still appeared to stop / start intermittently. It did this over a period of about 8 minutes, with approximately 950 GetCostAndUsage API calls appearing in CloudTrail.

Interestingly, when running the agent like this on a single node this morning, the API calls appear to have stopped after those 8 minutes of activity. I guess this indicates that the Billing integration considered the collection successful for that 24-hour period and shouldn't run again. 950 GetCostAndUsage API calls still feels like quite a lot, though. Is there a way to estimate how many GetCostAndUsage API calls should be expected for a given number of group-by dimensions and tags? I have a feeling that if the AWS Billing integration process were not stopping / starting, this number of calls would be quite a bit lower.
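As a rough back-of-the-envelope illustration only (assuming the published Cost Explorer API price of $0.01 per paginated request; the node and burst counts below are hypothetical, not measured):

    # Back-of-the-envelope cost estimate; all multipliers below are hypothetical.
    COST_PER_REQUEST_USD = 0.01     # assumed Cost Explorer API price per request

    observed_calls = 950            # calls seen in CloudTrail over ~8 minutes, one agent
    print(f"Single-agent burst: ${observed_calls * COST_PER_REQUEST_USD:.2f}")

    # If every agent in a daemonset repeated a similar burst several times a day
    # because of the restart loop, the charge scales quickly:
    nodes = 20                      # hypothetical daemonset size
    bursts_per_day = 6              # hypothetical restart churn
    daily = observed_calls * nodes * bursts_per_day * COST_PER_REQUEST_USD
    print(f"Hypothetical cluster-wide daily cost: ${daily:.2f}")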

The ThrottlingException error hasn't appeared in CloudTrail today. So from this initial look, at least, reducing the number of agents, and with it the number of GetCostAndUsage API calls from the AWS Billing integration, looks like it might keep things below the throttling threshold.

I have been trying to collect the diagnostics logs via the CLI and also from the agent properties in the Fleet UI, but unfortunately every time I try to collect them the Elastic Agent restarts.

[screenshot: agent restarting during diagnostics collection]

A result of this is that I noticed the metricbeat / billing process re-runs and we see more GetCostAndUsage API calls, which unfortunately come at a cost. I wonder if there is a way to configure the Billing integration on Elastic Agent so that once it has successfully collected the AWS billing metrics within the set period (in my case 24 hours), it won't attempt to re-run the billing metrics collection regardless of how many times the agent restarts. Perhaps this could be something that users can toggle on/off if they accept that there could be additional charges, etc.
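To make the suggestion concrete, here is a purely hypothetical sketch of the kind of guard I have in mind; nothing like this exists in the integration today, and the state-file path is made up:

    # Hypothetical sketch only: persist the last successful collection time and
    # skip re-runs within the configured period, even across process restarts.
    import json
    import time
    from pathlib import Path

    STATE_FILE = Path("/var/lib/elastic-agent/billing_checkpoint.json")  # assumed path
    PERIOD_SECONDS = 24 * 60 * 60   # my configured 24-hour period

    def should_collect(now: float | None = None) -> bool:
        now = now if now is not None else time.time()
        if STATE_FILE.exists():
            last = json.loads(STATE_FILE.read_text()).get("last_success", 0)
            if now - last < PERIOD_SECONDS:
                return False        # already collected in this period; skip
        return True

    def mark_success(now: float | None = None) -> None:
        STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
        STATE_FILE.write_text(json.dumps({"last_success": now if now is not None else time.time()}))

    if should_collect():
        # ... collect the billing metrics here ...
        mark_success()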

botelastic[bot] commented 1 month ago

Hi! We just realized that we haven't looked into this issue in a while. We're sorry! We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:. Thank you for your contribution!