AndyDevman opened this issue 1 year ago
@AndyDevman, IIRC, the AWS Billing integration should be executed by one Agent only.
If you deploy the Agent as a DaemonSet, Kubernetes will spawn one Agent on each node, so multiple Agents end up sending requests and can hit the API limits, which results in throttling.
Recommended next actions:
@zmoog, thanks for the reply.
There was definitely one Elastic Agent per node, as it's a DaemonSet, and looking at the source IP addresses associated with the `GetCostAndUsage` API calls in the CloudTrail events, I can see a match for each agent.
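For anyone who wants to reproduce that check, a minimal boto3 sketch along these lines should work (assumes credentials allowed to call cloudtrail:LookupEvents; the 24-hour window, and the idea of roughly one source IP per agent/node, are just how I read our CloudTrail data):

```python
import json
from collections import Counter
from datetime import datetime, timedelta, timezone

import boto3

cloudtrail = boto3.client("cloudtrail")

end = datetime.now(timezone.utc)
start = end - timedelta(hours=24)

# Count GetCostAndUsage calls per caller source IP (roughly one IP per agent/node).
calls_per_ip = Counter()
paginator = cloudtrail.get_paginator("lookup_events")
for page in paginator.paginate(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "GetCostAndUsage"}],
    StartTime=start,
    EndTime=end,
):
    for event in page["Events"]:
        detail = json.loads(event["CloudTrailEvent"])
        calls_per_ip[detail.get("sourceIPAddress", "unknown")] += 1

for ip, count in calls_per_ip.most_common():
    print(f"{ip}: {count} GetCostAndUsage calls")
```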
As of today, I have been testing running the Elastic Agent on only a single node in the cluster. For this testing, I've limited the scope of the DaemonSet using a `nodeSelector` referencing one of the nodes.
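Something along these lines achieves the same effect programmatically (a sketch with the kubernetes Python client; the node name, DaemonSet name, namespace, and label key are all placeholders to adapt):

```python
from kubernetes import client, config

# Loads ~/.kube/config; use config.load_incluster_config() when running inside a pod.
config.load_kube_config()

core = client.CoreV1Api()
apps = client.AppsV1Api()

# Placeholder values: adjust the node, DaemonSet name, and namespace to your cluster.
node_name = "ip-10-0-1-23.eu-west-1.compute.internal"
label = {"elastic-agent-billing": "true"}

# Label the single node that should keep running the agent.
core.patch_node(node_name, {"metadata": {"labels": label}})

# Add a matching nodeSelector to the agent DaemonSet so only that node schedules a pod.
apps.patch_namespaced_daemon_set(
    name="elastic-agent",
    namespace="kube-system",
    body={"spec": {"template": {"spec": {"nodeSelector": label}}}},
)
```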
In this setup, when I enabled the AWS Billing integration, I noticed that the metricbeat / billing process on the agent still appeared to stop / start intermittently. It did this over a period of about 8 minutes, with approximately 950 `GetCostAndUsage` API calls appearing in CloudTrail.
Interestingly, when running the agent like this on a single node this morning, the API calls appear to have stopped after those 8 minutes of activity. I guess this indicates that the Billing integration considered the collection successful for that 24-hour period and shouldn't run again.
That said, 950 `GetCostAndUsage` API calls still feels like quite a lot. Is there a way to estimate how many `GetCostAndUsage` API calls should be expected for a given number of group-by dimensions and tags? I have a feeling that if the AWS Billing integration process were not stopping / starting, this number of calls would be quite a bit lower.
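My rough mental model (I may be off on how metricbeat batches these) is: a single Cost Explorer request accepts at most two GroupBy entries, so each extra dimension/tag combination means another request, each paginated response adds one call per page, and every restart repeats the whole set. A sketch like this counts the calls needed for one group-by pair; note that running it makes billable Cost Explorer requests itself, and the tag key is hypothetical:

```python
from datetime import date, timedelta

import boto3

ce = boto3.client("ce")  # Cost Explorer

end = date.today()
start = end - timedelta(days=1)

# One group-by pair per request: the API accepts at most two GroupBy entries,
# so each extra dimension/tag combination means another round of requests.
group_by = [
    {"Type": "DIMENSION", "Key": "SERVICE"},
    {"Type": "TAG", "Key": "team"},  # hypothetical tag key
]

calls = 0
token = None
while True:
    kwargs = dict(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=group_by,
    )
    if token:
        kwargs["NextPageToken"] = token
    response = ce.get_cost_and_usage(**kwargs)
    calls += 1
    token = response.get("NextPageToken")
    if not token:
        break

print(f"{calls} GetCostAndUsage call(s) for this group-by pair (each call is billed by AWS)")
```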
The `ThrottlingException` error hasn't appeared in CloudTrail, so from this initial look today at least, reducing the number of agents, and consequently the number of `GetCostAndUsage` API calls from the AWS Billing integration, looks like it might keep things below the throttling limit.
I have been trying to collect the diagnostics logs via the CLI and also from the agent's page in the Fleet UI, but unfortunately every time I try to collect them the Elastic Agent restarts.
As a result, the metricbeat / billing process re-runs and we see more `GetCostAndUsage` API calls, which unfortunately come at a cost. I wonder if there is a way to configure the Billing integration on Elastic Agent so that, once it has successfully collected the AWS Billing metrics within the configured period (24 hours in my case), it won't attempt to re-run the collection regardless of how many times the agent restarts. Perhaps this could be something that users can toggle on/off if they accept that additional charges are possible otherwise.
Hi! We just realized that we haven't looked into this issue in a while. We're sorry! We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1. Thank you for your contribution!
Summary
The Elastic Agent AWS Billing integration appears to repeatedly stop / start, which causes a large number of `GetCostAndUsage` API calls. This in turn incurs significant charges.
Description
We have recently been retesting the Billing part of the AWS Integration for Elastic Agent in our K8s environment.
We have the Elastic Agents running as a daemonset in the k8s cluster and they are all managed by Fleet.
The versions we are running are as follows:
- Elastic Agent / Cluster version: 8.7.1
- AWS integration version: 1.42.0
- EKS version: 1.25
We have configured the Billing settings with a period of 24 hours.
Unfortunately, since re-enabling the Billing facet of the AWS integration, we have observed repeated stopping / starting of the AWS Billing process. Here is an excerpt from the Elastic Agent logs.
As a result of this, we are seeing repeated `GetCostAndUsage` API calls flagged in CloudTrail. This activity is subsequently reflected in the large Cost Explorer spikes we are seeing.
This $1,000+ cost spike is unfortunately entirely related to the AWS Billing Integration.
We initially thought that the issue could be related to the pod resource limits which are set up for our Elastic Agents - see below.
But we monitored the Elastic Agent pods using `watch` during the period when we enabled the Billing integration, and this didn't seem to indicate that the resource limits were being hit. We also watched for events in the namespace where the elastic-agents are running, and we didn't observe any tell-tale events such as OOM kills.
In the snapshot from CloudTrail above, I have highlighted that we quickly hit the threshold for `GetCostAndUsage` API calls. This is represented by the error code `ThrottlingException`.
Below is more detail on the `ThrottlingException` error. Note that the group-by dimension keys referenced in this event are not always the same.
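In case it helps with reproducing this, a small boto3 sketch along these lines can break the `GetCostAndUsage` events down by error code (assumes cloudtrail:LookupEvents permissions; as far as I know, Cost Explorer activity is recorded in us-east-1):

```python
import json
from collections import Counter
from datetime import datetime, timedelta, timezone

import boto3

# Cost Explorer is a global service; its CloudTrail events typically land in us-east-1.
cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")

end = datetime.now(timezone.utc)
start = end - timedelta(hours=24)

# Break GetCostAndUsage calls down by outcome (no errorCode field means success).
outcomes = Counter()
paginator = cloudtrail.get_paginator("lookup_events")
for page in paginator.paginate(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "GetCostAndUsage"}],
    StartTime=start,
    EndTime=end,
):
    for event in page["Events"]:
        detail = json.loads(event["CloudTrailEvent"])
        outcomes[detail.get("errorCode", "Success")] += 1

print(dict(outcomes))  # e.g. counts for 'Success' and 'ThrottlingException'
```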