
Add support for Kubernetes Leader Election for integrations running as a K8s Daemon Set #7362

Open jamesagarside opened 1 year ago

jamesagarside commented 1 year ago

It would be great if the Kubernetes Leader Election provider was exposed to integrations so there isn't a risk of collecting data twice for certain integrations. https://www.elastic.co/guide/en/fleet/master/kubernetes_leaderelection-provider.html

An example of where this could be applied is the AWS Billing integration. If you run this integration in a policy applied to a Kubernetes Elastic Agent DaemonSet, the AWS billing API is polled by every Agent in that DaemonSet, which can quickly run up cloud costs, as discussed in #7350. The irony of collecting billing metrics generating bills is not lost.

Instead, it would be great if, on a per-integration basis, it were possible to instruct an integration to run only on the Elastic Agent that is currently the leader.

This could also be defined at the processor level if the Kubernetes Leader Election variables were exposed to the integration, as demonstrated below.

- drop_event:
    when:
      equals:
        ${kubernetes_leaderelection.leader}: false

This method, however, doesn't solve the problem of each Agent calling the API, as events are only dropped after collection but before being sent to Elasticsearch.

I know we suggest running a second Agent as a Deployment with a dedicated policy, but that adds more overhead than needed, especially when there is precedent for this functionality within the Kubernetes integration.
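
For reference, this is roughly how the provider already gates collection in standalone Agent policies (a sketch based on the linked docs and the standalone Kubernetes manifest; the dataset, hosts and period below are just examples). If the same condition mechanism were exposed to integrations such as AWS, non-leader Agents would never call the API at all:

inputs:
  - type: kubernetes/metrics
    data_stream:
      namespace: default
    streams:
      - data_stream:
          dataset: kubernetes.state_pod
          type: metrics
        metricsets:
          - state_pod
        hosts:
          - "kube-state-metrics:8080"
        period: 10s
        # Only the Agent currently holding the leader lease runs this stream
        condition: ${kubernetes_leaderelection.leader} == true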

AndyDevman commented 1 month ago

I'd like to jump on this too. It's really great that leader election has recently been added for the billing part of the AWS integration, but we'd really like to see it introduced for the other parts of the AWS integration as well.

Two parts of the AWS integration where this would be particularly beneficial are CloudWatch Logs and CloudTrail logs collected from CloudWatch.


Background

We have observed that if either of these collectors from the AWS integration is applied to a policy deployed to Elastic Agents running as a DaemonSet in a Kubernetes cluster, duplicate API calls are made from each Elastic Agent.

This leads to increased costs from the unnecessary duplicate API calls, and also has a knock-on effect on Elastic cluster performance, which manifests as consistently very high JVM heap usage and a sluggish Kibana UI.

For our specific use case, we were at times seeing approximately 4,000 documents per second being rejected by the Elastic cluster.

Unfortunately, it isn't immediately obvious that these parts of the AWS integration behave in this manner, and it took us a while to work out why our cluster was running slowly.

Potential Workarounds

CloudTrail logs

For the CloudTrail logs, we can set things up so that our CloudTrail log source delivers to an S3 bucket with an SQS queue, and have the Elastic Agents pick it up using this option from the AWS integration.


CloudWatch Logs

For CloudWatch logs, there doesn't appear to be a built-in option in the AWS integration that lets the Elastic Agent collect CloudWatch logs from an S3 bucket/SQS queue.

We are assuming we might be able to set up a Lambda function in AWS that calls create_export_task from the AWS Python SDK to periodically export our CloudWatch logs to an S3 bucket with an SQS queue.
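
A minimal sketch of that idea, assuming a hypothetical log group, destination bucket, and a one-hour export window triggered on a schedule (all names below are placeholders, not a tested setup):

import os
import time

import boto3  # AWS Python SDK

logs = boto3.client("logs")

# Placeholder names for illustration only
LOG_GROUP = os.environ.get("LOG_GROUP", "/aws/example/application")
DEST_BUCKET = os.environ.get("DEST_BUCKET", "example-cloudwatch-export-bucket")


def handler(event, context):
    """Export the last hour of CloudWatch logs to S3.

    Intended to be triggered on a schedule (e.g. an EventBridge rule);
    S3 event notifications on the bucket would then feed the SQS queue
    that the Elastic Agent polls.
    """
    now_ms = int(time.time() * 1000)
    one_hour_ago_ms = now_ms - 60 * 60 * 1000

    task = logs.create_export_task(
        taskName=f"cloudwatch-export-{now_ms}",
        logGroupName=LOG_GROUP,
        fromTime=one_hour_ago_ms,
        to=now_ms,
        destination=DEST_BUCKET,
        destinationPrefix="cloudwatch-logs",
    )
    return {"taskId": task["taskId"]}

One caveat we would have to design around: CloudWatch Logs only allows one active export task per account at a time, so exports for multiple log groups would need to be serialised.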

For Both

Another potential workaround, applicable in both cases, would be to:

  1. Set up additional agent policies specifically for each of these facets of the AWS integration.
  2. Deploy an additional standalone Elastic Agent, or a Deployment with a single replica, for each of these additional agent policies (a rough sketch follows this list).
  3. Enrol the additional Elastic Agents so that they receive only the agent policy with that specific part of the AWS integration enabled.
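
As a rough sketch of step 2, assuming a Fleet-managed Agent enrolled via environment variables (the image tag, namespace, Fleet Server URL and secret name are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: elastic-agent-aws
  namespace: kube-system
spec:
  replicas: 1          # single replica so the AWS APIs are only polled once
  selector:
    matchLabels:
      app: elastic-agent-aws
  template:
    metadata:
      labels:
        app: elastic-agent-aws
    spec:
      containers:
        - name: elastic-agent
          image: docker.elastic.co/elastic-agent/elastic-agent:8.13.0  # placeholder version
          env:
            - name: FLEET_ENROLL
              value: "1"
            - name: FLEET_URL
              value: "https://fleet-server.example:8220"   # placeholder
            - name: FLEET_ENROLLMENT_TOKEN
              valueFrom:
                secretKeyRef:
                  name: aws-policy-enrollment-token        # placeholder secret
                  key: token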

All of these options involve extra setup for the end user, and potentially additional cost and management overhead.

Proposal

It would be really great if it were possible to enable leader election for the other parts of the AWS integration.

Whilst this would inevitably mean that the Elastic Agent delegated as leader could become a bottleneck, this would probably not be an issue for most use cases and would potentially be a good default for these parts of the AWS integration.

This would reduce the likelihood of users experiencing performance issues on the Elastic cluster side, and also reduce the chance of unnecessary calls to AWS APIs and the associated costs.


It would also be really great if there were a specific option in the AWS integration for collecting CloudWatch logs from S3/SQS.

If this were implemented, then for very large log volumes, or where users hit a bottleneck at the Elastic Agent delegated as leader, additional configuration such as increasing resource limits for the Elastic Agent or routing logs through S3/SQS would still be an option.