elastic / elastic-serverless-forwarder

Elastic Serverless Forwarder

Investigate number of API calls of ESF due to ec2:describe_regions usage #723

Closed gizas closed 1 week ago

gizas commented 4 months ago

Describe the enhancement: Currently, to extract the region of a CloudWatch Logs log group or stream, we use the EC2 DescribeRegions function here. There have been reported cases where many concurrent Lambda functions each trigger the DescribeRegions function, raising the number of API calls until throttling events occur.

This issue will investigate alternative solutions to overcome the API throttling in AWS.

Describe a specific use case for the enhancement or feature:

Possible solutions can be:

constanca-m commented 4 months ago

When the function with this part of the code:

https://github.com/elastic/elastic-serverless-forwarder/blob/acbe70242afad1d5061d64fd4d12b7e647de3768/handlers/aws/utils.py#L397

is called, it means ESF was triggered by a CloudWatch Logs group. I don't think a CloudWatch Logs group can be in a different region than ESF. Unfortunately, I cannot find documentation on this, but I tried using a different region and did not succeed, so I am not sure. If this is correct, then getting the region should be very simple, as the only thing we have to do is this:

import os
region = os.environ['AWS_REGION']

Reference here.

bturquet commented 3 months ago

@constanca-m could we use the Terraform option to let some customers set the regions? Would that require creating a new variable?

constanca-m commented 3 months ago

I think what I suggested in https://github.com/elastic/elastic-serverless-forwarder/issues/723#issuecomment-2124827063 would work for all types of deployment. But as I mentioned, I don't know if lambda can communicate with any resources in a different region - I think cloudwatch logs needs to be in the same one. If that is the case, that could be a solution. @bturquet

constanca-m commented 1 week ago

Update on this issue

The problem comes from this line:

https://github.com/elastic/elastic-serverless-forwarder/blob/8be4fc4b87a06aa857e0cbc227f55f0533d89035/handlers/aws/handler.py#L147-L152

We need input_id so we can know where to send the data (this is inside event_input). This information originates in the configuration provided by the user, which is stored in the ESF bucket, and it is loaded into ESF at startup when config.yaml is parsed. Each input_id is unique and should map to an Input specified in the configuration file.

How do we obtain the input_id for each specific trigger?

We have 4 possible triggers:

  1. cloudwatch-logs -> input_id not available in the event that triggers ESF
  2. kinesis-data-stream -> input_id available in lambda_event (this is the event that triggers ESF)
  3. s3-sqs -> input_id available in lambda_event
  4. sqs -> input_id available in lambda_event

I wanted to know what is inside lambda_event when it comes from a CloudWatch Logs group, so I sent a message to a log stream to trigger it. This is the lambda_event that my ESF received.

{
   "awslogs":{
      "data":"H4sIAAAAAAAA/42QPWvDMBRF/0p4swX6lqzNUDdTJ2croTjJqyuwJaOntJSQ/17c0L3LHS6cc+HeYEGiccLD94oQ4Kk7dG8v/TB0+x4ayF8JCwSw0klvleFCaWhgztO+5OsKAc45UR3TeWQXpHOJJ2QFp5gTMaR3VpHqAxhqwXGBAJQXZHOeGD2aBuh62tC1xpye41yxEITXf6mPv+7+E1PdmBvECwRQ3gijlNbKGa9VK6y2rd3Sc6tla6zx3nLluOReWNsaL7n0HhqocUGq47JCEE467p2SngvR/B0FAcaU6weW3ZynHW7TcD/efwDlkGd8SgEAAA=="
   }
}

We decode the data field, which looks like this:

{
   "messageType":"DATA_MESSAGE",
   "owner":"627286350134",
   "logGroup":"constanca-describe-regions-esf-test",
   "logStream":"some-log-stream",
   "subscriptionFilters":[
      "constanca-describe-regions-esf-test"
   ],
   "logEvents":[
      {
         "id":"38515334437584391646961646806429565886037020816695820288",
         "timestamp":1727087328011,
         "message":"another log event"
      }
   ]
}
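The decoding step can be sketched roughly as follows. CloudWatch Logs delivers the data field as a base64-encoded, gzip-compressed JSON document. The helper name `decode_awslogs_data` is made up for this illustration and is not necessarily how ESF does it, and the example builds its own small payload instead of reusing the one above.

```python
import base64
import gzip
import json


def decode_awslogs_data(lambda_event: dict) -> dict:
    """Decode the base64 + gzip payload found under awslogs.data.

    Hypothetical helper, for illustration only.
    """
    compressed = base64.b64decode(lambda_event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


# Build a small event the same way the payload is packaged, then decode it.
sample = {
    "messageType": "DATA_MESSAGE",
    "logGroup": "constanca-describe-regions-esf-test",
    "logStream": "some-log-stream",
}
encoded = base64.b64encode(
    gzip.compress(json.dumps(sample).encode("utf-8"))
).decode("ascii")

decoded = decode_awslogs_data({"awslogs": {"data": encoded}})
```

Note that nothing in the decoded document carries the region, which is the gap discussed in this comment.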

In our config.yaml file we need to provide the input_id for ESF as the cloudwatch ARN, in this format: arn:aws:logs:{region}:{account-id}:log-group:{log_group_name}:* or as arn:aws:logs:{region}:{account-id}:log-group:{log_group_name}:log-stream:{log-stream-name} (see official documentation). So what do we have in this data field that we can use?

So we are only missing the region to build the input_id, which in turn tells us which output to send the data to.
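As a rough sketch, the two ARN formats above can be filled in from the decoded event once a region is known. The function name `build_input_id_candidates` and the region value `eu-west-1` are assumptions for illustration; the other values are taken from the decoded event shown earlier.

```python
from typing import List


def build_input_id_candidates(decoded: dict, region: str) -> List[str]:
    """Return the log-group and log-stream ARNs for a decoded awslogs event.

    Hypothetical helper: it only fills in the two ARN formats that
    config.yaml accepts as input_id for the cloudwatch-logs trigger.
    """
    group_arn = (
        f"arn:aws:logs:{region}:{decoded['owner']}"
        f":log-group:{decoded['logGroup']}"
    )
    return [
        f"{group_arn}:*",
        f"{group_arn}:log-stream:{decoded['logStream']}",
    ]


decoded_event = {
    "owner": "627286350134",
    "logGroup": "constanca-describe-regions-esf-test",
    "logStream": "some-log-stream",
}

# "eu-west-1" is an assumed region, used only to make the example concrete.
candidates = build_input_id_candidates(decoded_event, "eu-west-1")
```

Either candidate could then be looked up among the inputs parsed from config.yaml.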

How do we obtain the input_id for cloudwatch trigger then?

Currently, we call the EC2 DescribeRegions API every time an event from a CloudWatch Logs group triggers ESF.

Here is a sample of the result of this call in my test.

```json
{
   "Regions":[
      { "Endpoint":"ec2.ap-south-2.amazonaws.com", "RegionName":"ap-south-2", "OptInStatus":"not-opted-in" },
      { "Endpoint":"ec2.ap-south-1.amazonaws.com", "RegionName":"ap-south-1", "OptInStatus":"opt-in-not-required" },
      { "Endpoint":"ec2.eu-south-1.amazonaws.com", "RegionName":"eu-south-1", "OptInStatus":"not-opted-in" },
      { "Endpoint":"ec2.eu-south-2.amazonaws.com", "RegionName":"eu-south-2", "OptInStatus":"not-opted-in" },
      { "Endpoint":"ec2.me-central-1.amazonaws.com", "RegionName":"me-central-1", "OptInStatus":"not-opted-in" },
      { "Endpoint":"ec2.il-central-1.amazonaws.com", "RegionName":"il-central-1", "OptInStatus":"not-opted-in" },
      { "Endpoint":"ec2.ca-central-1.amazonaws.com", "RegionName":"ca-central-1", "OptInStatus":"opt-in-not-required" },
      { "Endpoint":"ec2.eu-central-1.amazonaws.com", "RegionName":"eu-central-1", "OptInStatus":"opt-in-not-required" },
      { "Endpoint":"ec2.eu-central-2.amazonaws.com", "RegionName":"eu-central-2", "OptInStatus":"not-opted-in" },
      { "Endpoint":"ec2.us-west-1.amazonaws.com", "RegionName":"us-west-1", "OptInStatus":"opt-in-not-required" },
      { "Endpoint":"ec2.us-west-2.amazonaws.com", "RegionName":"us-west-2", "OptInStatus":"opt-in-not-required" },
      { "Endpoint":"ec2.af-south-1.amazonaws.com", "RegionName":"af-south-1", "OptInStatus":"not-opted-in" },
      { "Endpoint":"ec2.eu-north-1.amazonaws.com", "RegionName":"eu-north-1", "OptInStatus":"opt-in-not-required" },
      { "Endpoint":"ec2.eu-west-3.amazonaws.com", "RegionName":"eu-west-3", "OptInStatus":"opt-in-not-required" },
      { "Endpoint":"ec2.eu-west-2.amazonaws.com", "RegionName":"eu-west-2", "OptInStatus":"opt-in-not-required" },
      { "Endpoint":"ec2.eu-west-1.amazonaws.com", "RegionName":"eu-west-1", "OptInStatus":"opt-in-not-required" },
      { "Endpoint":"ec2.ap-northeast-3.amazonaws.com", "RegionName":"ap-northeast-3", "OptInStatus":"opt-in-not-required" },
      { "Endpoint":"ec2.ap-northeast-2.amazonaws.com", "RegionName":"ap-northeast-2", "OptInStatus":"opt-in-not-required" },
      { "Endpoint":"ec2.me-south-1.amazonaws.com", "RegionName":"me-south-1", "OptInStatus":"not-opted-in" },
      { "Endpoint":"ec2.ap-northeast-1.amazonaws.com", "RegionName":"ap-northeast-1", "OptInStatus":"opt-in-not-required" },
      { "Endpoint":"ec2.sa-east-1.amazonaws.com", "RegionName":"sa-east-1", "OptInStatus":"opt-in-not-required" },
      { "Endpoint":"ec2.ap-east-1.amazonaws.com", "RegionName":"ap-east-1", "OptInStatus":"not-opted-in" },
      { "Endpoint":"ec2.ca-west-1.amazonaws.com", "RegionName":"ca-west-1", "OptInStatus":"not-opted-in" },
      { "Endpoint":"ec2.ap-southeast-1.amazonaws.com", "RegionName":"ap-southeast-1", "OptInStatus":"opt-in-not-required" },
      { "Endpoint":"ec2.ap-southeast-2.amazonaws.com", "RegionName":"ap-southeast-2", "OptInStatus":"opt-in-not-required" },
      { "Endpoint":"ec2.ap-southeast-3.amazonaws.com", "RegionName":"ap-southeast-3", "OptInStatus":"not-opted-in" },
      { "Endpoint":"ec2.ap-southeast-4.amazonaws.com", "RegionName":"ap-southeast-4", "OptInStatus":"not-opted-in" },
      { "Endpoint":"ec2.us-east-1.amazonaws.com", "RegionName":"us-east-1", "OptInStatus":"opt-in-not-required" },
      { "Endpoint":"ec2.ap-southeast-5.amazonaws.com", "RegionName":"ap-southeast-5", "OptInStatus":"not-opted-in" },
      { "Endpoint":"ec2.us-east-2.amazonaws.com", "RegionName":"us-east-2", "OptInStatus":"opt-in-not-required" }
   ],
   "ResponseMetadata":{
      "RequestId":"b726bdc7-7f34-4884-a1e7-1abf061593f8",
      "HTTPStatusCode":200,
      "HTTPHeaders":{
         "x-amzn-requestid":"b726bdc7-7f34-4884-a1e7-1abf061593f8",
         "cache-control":"no-cache, no-store",
         "strict-transport-security":"max-age=31536000; includeSubDomains",
         "vary":"accept-encoding",
         "content-type":"text/xml;charset=UTF-8",
         "content-length":"4846",
         "date":"Mon, 23 Sep 2024 10:45:35 GMT",
         "server":"AmazonEC2"
      },
      "RetryAttempts":0
   }
}
```

From this result, for every RegionName we do:

How to stop all these API calls?

From my understanding, the region needs to be the same. So is there any reason we would want to keep this API call @zmoog?

Originally posted by @constanca-m in https://github.com/elastic/elastic-serverless-forwarder/issues/803#issuecomment-2367890505

PR that implements cache to limit the API calls: https://github.com/elastic/elastic-serverless-forwarder/pull/803
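To illustrate the general idea of caching the lookup, here is a minimal sketch using `functools.lru_cache`. It is not the code from the PR: `describe_region_names` is a hypothetical stand-in for the real DescribeRegions call, and the counter exists only to show how many times the stand-in is hit.

```python
from functools import lru_cache

api_calls = 0  # counts how often the stand-in "API" is actually invoked


def describe_region_names() -> tuple:
    """Hypothetical stand-in for the real DescribeRegions request."""
    global api_calls
    api_calls += 1
    return ("eu-west-1", "us-east-1")


@lru_cache(maxsize=1)
def cached_region_names() -> tuple:
    """Return the region names, calling the stand-in API at most once."""
    return describe_region_names()


# Simulate five events handled by the same warm execution environment.
for _ in range(5):
    regions = cached_region_names()
```

A cache like this lives in the memory of one Lambda execution environment, so each concurrent instance would still make its own first call. It reduces the number of calls per instance but does not remove them.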

constanca-m commented 1 week ago

Fixed in https://github.com/elastic/elastic-serverless-forwarder/pull/811.