Cache describe_regions using lru_cache from stdlib

zmoog commented 2 weeks ago

What does this PR do?

Caches the response of EC2:DescribeRegions API calls.

Why is it important?

On high-volume deployments, ESF can hit the EC2:DescribeRegions API request limit, causing throttling errors like the following:

An error occurred (RequestLimitExceeded) when calling the DescribeRegions operation (reached max retries: 4): Request limit exceeded.

ESF needs the list of existing regions to parse incoming events from the cloudwatch-logs input. Since new AWS region additions do not happen frequently, picking up and caching the list of existing regions at function startup seems adequate.

The list of existing AWS regions is available at https://aws.amazon.com/about-aws/global-infrastructure/regions_az/
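A minimal sketch of the caching approach named in the PR title, assuming a module-level `describe_regions` helper. `FakeEC2Client` and its `calls` counter are hypothetical stand-ins for the real boto3 client, so the example runs without AWS credentials:

```python
from functools import lru_cache
from typing import Any


class FakeEC2Client:
    """Hypothetical stand-in for boto3.client("ec2"); counts API calls."""

    calls = 0

    def describe_regions(self, AllRegions: bool = True) -> dict:
        FakeEC2Client.calls += 1
        return {"Regions": [{"RegionName": "eu-west-1"}, {"RegionName": "us-east-1"}]}


def get_ec2_client() -> FakeEC2Client:
    return FakeEC2Client()


@lru_cache(maxsize=None)
def describe_regions(all_regions: bool = True) -> Any:
    """Fetch the region list once per argument value, then serve it from memory."""
    return get_ec2_client().describe_regions(AllRegions=all_regions)


if __name__ == "__main__":
    for _ in range(3):
        describe_regions(all_regions=True)
    print(FakeEC2Client.calls)            # 1: only the first call reached the client
    print(describe_regions.cache_info())  # hits=2, misses=1
```

With `lru_cache` the entry never expires, so the cached list lives as long as the Python process does.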

Checklist

[x] My code follows the style guidelines of this project
[x] I have commented my code, particularly in hard-to-understand areas
[x] I have made corresponding changes to the documentation
[ ] I have made corresponding change to the default configuration files
[x] I have added tests that prove my fix is effective or that my feature works
[x] I have added an entry in CHANGELOG.md

zmoog commented 2 weeks ago

I can probably add cache expiration to avoid a stale region list:

import threading
import time
from functools import wraps
from typing import Any, Callable

def cache_for(seconds: int) -> Callable:
    """
    Caches the result of a function for a specified number of seconds."""
    def decorator(func: Callable) -> Callable:
        lock = threading.Lock()
        cache = {}
        hits = misses = 0

        @wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            nonlocal hits, misses
            with lock:
                key = str(args) + str(kwargs)
                current_time = time.time()

                if key in cache:
                    result, timestamp = cache[key]
                    if current_time - timestamp < seconds:
                        hits += 1
                        return result

                misses += 1

                result = func(*args, **kwargs)
                cache[key] = (result, current_time)

                return result

        def cache_stats() -> dict:
            """
            Returns the cache statistics.

            :return: A dictionary containing the cache statistics"""
            with lock:
                return {'hits': hits, 'misses': misses}

        wrapper.cache_stats = cache_stats

        return wrapper

    return decorator    

@cache_for(seconds=60)
def describe_regions(all_regions: bool = True) -> Any:
    """
    Fetches all regions from AWS and returns the response.

    :return: The response from the describe_regions method
    """
    return get_ec2_client().describe_regions(AllRegions=all_regions)

# Example usage with AWS regions
import boto3

def get_ec2_client():
    return boto3.client('ec2')

# Example usage to access cache statistics
if __name__ == "__main__":
    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 0, 'misses': 1}

    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 1, 'misses': 1}
    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 2, 'misses': 1}
    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 3, 'misses': 1}
    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 4, 'misses': 1}
    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 5, 'misses': 1}
    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 6, 'misses': 1}
    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 7, 'misses': 1}

I get the following output:

{'hits': 0, 'misses': 1}
{'hits': 1, 'misses': 1}
{'hits': 2, 'misses': 1}
{'hits': 3, 'misses': 1}
{'hits': 4, 'misses': 1}
{'hits': 5, 'misses': 1}
{'hits': 6, 'misses': 1}
{'hits': 7, 'misses': 1}

zmoog commented 2 weeks ago

Or we can use a third-party library like https://cachetools.readthedocs.io/en/latest/

from typing import Any

from cachetools.func import ttl_cache

@ttl_cache(ttl=1800) # 30 minutes
def describe_regions(all_regions: bool = True) -> Any:
    """
    Fetches all regions from AWS and returns the response.

    :return: The response from the describe_regions method
    """
    print("Fetching regions from AWS...")
    return get_ec2_client().describe_regions(AllRegions=all_regions)

# Example usage with AWS regions
import boto3

def get_ec2_client():
    return boto3.client('ec2')

# Example usage to access cache statistics
if __name__ == "__main__":

    # print(dir(describe_regions))

    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=0, misses=1, ...)

    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=1, misses=1, ...)
    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=2, misses=1, ...)
    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=3, misses=1, ...)
    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=4, misses=1, ...)
    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=5, misses=1, ...)
    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=6, misses=1, ...)
    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=7, misses=1, ...)

Output:

Fetching regions from AWS...
CacheInfo(hits=0, misses=1, maxsize=128, currsize=1)
CacheInfo(hits=1, misses=1, maxsize=128, currsize=1)
CacheInfo(hits=2, misses=1, maxsize=128, currsize=1)
CacheInfo(hits=3, misses=1, maxsize=128, currsize=1)
CacheInfo(hits=4, misses=1, maxsize=128, currsize=1)
CacheInfo(hits=5, misses=1, maxsize=128, currsize=1)
CacheInfo(hits=6, misses=1, maxsize=128, currsize=1)
CacheInfo(hits=7, misses=1, maxsize=128, currsize=1)

constanca-m commented 1 week ago

I wanted to see the details of what is happening exactly in our code and why this is an issue. I wrote this while I was testing it:

The problem comes from this line:

https://github.com/elastic/elastic-serverless-forwarder/blob/8be4fc4b87a06aa857e0cbc227f55f0533d89035/handlers/aws/handler.py#L147-L152

We need input_id so we know where to send the data (it is inside event_input). This information originally comes from the configuration provided by the user, stored in the ESF bucket, and ESF loads it at startup when parsing config.yaml. Each input_id is unique and should map to an Input specified in the configuration file.

How do we obtain the `input_id` for each specific trigger?

We have 4 possible triggers:

- cloudwatch-logs: not available in the event that triggers ESF
- kinesis-data-stream: available in lambda_event (this is the event that triggers ESF)
- s3-sqs: available in lambda_event
- sqs: available in lambda_event

I wanted to know what is inside lambda_event when it comes from CloudWatch Logs. I sent a message to a log stream to trigger it. This is the lambda_event that my ESF received.

{
   "awslogs":{
      "data":"H4sIAAAAAAAA/42QPWvDMBRF/0p4swX6lqzNUDdTJ2croTjJqyuwJaOntJSQ/17c0L3LHS6cc+HeYEGiccLD94oQ4Kk7dG8v/TB0+x4ayF8JCwSw0klvleFCaWhgztO+5OsKAc45UR3TeWQXpHOJJ2QFp5gTMaR3VpHqAxhqwXGBAJQXZHOeGD2aBuh62tC1xpye41yxEITXf6mPv+7+E1PdmBvECwRQ3gijlNbKGa9VK6y2rd3Sc6tla6zx3nLluOReWNsaL7n0HhqocUGq47JCEE467p2SngvR/B0FAcaU6weW3ZynHW7TcD/efwDlkGd8SgEAAA=="
   }
}

We decode the data field, which looks like this:

{
   "messageType":"DATA_MESSAGE",
   "owner":"627286350134",
   "logGroup":"constanca-describe-regions-esf-test",
   "logStream":"some-log-stream",
   "subscriptionFilters":[
      "constanca-describe-regions-esf-test"
   ],
   "logEvents":[
      {
         "id":"38515334437584391646961646806429565886037020816695820288",
         "timestamp":1727087328011,
         "message":"another log event"
      }
   ]
}
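The decoding step can be sketched as follows: the `data` field of a CloudWatch Logs event is base64-encoded, gzip-compressed JSON. The helper names here are hypothetical, and `encode_awslogs_data` exists only to build a sample event for the example:

```python
import base64
import gzip
import json


def decode_awslogs_data(lambda_event: dict) -> dict:
    """Return the decoded CloudWatch Logs payload carried in lambda_event."""
    compressed = base64.b64decode(lambda_event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def encode_awslogs_data(payload: dict) -> dict:
    """Inverse helper, used only to build a sample event for this example."""
    raw = gzip.compress(json.dumps(payload).encode("utf-8"))
    return {"awslogs": {"data": base64.b64encode(raw).decode("ascii")}}


if __name__ == "__main__":
    sample = {
        "owner": "627286350134",
        "logGroup": "constanca-describe-regions-esf-test",
        "logStream": "some-log-stream",
    }
    event = encode_awslogs_data(sample)
    print(decode_awslogs_data(event)["logGroup"])
```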

In our config.yaml file we need to provide the input_id for ESF as the cloudwatch ARN, in this format: arn:aws:logs:{region}:{account-id}:log-group:{log_group_name}:* or as arn:aws:logs:{region}:{account-id}:log-group:{log_group_name}:log-stream:{log-stream-name} (see official documentation). So what do we have in this data field that we can use?

region - No
account-id - Yes, field owner
log_group_name - Yes, field logGroup
log-stream-name - Yes, field logStream

So we are only missing region to obtain the input_id so we can then get the output to send the data to.

How do we obtain the `input_id` for cloudwatch trigger then?

Currently, we call the EC2:DescribeRegions API every time an event from a cloudwatch logs group triggers ESF.

Here is a sample of the result of this call in my test.

```json
{
  "Regions": [
    {"Endpoint": "ec2.ap-south-2.amazonaws.com", "RegionName": "ap-south-2", "OptInStatus": "not-opted-in"},
    {"Endpoint": "ec2.ap-south-1.amazonaws.com", "RegionName": "ap-south-1", "OptInStatus": "opt-in-not-required"},
    {"Endpoint": "ec2.eu-south-1.amazonaws.com", "RegionName": "eu-south-1", "OptInStatus": "not-opted-in"},
    {"Endpoint": "ec2.eu-south-2.amazonaws.com", "RegionName": "eu-south-2", "OptInStatus": "not-opted-in"},
    {"Endpoint": "ec2.me-central-1.amazonaws.com", "RegionName": "me-central-1", "OptInStatus": "not-opted-in"},
    {"Endpoint": "ec2.il-central-1.amazonaws.com", "RegionName": "il-central-1", "OptInStatus": "not-opted-in"},
    {"Endpoint": "ec2.ca-central-1.amazonaws.com", "RegionName": "ca-central-1", "OptInStatus": "opt-in-not-required"},
    {"Endpoint": "ec2.eu-central-1.amazonaws.com", "RegionName": "eu-central-1", "OptInStatus": "opt-in-not-required"},
    {"Endpoint": "ec2.eu-central-2.amazonaws.com", "RegionName": "eu-central-2", "OptInStatus": "not-opted-in"},
    {"Endpoint": "ec2.us-west-1.amazonaws.com", "RegionName": "us-west-1", "OptInStatus": "opt-in-not-required"},
    {"Endpoint": "ec2.us-west-2.amazonaws.com", "RegionName": "us-west-2", "OptInStatus": "opt-in-not-required"},
    {"Endpoint": "ec2.af-south-1.amazonaws.com", "RegionName": "af-south-1", "OptInStatus": "not-opted-in"},
    {"Endpoint": "ec2.eu-north-1.amazonaws.com", "RegionName": "eu-north-1", "OptInStatus": "opt-in-not-required"},
    {"Endpoint": "ec2.eu-west-3.amazonaws.com", "RegionName": "eu-west-3", "OptInStatus": "opt-in-not-required"},
    {"Endpoint": "ec2.eu-west-2.amazonaws.com", "RegionName": "eu-west-2", "OptInStatus": "opt-in-not-required"},
    {"Endpoint": "ec2.eu-west-1.amazonaws.com", "RegionName": "eu-west-1", "OptInStatus": "opt-in-not-required"},
    {"Endpoint": "ec2.ap-northeast-3.amazonaws.com", "RegionName": "ap-northeast-3", "OptInStatus": "opt-in-not-required"},
    {"Endpoint": "ec2.ap-northeast-2.amazonaws.com", "RegionName": "ap-northeast-2", "OptInStatus": "opt-in-not-required"},
    {"Endpoint": "ec2.me-south-1.amazonaws.com", "RegionName": "me-south-1", "OptInStatus": "not-opted-in"},
    {"Endpoint": "ec2.ap-northeast-1.amazonaws.com", "RegionName": "ap-northeast-1", "OptInStatus": "opt-in-not-required"},
    {"Endpoint": "ec2.sa-east-1.amazonaws.com", "RegionName": "sa-east-1", "OptInStatus": "opt-in-not-required"},
    {"Endpoint": "ec2.ap-east-1.amazonaws.com", "RegionName": "ap-east-1", "OptInStatus": "not-opted-in"},
    {"Endpoint": "ec2.ca-west-1.amazonaws.com", "RegionName": "ca-west-1", "OptInStatus": "not-opted-in"},
    {"Endpoint": "ec2.ap-southeast-1.amazonaws.com", "RegionName": "ap-southeast-1", "OptInStatus": "opt-in-not-required"},
    {"Endpoint": "ec2.ap-southeast-2.amazonaws.com", "RegionName": "ap-southeast-2", "OptInStatus": "opt-in-not-required"},
    {"Endpoint": "ec2.ap-southeast-3.amazonaws.com", "RegionName": "ap-southeast-3", "OptInStatus": "not-opted-in"},
    {"Endpoint": "ec2.ap-southeast-4.amazonaws.com", "RegionName": "ap-southeast-4", "OptInStatus": "not-opted-in"},
    {"Endpoint": "ec2.us-east-1.amazonaws.com", "RegionName": "us-east-1", "OptInStatus": "opt-in-not-required"},
    {"Endpoint": "ec2.ap-southeast-5.amazonaws.com", "RegionName": "ap-southeast-5", "OptInStatus": "not-opted-in"},
    {"Endpoint": "ec2.us-east-2.amazonaws.com", "RegionName": "us-east-2", "OptInStatus": "opt-in-not-required"}
  ],
  "ResponseMetadata": {
    "RequestId": "b726bdc7-7f34-4884-a1e7-1abf061593f8",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "x-amzn-requestid": "b726bdc7-7f34-4884-a1e7-1abf061593f8",
      "cache-control": "no-cache, no-store",
      "strict-transport-security": "max-age=31536000; includeSubDomains",
      "vary": "accept-encoding",
      "content-type": "text/xml;charset=UTF-8",
      "content-length": "4846",
      "date": "Mon, 23 Sep 2024 10:45:35 GMT",
      "server": "AmazonEC2"
    },
    "RetryAttempts": 0
  }
}
```

From this result, for every RegionName we do the following:

- Create the ARN with log stream specified.
- Look for this ARN in the configuration. Is it there? If yes, return the output we want to send the data to. If not:
- Create the ARN without the log stream (the log-group `:*` format).
- Look for it in the configuration. Is it there? If yes, return the output we want to send the data to. If not, continue the cycle or error.
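The per-region lookup can be sketched roughly as below. This is not the ESF implementation: `find_input_id` and the `configured_ids` set are hypothetical, and the sketch assumes the second attempt uses the log-group `:*` format from the config.yaml description above:

```python
from typing import Iterable, Optional


def find_input_id(
    configured_ids: set,
    region_names: Iterable[str],
    account_id: str,
    log_group: str,
    log_stream: str,
) -> Optional[str]:
    """Return the first configured ARN that matches in any region, else None."""
    for region in region_names:
        log_group_arn = f"arn:aws:logs:{region}:{account_id}:log-group:{log_group}"
        candidates = (
            f"{log_group_arn}:log-stream:{log_stream}",  # ARN with the log stream
            f"{log_group_arn}:*",  # ARN for the whole log group (assumed second try)
        )
        for arn in candidates:
            if arn in configured_ids:
                return arn
    return None


if __name__ == "__main__":
    configured = {
        "arn:aws:logs:eu-west-1:627286350134:log-group:constanca-describe-regions-esf-test:*"
    }
    print(
        find_input_id(
            configured,
            ["us-east-1", "eu-west-1"],
            "627286350134",
            "constanca-describe-regions-esf-test",
            "some-log-stream",
        )
    )
```

The loop only needs the region names, which is why the full DescribeRegions response is more than this lookup requires.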

How to stop all these API calls?

Understand if a cloudwatch logs event can trigger ESF from a different region.
- If it can:
  - Do the regions change? Then periodically make this API call to update the regions. This is what this PR does.
  - The regions do not change. Then maybe we can just hardcode it, right @zmoog? I do not see advantages in calling the API. Or we could just call once at the start and store the result.
- If it cannot: then obtain the region of the lambda, which will be the same as the cloudwatch logs group.

From my understanding, the region needs to be the same. So is there any reason we would want to keep this API call, @zmoog?

zmoog commented 1 week ago

Thanks for the in-depth analysis.

I tested the cloudwatch lambda trigger on the AWS console and ESF. As of today, it seems cloudwatch lambda triggers only work with log groups in the same region as the lambda function. For example, if I deploy ESF on eu-west-1, I can only process log events from log groups on eu-west-1.

Given this limit, there is no reason to keep calling the EC2:DescribeRegions API on every event.

I plan to remove this API call from ESF.

Here's my two-step plan:

1. Add the @lru_cache decorator from the standard library to reduce the number of API calls from one per event to just one at startup. This low-risk change would allow us to ship a patch release today.
2. Go through the process of removing the API call (change the code to use the region from the function, remove the required permissions from the infrastructure, and test the whole package).
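One way the second step could look, assuming the region is read from the `AWS_REGION` environment variable that the Lambda runtime sets. The thread does not say how ESF will obtain it, and both helper names are hypothetical:

```python
import os
from typing import Mapping, Optional


def get_function_region(environ: Optional[Mapping[str, str]] = None) -> str:
    """Read the region the function runs in from the AWS_REGION variable."""
    env = os.environ if environ is None else environ
    return env["AWS_REGION"]


def build_log_group_arn(region: str, account_id: str, log_group: str) -> str:
    """Build the log-group ARN directly, with no DescribeRegions call."""
    return f"arn:aws:logs:{region}:{account_id}:log-group:{log_group}:*"


if __name__ == "__main__":
    # A plain dict stands in for the Lambda environment in this example.
    region = get_function_region({"AWS_REGION": "eu-west-1"})
    print(build_log_group_arn(region, "627286350134", "constanca-describe-regions-esf-test"))
```

Since the trigger and the log group share a region, a single ARN candidate per format replaces the loop over every region.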

WDYT?

constanca-m commented 1 week ago

I am fine with approving the PR as it is.

You need to change the version of ESF (I believe you need to update the changelog and version.py). After that, the release workflow will be triggered; if we push this change as it is, nothing will happen.

zmoog commented 1 week ago

> You need to change the version of ESF (I believe you need to update the changelog and version.py). After that, the release workflow will be triggered; if we push this change as it is, nothing will happen.

Thanks! On it.

zmoog commented 1 week ago

> I would say we should keep #723 open.

I agree!

I doubt we will have issues like this again with the cache, but let's see!

I'll work on removing the EC2:DescribeRegions API call later this week.
