crossplane-contrib / provider-upjet-aws

Official AWS Provider for Crossplane by Upbound.
https://marketplace.upbound.io/providers/upbound/provider-aws
Apache License 2.0

Cache AWS Config's CredentialsProvider to reduce STS calls #1235

Closed erhancagirici closed 6 months ago

erhancagirici commented 6 months ago

Description of your changes

Fixes #997. Introduces a global credential cache to reduce AWS STS calls. Only IRSA credentials are cached.

The new provider cache is a two-layer hierarchical cache. The L1 cache is an AWS SDK for Go aws.CredentialsCache, and the L2 cache stores these aws.CredentialsCache instances. The L2 cache key is derived from the well-known IRSA authentication parameters and from the contents of the OIDC ID token file. The new cache also caches the AWS account ID for a given IRSA configuration and replaces the identity cache for IRSA configurations.
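The two layers described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code in this PR: the types `creds`, `l1Cache`, and `l2Cache`, the functions `cacheKey` and `simulate`, and the exact set of parameters hashed into the key are all hypothetical. `l1Cache` only imitates the role that aws.CredentialsCache plays in the provider.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
	"time"
)

// creds is a hypothetical stand-in for temporary AWS credentials.
type creds struct {
	accessKeyID string
	expires     time.Time
}

// l1Cache plays the role of aws.CredentialsCache: it keeps one credential set
// and calls retrieve again only once that set has expired.
type l1Cache struct {
	mu       sync.Mutex
	current  *creds
	retrieve func(now time.Time) creds
}

func (c *l1Cache) get(now time.Time) creds {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.current == nil || !now.Before(c.current.expires) {
		v := c.retrieve(now)
		c.current = &v
	}
	return *c.current
}

// l2Cache maps a key derived from the IRSA configuration to its L1 cache.
type l2Cache struct {
	mu      sync.Mutex
	entries map[string]*l1Cache
}

// cacheKey hashes the IRSA parameters together with the token contents, so a
// rotated token yields a different key. The parameter set is an assumption.
func cacheKey(region, roleARN string, token []byte) string {
	h := sha256.New()
	for _, part := range [][]byte{[]byte(region), []byte(roleARN), token} {
		h.Write(part)
		h.Write([]byte{0}) // separator between fields
	}
	return hex.EncodeToString(h.Sum(nil))
}

func (c *l2Cache) getOrCreate(key string, newL1 func() *l1Cache) *l1Cache {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[key]; ok {
		return e
	}
	e := newL1()
	c.entries[key] = e
	return e
}

// simulate runs n reconciliations, stepSeconds apart, against one IRSA
// configuration and returns how many times the fake STS retrieval ran.
func simulate(n, stepSeconds int) int {
	calls := 0
	l2 := &l2Cache{entries: map[string]*l1Cache{}}
	key := cacheKey("us-east-1", "arn:aws:iam::123456789012:role/example", []byte("token"))
	start := time.Date(2024, 3, 28, 16, 0, 0, 0, time.UTC)
	for i := 0; i < n; i++ {
		now := start.Add(time.Duration(i*stepSeconds) * time.Second)
		l1 := l2.getOrCreate(key, func() *l1Cache {
			return &l1Cache{retrieve: func(t time.Time) creds {
				calls++ // stands in for sts.AssumeRoleWithWebIdentity
				return creds{accessKeyID: "example-key", expires: t.Add(time.Hour)}
			}}
		})
		l1.get(now)
	}
	return calls
}

func main() {
	fmt.Println(simulate(1200, 3)) // reconciles at t = 0s .. 3597s
	fmt.Println(simulate(1201, 3)) // adds one reconcile at t = 3600s
}
```

In this model, reconciles that share a key reuse the same L1 cache, so the number of retrievals depends on credential lifetime rather than on reconcile frequency.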

How has this code been tested

Tested manually with provider configs with:

Two experiments were run, each using 4 managed resources (MRs): one with a plain IRSA configuration, and one with an IRSA configuration that has an assume-role chain of length two, using the following ProviderConfig.aws:

apiVersion: aws.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  assumeRoleChain:
  - roleARN: arn:aws:iam::<account ID>:role/alper-rc-1
  - roleARN: arn:aws:iam::<account ID>:role/alper-rc-2
  credentials:
    source: IRSA

During these experiments, we forced frequent reconciliations of the MRs (every 3 seconds) in constant update loops and we also observed the AWS CloudTrail event history for an extended period of time. Here are the relevant events from CloudTrail:

image

As the logs show, for these 4 MRs, at most one sts.AssumeRoleWithWebIdentity operation per hour was recorded, which shows the effectiveness of the credential cache for IRSA authentication. Note that the temporary credentials issued by sts.AssumeRoleWithWebIdentity are valid for one hour. It is the L1 cache that discards these temporary credentials after one hour and renews them. During this extended period, because the L2 cache item was not discarded, only one sts.GetCallerIdentity operation was observed.

I also tested the L2 cache by invalidating the cache entry prematurely. The following event logs show how this results in a premature call to sts.AssumeRoleWithWebIdentity:

image

After the temporary credentials were fetched at March 28, 2024, 16:36:47, we would not expect them to be refreshed for an hour. However, because the L2 cache entry was forced to go stale, there was a premature call at March 28, 2024, 16:46:22.
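
The effect of dropping a cache entry before its credentials expire can be replayed with a small model. This is only a sketch: `credCache`, `entry`, and `replay` are hypothetical names, the single-map cache is a simplification of the two-layer design, and the timestamps are assumed to be in UTC because the time zone is not stated above.

```go
package main

import (
	"fmt"
	"time"
)

// entry is a hypothetical cache item that records when its credentials expire.
type entry struct {
	expires time.Time
}

// credCache is a simplified stand-in for the provider's credential cache.
type credCache struct {
	entries map[string]entry
	fetches []time.Time // times at which the fake STS call ran
}

// get returns without fetching on a cache hit, otherwise records a fetch
// (standing in for sts.AssumeRoleWithWebIdentity) and stores a fresh entry.
func (c *credCache) get(key string, now time.Time) {
	if e, ok := c.entries[key]; ok && now.Before(e.expires) {
		return
	}
	c.fetches = append(c.fetches, now)
	c.entries[key] = entry{expires: now.Add(time.Hour)}
}

// invalidate drops the entry, which forces the next get to fetch again.
func (c *credCache) invalidate(key string) {
	delete(c.entries, key)
}

// replay requests credentials at the two timestamps from the experiment and
// returns the number of fetches that resulted.
func replay(invalidateEarly bool) int {
	c := &credCache{entries: map[string]entry{}}
	first := time.Date(2024, 3, 28, 16, 36, 47, 0, time.UTC)
	second := time.Date(2024, 3, 28, 16, 46, 22, 0, time.UTC)
	c.get("irsa", first)
	if invalidateEarly {
		c.invalidate("irsa")
	}
	c.get("irsa", second)
	return len(c.fetches)
}

func main() {
	fmt.Println(replay(false)) // prints: 1
	fmt.Println(replay(true))  // prints: 2
}
```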

Also tested the PR on top of @mergenci's API Call Counters PR. Under an update loop, the reported API call counters for sts.AssumeRoleWithWebIdentity & sts.GetCallerIdentity are not increasing:

❯ curl -s http://localhost:8080/metrics | grep upjet | grep upjet_resource_external_api_calls_total
# HELP upjet_resource_external_api_calls_total The number of external API calls.
# TYPE upjet_resource_external_api_calls_total counter
upjet_resource_external_api_calls_total{operation="AssumeRole",service="STS"} 2
upjet_resource_external_api_calls_total{operation="AssumeRoleWithWebIdentity",service="STS"} 1
upjet_resource_external_api_calls_total{operation="CreateRole",service="IAM"} 1
upjet_resource_external_api_calls_total{operation="GetCallerIdentity",service="STS"} 1
upjet_resource_external_api_calls_total{operation="GetRole",service="IAM"} 26
upjet_resource_external_api_calls_total{operation="GetRolePolicy",service="IAM"} 25
upjet_resource_external_api_calls_total{operation="ListAttachedRolePolicies",service="IAM"} 25
upjet_resource_external_api_calls_total{operation="ListRolePolicies",service="IAM"} 25
upjet_resource_external_api_calls_total{operation="PutRolePolicy",service="IAM"} 1
❯ curl -s http://localhost:8080/metrics | grep upjet | grep upjet_resource_external_api_calls_total
# HELP upjet_resource_external_api_calls_total The number of external API calls.
# TYPE upjet_resource_external_api_calls_total counter
upjet_resource_external_api_calls_total{operation="AssumeRole",service="STS"} 2
upjet_resource_external_api_calls_total{operation="AssumeRoleWithWebIdentity",service="STS"} 1
upjet_resource_external_api_calls_total{operation="CreateRole",service="IAM"} 1
upjet_resource_external_api_calls_total{operation="GetCallerIdentity",service="STS"} 1
upjet_resource_external_api_calls_total{operation="GetRole",service="IAM"} 61
upjet_resource_external_api_calls_total{operation="GetRolePolicy",service="IAM"} 60
upjet_resource_external_api_calls_total{operation="ListAttachedRolePolicies",service="IAM"} 60
upjet_resource_external_api_calls_total{operation="ListRolePolicies",service="IAM"} 60
upjet_resource_external_api_calls_total{operation="PutRolePolicy",service="IAM"} 1
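
The comparison of the two scrapes above can also be automated. The sketch below parses the counter lines and reports which counters grew between scrapes. `parseCounters` and `increased` are hypothetical helpers, the parser handles only the simple label syntax seen in this output, and the embedded samples are a subset of the lines shown above.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

const metric = "upjet_resource_external_api_calls_total"

// parseCounters maps each label set of the metric to its value. Comment lines
// (# HELP, # TYPE) and other metrics are skipped.
func parseCounters(text string) map[string]float64 {
	out := map[string]float64{}
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, metric+"{") {
			continue
		}
		end := strings.LastIndex(line, "}")
		if end < 0 {
			continue
		}
		labels := line[len(metric)+1 : end]
		fields := strings.Fields(line[end+1:])
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		out[labels] = v
	}
	return out
}

// increased returns, per label set, how much a counter grew between scrapes.
// Counters that did not grow are omitted.
func increased(before, after map[string]float64) map[string]float64 {
	d := map[string]float64{}
	for k, v := range after {
		if delta := v - before[k]; delta > 0 {
			d[k] = delta
		}
	}
	return d
}

const firstScrape = `# TYPE upjet_resource_external_api_calls_total counter
upjet_resource_external_api_calls_total{operation="AssumeRole",service="STS"} 2
upjet_resource_external_api_calls_total{operation="AssumeRoleWithWebIdentity",service="STS"} 1
upjet_resource_external_api_calls_total{operation="GetCallerIdentity",service="STS"} 1
upjet_resource_external_api_calls_total{operation="GetRole",service="IAM"} 26`

const secondScrape = `# TYPE upjet_resource_external_api_calls_total counter
upjet_resource_external_api_calls_total{operation="AssumeRole",service="STS"} 2
upjet_resource_external_api_calls_total{operation="AssumeRoleWithWebIdentity",service="STS"} 1
upjet_resource_external_api_calls_total{operation="GetCallerIdentity",service="STS"} 1
upjet_resource_external_api_calls_total{operation="GetRole",service="IAM"} 61`

func main() {
	// Only the IAM GetRole counter grows; the STS counters stay flat.
	fmt.Println(increased(parseCounters(firstScrape), parseCounters(secondScrape)))
}
```
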
ulucinar commented 6 months ago

/test-examples="examples/iam/v1beta1/role.yaml"

ulucinar commented 6 months ago

/test-examples="examples/sns/v1beta1/topic.yaml"

sergenyalcin commented 6 months ago

/test-examples="examples/sns/v1beta1/topic.yaml"

ulucinar commented 6 months ago

/test-examples="examples/sns/v1beta1/topic.yaml"

ulucinar commented 6 months ago

Thanks @erhancagirici, @sergenyalcin, lgtm.