Closed erhancagirici closed 6 months ago
/test-examples="examples/iam/v1beta1/role.yaml"
/test-examples="examples/sns/v1beta1/topic.yaml"
/test-examples="examples/sns/v1beta1/topic.yaml"
/test-examples="examples/sns/v1beta1/topic.yaml"
Thanks @erhancagirici, @sergenyalcin, lgtm.
Description of your changes
Fixes #997. Introduces a global credential cache to reduce AWS STS calls. Only IRSA credentials are cached.
The new provider cache is a two-layer hierarchical cache. The L1 cache is an AWS SDK for Go `aws.CredentialsCache`, and the L2 cache caches these `aws.CredentialsCache` instances. The cache key for the L2 cache is derived from the well-known IRSA authentication parameters as well as the contents of the OIDC ID token file. The new cache also caches the AWS account ID for a given IRSA configuration and replaces the identity cache for IRSA configurations.
Background:
I have:
- Run `make reviewable` to ensure this PR is ready for review.
- Added `backport release-x.y` labels to auto-backport this PR if necessary.
How has this code been tested
Tested manually with provider configs with:
`index.docker.io/ulucinar/provider-aws-ec2:v1.3.0-0fbbf02b3656352c729396851646d12ef80a1496` for `Upbound` authentication on Upbound Cloud.
`Secret`) has succeeded here: https://github.com/crossplane-contrib/provider-upjet-aws/actions/runs/8468707015
Two experiments were done using 4 managed resources (MRs), with a plain IRSA configuration and with an IRSA configuration that has an assume-role chain of length two, using the following `ProviderConfig.aws`:
During these experiments, we forced frequent reconciliations of the MRs (every 3 seconds) in constant update loops, and we also observed the AWS CloudTrail event history for an extended period of time. Here are the relevant events from CloudTrail:
As the logs show, for these 4 MRs, at most one `sts.AssumeRoleWithWebIdentity` operation per hour was recorded, showing the effectiveness of the credential cache for IRSA authentication. Please note that the temporary credentials issued by `sts.AssumeRoleWithWebIdentity` are valid for one hour. It's the L1 cache that discards these temporary credentials after one hour and renews them. During this extended period, because the L2 cache item was not discarded, only one `sts.GetCallerIdentity` operation was observed.
I also tested the L2 cache by invalidating the cache entry prematurely. The following event logs show how this results in a premature call to `sts.AssumeRoleWithWebIdentity`:
After the temporary credentials were fetched at `March 28, 2024, 16:36:47`, we would not expect them to be refreshed for an hour, but because the L2 cache entry was made stale, there was a premature call at `March 28, 2024, 16:46:22`.
Also tested the PR on top of @mergenci's API Call Counters PR. Under an update loop, the reported API call counters for `sts.AssumeRoleWithWebIdentity` & `sts.GetCallerIdentity` are not increasing: