aws / aws-sdk-go-v2

AWS SDK for the Go programming language.
https://aws.github.io/aws-sdk-go-v2/docs/
Apache License 2.0

[FATAL] "Error while getting secret values from secret manager " #2708

Closed raghavendragujjar closed 4 months ago

raghavendragujjar commented 4 months ago

Describe the bug

"error": "operation error Secrets Manager: GetSecretValue, get identity: get credentials: failed to refresh cached credentials, no EC2 IMDS role found, operation error ec2imds: GetMetadata, exceeded maximum number of attempts, 3, http response error StatusCode: 503, request to EC2 IMDS failed", "dt.entity.host": "HOST-XXXXXXXXXXXXXXX", "dt.entity.host_group": "HOST_GROUP-XXXXXXXXXXXXXXX", "dt.entity.process_group": "PROCESS_GROUP-XXXXXXXXXXXXXXX", "dt.entity.process_group_instance": "PROCESS_GROUP_INSTANCE-XXXXXXXXXXXXXXX", "dt.host_group.id": "prod-env", "dt.span_id": "XXXXXXXXXXXXXXX", "dt.trace_id": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX", "dt.trace_sampled": "true"}

Expected Behavior

GetSecretValue should succeed whenever our service calls it through the SDK code we have integrated.

Current Behavior

The service crashes because the GetSecretValue call fails.

Reproduction Steps

NO

Possible Solution

No response

Additional Information/Context

    secretName := "XXXX/XXXX"
    region := "XXXX"

    // Named cfg so the variable does not shadow the config package.
    cfg, err := config.LoadDefaultConfig(context.TODO(), config.WithRegion(region))
    if err != nil {
        L.L.Fatal("Error while loading default AWS config :", L.Error(err))
    }

    svc := secretsmanager.NewFromConfig(cfg)

    input := &secretsmanager.GetSecretValueInput{
        SecretId:     aws.String(secretName),
        VersionStage: aws.String("AWSCURRENT"), // VersionStage defaults to AWSCURRENT if unspecified
    }

    result, err := svc.GetSecretValue(context.TODO(), input)
    if err != nil {
        // For a list of exceptions thrown, see
        // https://docs.aws.amazon.com/secretsmanager/latest/apireference/API_GetSecretValue.html
        L.L.Fatal("Error while getting secret values from secret manager", L.Error(err))
    }

SDK version used

v1.26.1

Environment details (Version of Go (go version)? OS name and version, etc.)

1.20

lucix-aws commented 4 months ago

You appear to be using SDK v2 based on the signature of GetSecretValue. Transferring.

raghavendragujjar commented 4 months ago

> You appear to be using SDK v2 based on the signature of GetSecretValue. Transferring.

Would you please give more details on this, so that we can fix it ASAP?

secretName := "XXXX/XXXX"
region := "XXXX"

// Named cfg so the variable does not shadow the config package.
cfg, err := config.LoadDefaultConfig(context.TODO(), config.WithRegion(region))
if err != nil {
    L.L.Fatal("Error while loading default AWS config :", L.Error(err))
}

svc := secretsmanager.NewFromConfig(cfg)

input := &secretsmanager.GetSecretValueInput{
    SecretId:     aws.String(secretName),
    VersionStage: aws.String("AWSCURRENT"), // VersionStage defaults to AWSCURRENT if unspecified
}

result, err := svc.GetSecretValue(context.TODO(), input)
if err != nil {
    // For a list of exceptions thrown, see
    // https://docs.aws.amazon.com/secretsmanager/latest/apireference/API_GetSecretValue.html
    L.L.Fatal("Error while getting secret values from secret manager", L.Error(err))
}

RanVaknin commented 4 months ago

Hi @raghavendragujjar,

> Would you please give more details on this, so that can help us to fix this asap.

What @lucix-aws was saying is that you opened this in the wrong GitHub repo (Go SDK v1), and that he transferred it over to the appropriate repo (Go SDK v2).

I'm not sure why you are seeing a 503, since that status indicates a server-side error. Note, though, that the failure is not in fetching the secret itself but in obtaining the credentials needed to fetch it. How are you running your code? Locally? On an EC2 host? In a container?

Additionally please enable the request and response logs and share those with us so we can see if we can get any other info from those logs. You enable logs like so:

    cfg, err := config.LoadDefaultConfig(context.TODO(), config.WithRegion(region), config.WithClientLogMode(aws.LogRetries|aws.LogRequestWithBody))

Also you do not need to redact your region as it's not considered sensitive info.

Thanks, Ran~

ankush1593 commented 4 months ago

Thanks @RanVaknin. I'm a colleague of @raghavendragujjar, and we are using the "eu-west-1" region, if that helps. This runs in our production environment in an AWS EKS pod. We'll enable the logs as suggested. Let me know if you need any other info.

RanVaknin commented 4 months ago

Hi @ankush1593 ,

Thanks for the info. If you're running the SDK on an EKS pod, you'll need to configure the IRSA provider so that the SDK can obtain credentials when running on the pod.

When you invoke an API operation, the SDK tries to fetch credentials from several sources. If you're on an EKS pod, the SDK attempts to read your IRSA token from the pod's file system and then calls AssumeRoleWithWebIdentity to exchange that token for temporary credentials, which are later used to call GetSecretValue.

The AssumeRoleWithWebIdentity call is made implicitly by the SDK, so you won't see that request unless you enable the logs. Once you have those logs, please share them with us.

If you haven't set up IRSA for your EKS cluster, please refer to these docs.
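
A quick way to check whether a pod has been wired up for IRSA is to look for the two environment variables that EKS injects for an annotated service account, AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE, and to confirm that the token file exists. The following is only a minimal sketch using the standard library; the helper name irsaConfigured is hypothetical and not part of the SDK.

```go
package main

import (
	"fmt"
	"os"
)

// irsaConfigured is a hypothetical helper (not an SDK function). It reports
// whether the two environment variables used for IRSA are set and whether
// the token file they point to can be found on disk.
func irsaConfigured(getenv func(string) string) (bool, string) {
	roleARN := getenv("AWS_ROLE_ARN")
	tokenFile := getenv("AWS_WEB_IDENTITY_TOKEN_FILE")
	switch {
	case roleARN == "":
		return false, "AWS_ROLE_ARN is not set"
	case tokenFile == "":
		return false, "AWS_WEB_IDENTITY_TOKEN_FILE is not set"
	}
	if _, err := os.Stat(tokenFile); err != nil {
		return false, fmt.Sprintf("cannot stat token file: %v", err)
	}
	return true, "IRSA environment looks complete for role " + roleARN
}

func main() {
	ok, detail := irsaConfigured(os.Getenv)
	fmt.Printf("irsa configured: %t (%s)\n", ok, detail)
}
```

Running this inside the pod at startup gives a fast signal about whether the SDK is likely to use the web identity provider or fall back to another source such as IMDS.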

Thanks, Ran~

ankush1593 commented 4 months ago

Hi @RanVaknin ,

We didn't have IRSA set up as described in the documentation. We are using AssumeRole with the Kubernetes default service account, not AssumeRoleWithWebIdentity. Please also find below the masked logs that we captured in a lower environment (these are from successful calls, without any errors or issues):

SDK 2024/07/15 10:36:23 DEBUG Request
PUT /latest/api/token HTTP/1.1
Host: xxx.xxx.xxx.xxx
User-Agent: aws-sdk-go-v2/1.26.1 os/linux lang/go#1.20.4 md/GOOS#linux md/GOARCH#amd64 ft/ec2-imds
Content-Length: 0
Amz-Sdk-Request: attempt=1; max=3
X-Aws-Ec2-Metadata-Token-Ttl-Seconds: 300
Accept-Encoding: gzip

SDK 2024/07/15 10:36:23 DEBUG Request
GET /latest/meta-data/iam/security-credentials/ HTTP/1.1
Host: xxx.xxx.xxx.xxx
User-Agent: aws-sdk-go-v2/1.26.1 os/linux lang/go#1.20.4 md/GOOS#linux md/GOARCH#amd64 ft/ec2-imds
Amz-Sdk-Request: attempt=1; max=3
X-Aws-Ec2-Metadata-Token: RHVtbXlUb2tlbg==
Accept-Encoding: gzip

SDK 2024/07/15 10:36:23 DEBUG Request
GET /latest/meta-data/iam/security-credentials/ppbet-cms-qa-worker-role HTTP/1.1
Host: xxx.xxx.xxx.xxx
User-Agent: aws-sdk-go-v2/1.26.1 os/linux lang/go#1.20.4 md/GOOS#linux md/GOARCH#amd64 ft/ec2-imds
Amz-Sdk-Request: attempt=1; max=3
X-Aws-Ec2-Metadata-Token: RHVtbXlUb2tlbg==
Accept-Encoding: gzip

SDK 2024/07/15 10:36:23 DEBUG Request
POST / HTTP/1.1
Host: secretsmanager.eu-west-1.amazonaws.com
User-Agent: aws-sdk-go-v2/1.26.1 os/linux lang/go#1.20.4 md/GOOS#linux md/GOARCH#amd64 api/secretsmanager#1.28.6
Content-Length: 55
Amz-Sdk-Invocation-Id: d8c65882-c5e8-4ffe-921b-35c803aed87c
Amz-Sdk-Request: attempt=1; max=3
Authorization: AWS4-HMAC-SHA256 Credential=ASIA4XXXXXX4XXXX63XX/20240715/eu-west-1/secretsmanager/aws4_request, SignedHeaders=amz-sdk-invocation-id;amz-sdk-request;content-length;content-type;host;x-amz-date;x-amz-security-token;x-amz-target, Signature=73b721277f36vf4509791dda91f0a6e9c96z76595b9186u8b7la349a3532063z
Content-Type: application/x-amz-json-1.1
X-Amz-Date: 20240715T103623Z
X-Amz-Security-Token: RHVtbXlUb2tlbg==
X-Amz-Target: secretsmanager.GetSecretValue
Accept-Encoding: gzip

RanVaknin commented 4 months ago

Hey @ankush1593 ,

I just noticed I've given you the wrong logging flag. My apologies, it should have been:

    cfg, err := config.LoadDefaultConfig(context.TODO(), config.WithRegion(region), config.WithClientLogMode(aws.LogResponseWithBody|aws.LogRequestWithBody))

So that we can see the responses as well. Can you add that flag and share the new logs, please?

Based on the request logs, the credential provider in use is the IMDS provider, which is the old way of giving pods AWS credentials. I can't see any explicit failures in the request logs, but since these requests fire sequentially, piping a value from each request into the next (the metadata token, then the role name), I can deduce that at least the first two API calls were successful.

Are you seeing this 503 error in all cases, or is it happening intermittently? If it happens intermittently, are you observing retries that eventually make the call succeed?

Thanks, Ran~

ankush1593 commented 4 months ago

Hi @RanVaknin ,

The 503 errors were very infrequent (3-4 times in the couple of months since we deployed). We do not retry the failed request, but other calls succeeded after the error response. We have not observed the issue in lower environments either. Please find the masked logs of successful calls from a lower environment in the attached file: SecretManagerResponse.log

Also, is it possible for this error to occur if IRSA wasn't set up with a service account, as mentioned in my previous response?

RanVaknin commented 4 months ago

Hi there,

The idea behind me asking for logs was to examine the execution flow that leads to those errors in the first place. If you are not seeing these errors anymore, then we are not really able to help / hypothesize why they happened. The SDK is a client and because it only lives on your machine, we do not have logs for previous failed executions to investigate why failures happened after the fact.

> Also, is it possible for this error to occur if the IRSA wasn't setup with a service account as mentioned in my previous response?

It's unlikely. IRSA is relatively new, and while it is now the standard for EKS-based auth, before it existed IMDS was one of the ways to give your pod credentials (that, or the container metadata service).

If the 503 issue comes up again, make sure you have the necessary logging infrastructure to either capture the whole flow or at least capture the request ID (which is only possible by enabling the aws.LogResponse or aws.LogResponseWithBody flag). If you have access to AWS support, they'll be able to use that request ID to identify the failed request and examine the conditions that might have contributed to it.

Thanks, Ran~

ankush1593 commented 4 months ago

Hi @RanVaknin ,

Given that successful logs won't help you and the issue occurs very infrequently, we can close the issue for the time being. On our side, we'll enable the logs and move to IRSA. If we encounter this again, we'll have the response logs and will post them here. Thanks for the help.

Regards, Ankush
