aws / aws-sdk-js-v3

Modularized AWS SDK for JavaScript.
Apache License 2.0

Secrets Manager EPROTO error #3513

Closed steelbrain closed 1 year ago

steelbrain commented 2 years ago

Describe the bug

We're using Secrets Manager to initialize Lambda state, and are frequently getting `write EPROTO` failure messages. It started happening recently, after we upgraded from v3.41.0 to v3.58.0.

Your environment

SDK version number

@aws-sdk/client-secrets-manager@3.58.0

Is the issue in the browser/Node.js/ReactNative?

Node.js

Details of the browser/Node.js/ReactNative version

Node.js 14.x Lambda :)

Steps to reproduce

Here's a TL;DR of the Lambda handler code:

const { SecretsManagerClient, GetSecretValueCommand } = require('@aws-sdk/client-secrets-manager')

// Kick off the GetSecretValue call once, at module init (i.e. during cold start),
// and cache the resulting promise for every invocation to await.
const promiseEnv = new SecretsManagerClient({
  region: process.env.AWS_ENV_SECRET_REGION,
}).send(
  new GetSecretValueCommand({
    SecretId: process.env.AWS_ENV_SECRET_ID,
  })
)

async function handler(event, context) {
  console.log('Requesting environment variables')
  const env = await promiseEnv
  console.log('Got environment variables')
  // ....
}

module.exports = { handler }

Observed behavior

Most of the time everything works, but then it unexpectedly crashes at `await promiseEnv`, and `Got environment variables` is never logged.

Expected behavior

Secrets Manager would keep working

Screenshots

N/A

Additional context

Here are the raw logs:

```console
[TS] [UUID] INFO Requesting environment variables
[TS] [UUID] ERROR Invoke Error {"errorType":"Error","errorMessage":"write EPROTO","code":"EPROTO","errno":-71,"syscall":"write","$metadata":{"attempts":1,"totalRetryDelay":0},"stack":["Error: write EPROTO"," at WriteWrap.onWriteComplete [as oncomplete] (internal/stream_base_commons.js:94:16)"," at WriteWrap.callbackTrampoline (internal/async_hooks.js:130:17)"]}
[TS] [UUID] ERROR (node:9) PromiseRejectionHandledWarning: Promise rejection was handled asynchronously (rejection id: 14)
(Use `node --trace-warnings ...` to show where the warning was created)
END RequestId: [UUID]
REPORT RequestId: [UUID] Duration: 33.91 ms Billed Duration: 34 ms Memory Size: 1536 MB Max Memory Used: 99 MB Init Duration: 1213.34 ms
```
semmgeorge commented 2 years ago

I got the same issue with "@aws-sdk/client-secrets-manager": "^3.53.0", and the same behavior: most of the time everything works, but then it unexpectedly crashes. It crashes at `await secretsManagerClient.send`:

const { SecretsManagerClient, GetSecretValueCommand } = require('@aws-sdk/client-secrets-manager');
const { defaultProvider } = require('@aws-sdk/credential-provider-node');

// `local`, `AwsProfile`, and `REGION` come from our own configuration.
const secretsManagerClient = new SecretsManagerClient({
    credentials: local ? defaultProvider({ profile: AwsProfile }) : undefined,
    region: REGION
});

// Inside a class: static private method wrapping the GetSecretValue call.
static #mySecrets = async (secretName) => {
    try {
        const data = await secretsManagerClient.send(
            new GetSecretValueCommand({ SecretId: secretName })
        );
        return data; // For unit tests.
    } catch (err) {
        console.log('err', err);
    }
};
samkotlove commented 2 years ago

Noticing this same behavior on v3.67.0

tux86 commented 2 years ago

We have the same error, and it has caused serious problems in our production environment for a few weeks now. I switched to environment variables and disabled the Secrets Manager client.

ppsmol24 commented 2 years ago

Same on our live environment. We initially updated @aws-sdk/client-secrets-manager from version 3.20.0 to 3.52.0, and our Lambdas started throwing spikes of the following errors at random intervals throughout the day:

```
error.code EPROTO
error.errno -71
error.errorMessage write EPROTO
error.errorType Error
error.stack.0 Error: write EPROTO
error.stack.1 at __node_internal_captureLargerStackTrace (internal/errors.js:412:5)
error.stack.2 at __node_internal_errnoException (internal/errors.js:542:12)
error.stack.3 at WriteWrap.onWriteComplete [as oncomplete] (internal/stream_base_commons.js:94:16)
error.syscall write
errorType AwsError
stack.0 AwsError
stack.1 at /var/task/packages/aws/dist/secretsManager/secretsManager.js:11:38
stack.2 at processTicksAndRejections (internal/process/task_queues.js:95:5)
stack.3 at async Promise.all (index 1)
```

We then upgraded to 3.89.0, thinking the issue might have been fixed in the meantime, but encountered the same behavior.

Update: downgrading back to version 3.20.0 seems to have resolved it for now.

hikarunoryoma commented 2 years ago

We are also seeing this issue on v3.130.0 and have opted for the workaround of downgrading to 3.20.0. Any updates @RanVaknin?

jjpepper commented 2 years ago

@AllanZhengYP @RanVaknin is this an issue you've seen? We have been seeing it quite a few times in recent weeks, with a very recent aws-sdk v3:


```
ERROR   Error: write EPROTO
    at WriteWrap.onWriteComplete [as oncomplete] (internal/stream_base_commons.js:94:16) {
  errno: -71,
  code: 'EPROTO',
  syscall: 'write',
  '$metadata': { attempts: 1, totalRetryDelay: 0 }
}
```
jjpepper commented 2 years ago

@AllanZhengYP @RanVaknin we've investigated this a bit more. It seems that we are seeing the EPROTO error after the lambda times out, and then tries to re-initialise (i.e. we see our cold start code again in the same log group).

sans-jmansfield commented 2 years ago

We recently began moving a variety of microservices from AWS SDK v2 to v3 and have seen flavors of this error in several repos, most recently with 3.154.0.

RanVaknin commented 2 years ago

Hi All,

Unfortunately I'm not able to reproduce this issue. We have multiple issues open for the same EPROTO error; I tried reproducing with two customer examples and never ran into it. I've assigned this to the dev team to take a look.

jjpepper commented 2 years ago

One of my colleagues has done some analysis and suspects the issue is due to the clock being momentarily wrong when the Lambda starts up.

Dantemss commented 2 years ago

I too have encountered this over and over, and my educated guess is that it happens when some (unrelated) code blocks the event loop for a bit too long. AWS services seem to have short timeouts when dealing with connections, and the SDK does not retry them, so blocking the JS event loop would delay the connection handling and cause the connection to fail with this error.
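
A rough sketch of the kind of pattern I mean (the `busyWait` helper and the timings are hypothetical, purely to illustrate the event-loop-blocking guess):

```js
const { SecretsManagerClient, GetSecretValueCommand } = require('@aws-sdk/client-secrets-manager')

const client = new SecretsManagerClient({ region: process.env.AWS_REGION })

// Hypothetical helper that synchronously pins the event loop for `ms` milliseconds.
function busyWait(ms) {
  const end = Date.now() + ms
  while (Date.now() < end) { /* spin */ }
}

exports.handler = async () => {
  // Start the request; the TLS handshake and socket writes happen asynchronously.
  const pending = client.send(new GetSecretValueCommand({ SecretId: process.env.SECRET_ID }))

  // If unrelated work pins the event loop here, the socket callbacks are delayed,
  // which (per the guess above) could surface as a `write EPROTO` failure.
  busyWait(5000)

  return pending
}
```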

bdevore17 commented 2 years ago

Having the same issue.

bdevore17 commented 2 years ago

@RanVaknin What's the status here? This is crashing mission-critical processes for us and it's been assigned P1 for over a month...

bdevore17 commented 1 year ago

@RanVaknin ?????

trivikr commented 1 year ago

On a quick revisit during a review meeting for issues with p1 labels, we noticed that this issue is likely in Node.js. Search results: https://github.com/search?q=repo%3Anodejs%2Fnode+EPROTO&type=issues

We need to find out whether the issue is with the Node.js setup which Lambda follows, some Node.js configuration which the SDK sets, or a bug in Node.js core itself.

The requirement is to provide minimal repro code which makes multiple Secrets Manager getSecretValue calls. This will help us log more information and find out whether the issue is specific to Lambda, Node.js, or the SDK.
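
For example, repro code along those lines might look like the sketch below (the secret ID, region, and call count are placeholders, not values taken from this thread):

```js
const { SecretsManagerClient, GetSecretValueCommand } = require('@aws-sdk/client-secrets-manager')

const client = new SecretsManagerClient({ region: process.env.AWS_REGION })

exports.handler = async () => {
  // Fire several GetSecretValue calls per invocation to increase the odds of
  // hitting the intermittent EPROTO failure, and log every rejection.
  const results = await Promise.allSettled(
    Array.from({ length: 10 }, () =>
      client.send(new GetSecretValueCommand({ SecretId: process.env.SECRET_ID }))
    )
  )

  const failures = results.filter((r) => r.status === 'rejected')
  failures.forEach((f) => console.error('GetSecretValue failed:', f.reason))

  return { total: results.length, failed: failures.length }
}
```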

For reference, here is a package which attempted to repro an npm ping test failure from CodeBuild: https://github.com/trivikr/aws-codebuild-npm-ping-test

hikarunoryoma commented 1 year ago

Has anyone found that using a newer version of Node makes this issue go away? I am planning on upgrading my version of Node, but I was curious if anyone else has already tried this.

Like the OP, I am also using 14.x, but I am planning on updating to 18.x.

michaelmrn commented 1 year ago

We are also seeing this error regularly now and are wondering if a Node upgrade would help. We are also on Node 14, on the latest SDK packages, and hit it when trying to assume a role with `stsClient.send(assumeRoleCommand)`.
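
For context, the call shape referred to there is roughly the following (the role ARN and session name are placeholders, not values from this thread):

```js
const { STSClient, AssumeRoleCommand } = require('@aws-sdk/client-sts')

const stsClient = new STSClient({ region: process.env.AWS_REGION })

// Placeholder role ARN and session name, just to show the call shape.
const assumeRoleCommand = new AssumeRoleCommand({
  RoleArn: 'arn:aws:iam::123456789012:role/example-role',
  RoleSessionName: 'example-session',
})

async function getTemporaryCredentials() {
  // The intermittent `write EPROTO` reported above surfaces from this send() call.
  const { Credentials } = await stsClient.send(assumeRoleCommand)
  return Credentials
}
```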

james-m-hall commented 1 year ago

We also see this error fairly frequently with @aws-sdk/client-secrets-manager v3.131.0 and a Node 14.x lambda environment.

It looks like a number of closely related issues have been reported, which implies it may not exclusively be a Secrets Manager issue.

dgoemans commented 1 year ago

Regularly getting this with the SSM client v3.229.0 on Node.js 14. It seems like a global issue across many of the clients.

dgoemans commented 1 year ago

Yesterday, after posting the above comment, I decided to upgrade my Lambdas to Node 16, and so far it hasn't happened again. It might be speaking too soon, but @RanVaknin this may be something to pass on to the dev team investigating.

cc @hikarunoryoma (since you asked)

hikarunoryoma commented 1 year ago

@dgoemans Thanks for the heads up! Looking forward to upgrading my lambdas next month and will follow up if I see success on my end!

jeeteshchel commented 1 year ago

We tried the Lambda extension for fetching secrets from Secrets Manager and it has worked quite well: https://docs.aws.amazon.com/secretsmanager/latest/userguide/retrieving-secrets_lambda.html
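
For anyone curious, here is a rough sketch of reading a secret through the extension's local HTTP endpoint, assuming the default port (2773) and header described in the linked docs; SECRET_ID is a placeholder:

```js
const http = require('http')

// Read a secret via the Parameters and Secrets Lambda Extension instead of the SDK.
// Assumes the extension's default localhost port (2773) per the linked docs.
function getSecret(secretId) {
  const options = {
    hostname: 'localhost',
    port: 2773,
    path: `/secretsmanager/get?secretId=${encodeURIComponent(secretId)}`,
    // The extension authenticates requests with the Lambda's session token.
    headers: { 'X-Aws-Parameters-Secrets-Token': process.env.AWS_SESSION_TOKEN },
  }
  return new Promise((resolve, reject) => {
    http.get(options, (res) => {
      let body = ''
      res.on('data', (chunk) => { body += chunk })
      res.on('end', () => resolve(JSON.parse(body).SecretString))
    }).on('error', reject)
  })
}

exports.handler = async () => {
  const secret = await getSecret(process.env.SECRET_ID) // SECRET_ID is a placeholder
  // ...
}
```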

michaelmrn commented 1 year ago

An upgrade to Node 18 appears to have resolved this for us.

dgoemans commented 1 year ago

Indeed, 6 weeks after upgrading to Node 16 we haven't seen the issue again. Seems to be Node 14 only.

hikarunoryoma commented 1 year ago

I updated from Node 14 -> Node 18 and no longer see this issue! Agreed that this is some issue with Node interfacing with the latest AWS SDK.

github-actions[bot] commented 1 year ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs and link to relevant comments in this thread.