aws / aws-sdk-js

AWS SDK for JavaScript in the browser and Node.js
https://aws.amazon.com/developer/language/javascript/
Apache License 2.0

DynamoDB: 500 InternalServerError transactWrite on specific table #3891

Closed mfbx9da4 closed 2 months ago

mfbx9da4 commented 3 years ago

Hi,

Description: We are trying to execute the following transactWrite request

documentClient.transactWrite({ TransactItems: [{ Put: { TableName, Item } }] }).promise()

About 50% of the time we get the following error. Before throwing the error, the request hangs for about 20s.

Error [InternalServerError]: Internal server error
    at Request.extractError (/Users/code/node_modules/aws-sdk/lib/protocol/json.js:52:27)
    at Request.callListeners (/Users/code/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
    at Request.emit (/Users/code/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
    at Request.emit (/Users/code/node_modules/aws-sdk/lib/request.js:688:14)
    at Request.transition (/Users/code/node_modules/aws-sdk/lib/request.js:22:10)
    at AcceptorStateMachine.runTo (/Users/code/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /Users/code/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/Users/code/node_modules/aws-sdk/lib/request.js:38:9)
    at Request.<anonymous> (/Users/code/node_modules/aws-sdk/lib/request.js:690:12)
    at Request.callListeners (/Users/code/node_modules/aws-sdk/lib/sequential_executor.js:116:18) {
  code: 'InternalServerError',
  time: 2021-09-16T17:10:16.935Z,
  requestId: 'FM8089LSJTQNBP762J7V06NJLVVV4KQNSO5AEMVJF66Q9ASUAAJG',
  statusCode: 500,
  retryable: true
}

Things we have checked

At this point we're out of ideas. The only thing we haven't explored is deleting backups but I don't see why this would affect anything.

😱😒

C'mon AWS! Show me the light!

Thanks

Is the issue in the browser/Node.js? Node.js

If on Node.js, are you running this on AWS Lambda? Yes but also fails locally.

Details of the browser/Node.js version v14.16.1

SDK version number 2.989.0

├─┬ aws-cdk@1.102.0
│ ├── aws-sdk@2.866.0
│ └─┬ cdk-assets@1.102.0
│   └── aws-sdk@2.866.0 deduped
├── aws-sdk@2.989.0
├─┬ serverless-domain-manager@3.3.2
│ └── aws-sdk@2.989.0 deduped
├─┬ serverless-offline@6.9.0
│ └── aws-sdk@2.989.0 deduped
└─┬ serverless@2.51.0
  └── aws-sdk@2.989.0 deduped

ajredniwja commented 3 years ago

Hello @mfbx9da4, thanks for providing all the details as it really helps. Now coming to the issue, at my end, not able to reproduce it but according to the information provided the Internal server error suggests something wrong on the AWS side. You mentioned that "All other tables execute fine with this exact same operation." is there anything different with this table? I am going to involve someone from the DynamoDB team to have a look as well, since they'd be able to see the exact request and what is going on; I will update you as soon as I hear back. The work around mentioned in the docs is to retry this error as the request is retry-able. I am still reaching out to the DynamoDB team for more info.

ajredniwja commented 3 years ago

Can you please provide:

=> Region=us-east-1|us-west-1|IAD|PDX etc
=> AccountId=123456789012
=> TableName/IndexName=tablename
=> Request Ids: = (looks like ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456790ABCDEFGH9ASUAAJG)
=> Hawkeye link if available

And as many as you can provide in the following (leave the obvious and already provided info):

  1. What is the error rate you are seeing?

=> Transient failures are expected (should be covered with retries) because of network blips, table undergoing partition splits, leader failovers, concurrent request handling resource exhaustion etc. However, prolonged failure rates would be concerning and not expected.
=> Total volume and Error rate as a percentage.
=> Was it one-off or ongoing? Some graphs on the DynamoDB client side metrics would be useful.

  2. Do you see errors for write/read or consistent read operations?

=> What is the rough distribution (Operations by % ) of your traffic?
=> Please provide as much information as possible regarding your traffic pattern, basic use case etc.

  3. Do you have retry policies setup on your requests? If yes, please let us know the policy you are using.

=> Please confirm that PredefinedRetryPolicies.DYNAMODB_DEFAULT is attached to your ClientConfiguration instance

Recommended ClientConfiguration: https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-dynamodb/src/main/java/com/amazonaws/services/dynamodbv2/AmazonDynamoDBClientConfigurationFactory.java#L31

=> Use the DYNAMODB_DEFAULT retry policy with DYNAMODB_DEFAULT_MAX_ERROR_RETRY = 10 retry count.

Recommended retry policies: https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-core/src/main/java/com/amazonaws/retry/PredefinedRetryPolicies.java#L118

  4. For applications which require very high availability, please consider using eventually consistent (EC) reads instead of consistent reads as EC reads have significantly higher availability.

mfbx9da4 commented 3 years ago

Hello @mfbx9da4, thanks for providing all the details as it really helps.

No problem!

Now coming to the issue, at my end, not able to reproduce it but according to the information provided the Internal server error suggests something wrong on the AWS side.

It does but, as noted, I only get this error when using the javascript SDK which suggests it's something to do with how this SDK forms the request.

You mentioned that "All other tables execute fine with this exact same operation." is there anything different with this table?

Nothing notable other than having a different table name and different partition key / indexes. As mentioned, changing the table name "fixes" the issue and changing the name of the partition key also "fixes" the issue, suggesting there is nothing strange about the table configuration.

The work around mentioned in the docs is to retry this error as the request is retry-able.

This is not a valid workaround given the rate of failure.

Can you please provide:

=> Region=us-east-1
=> AccountId=705843990013
=> TableName=onin-dev-Accounts
=> Request Id: FM8089LSJTQNBP762J7V06NJLVVV4KQNSO5AEMVJF66Q9ASUAAJG, 53K192645GK39RCGVKOI116SLVVV4KQNSO5AEMVJF66Q9ASUAAJG 

Transient failures are expected (should be covered with retries) because of network blips, table undergoing partition splits, leader failovers, concurrent request handling resource exhaustion etc. However, prolonged failure rates would be concerning and not expected.

In production 90% of login requests were failing due to this issue. When testing the exact transactWrite request in serial in a script between 10% and 50% of all transactWrite requests failed. The issue persisted for over 24 hours.
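A serial test script of the kind described above could look roughly like the sketch below. This is not the script that was actually used; `measureFailureRate` and its return shape are hypothetical names chosen for illustration, and the operation under test is passed in as a function so the sketch does not depend on any particular client.

```javascript
// Hypothetical helper (not part of aws-sdk): runs `operation` serially
// `total` times and reports how many calls failed, grouped by error code.
async function measureFailureRate(operation, total) {
  let failures = 0;
  let slowestMs = 0;
  const errorCodes = {};
  for (let i = 0; i < total; i++) {
    const startedAt = Date.now();
    try {
      // Await each call before starting the next one (serial, not parallel).
      await operation(i);
    } catch (err) {
      failures++;
      const code = (err && err.code) || 'Unknown';
      errorCodes[code] = (errorCodes[code] || 0) + 1;
    }
    slowestMs = Math.max(slowestMs, Date.now() - startedAt);
  }
  const failureRate = total === 0 ? 0 : failures / total;
  return { total, failures, failureRate, slowestMs, errorCodes };
}

// Usage, assuming `documentClient`, `TableName` and `Item` are already defined:
// measureFailureRate(
//   () => documentClient.transactWrite({ TransactItems: [{ Put: { TableName, Item } }] }).promise(),
//   100
// ).then(console.log);
```

Running the calls one at a time avoids the transaction conflicts that a parallel run can introduce, so the failure count reflects only errors returned for individual requests.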

Total volume and Error rate as a percentage.

Approximately 10,000 transactWrite requests, with between 10% and 50% of them failing.

Was it one-off or ongoing? Some graphs on the DynamoDB client side metrics would be useful.

As detailed above it was ongoing for several days. The client would hang for about 20s before throwing an internal server error.

[image]

(Note: I wouldn't read too much into the number of conflict errors, as we were trying many things out and the conflict count was likely caused by running a script that executed 200 transactWrites in parallel rather than in serial.)

[image] [image] [image]

Do you see errors for write/read or consistent read operations?

No, putItem and getItem with consistent read were consistently succeeding.

What is the rough distribution (Operations by % ) of your traffic?

This was only in a staging environment and never made it to production, so the traffic was low. However, it was in a critical part of the app, i.e. login/signup. As mentioned, I wrote a script and, over the course of a couple of days, tested in the region of 10,000 requests; 10% to 50% of them failed.

Please provide as much information as possible regarding your traffic pattern, basic use case etc.

Login/signup is when this operation occurs, so it is fairly infrequent, and it was only in a staging environment, so it was only used by internal testers.

Do you have retry policies setup on your requests? If yes, please let us know the policy you are using.

Not on this operation

Please confirm that PredefinedRetryPolicies.DYNAMODB_DEFAULT is attached to your ClientConfiguration instance
Recommended ClientConfiguration: https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-dynamodb/src/main/java/com/amazonaws/services/dynamodbv2/AmazonDynamoDBClientConfigurationFactory.java#L31
Use the DYNAMODB_DEFAULT retry policy with DYNAMODB_DEFAULT_MAX_ERROR_RETRY = 10 retry count.
Recommended retry policies: https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-core/src/main/java/com/amazonaws/retry/PredefinedRetryPolicies.java#L118

I don't think this is relevant. I don't see this option for documentClient but even so the error rate seems way too high for this to be the cause. The error is not transient and it is not at all comparable to the number of errors on other tables.
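For reference, the links above point at the Java SDK. In the JavaScript SDK v2, retry behaviour is, as far as I know, configured through the `maxRetries` and `retryDelayOptions` service options rather than a predefined policy object, and a DocumentClient can be given a preconfigured `AWS.DynamoDB` instance via its `service` option. A minimal sketch under those assumptions follows; `buildDocumentClient` and its parameter names are hypothetical, and the SDK module is passed in as an argument so the snippet stays self-contained.

```javascript
// Sketch: build a DocumentClient whose underlying DynamoDB service uses
// explicit retry settings. `AWS` is expected to be the module returned by
// require('aws-sdk') (v2). `buildDocumentClient` is a hypothetical helper.
function buildDocumentClient(AWS, { maxRetries = 3, baseDelayMs = 50 } = {}) {
  const service = new AWS.DynamoDB({
    maxRetries, // number of retries after the first attempt
    retryDelayOptions: { base: baseDelayMs }, // base for exponential backoff, in ms
  });
  // The DocumentClient sends its requests through the service passed here.
  return new AWS.DynamoDB.DocumentClient({ service });
}

// Usage (requires the aws-sdk v2 package plus configured region/credentials):
// const AWS = require('aws-sdk');
// const documentClient = buildDocumentClient(AWS, { maxRetries: 2 });
```

Lowering `maxRetries` only makes the failure surface faster; it does not address the underlying 500 responses.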

mfbx9da4 commented 3 years ago

The error has started appearing on the new table name TableName: onin-dev-Accounts-2 RequestID VP8BNTB8MMT53RUR18UNHFHAH7VV4KQNSO5AEMVJF66Q9ASUAAJG

ajredniwja commented 3 years ago

The error has started appearing on the new table name TableName: onin-dev-Accounts-2 RequestID VP8BNTB8MMT53RUR18UNHFHAH7VV4KQNSO5AEMVJF66Q9ASUAAJG

Have taken that to the team

mfbx9da4 commented 3 years ago

Thank you very much! I have also upgraded to a support plan and opened a support case: Case ID 8963449741 as this is quite serious for us.

I used yesno to see the underlying HTTP requests. Something like 90% of server responses are 500s. The reason it sometimes succeeds is that the client retries by default, and sometimes one of the retries goes through. The reason it hangs for about 20 seconds before failing outright is that it retries 10 times or so before giving up.
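A rough back-of-the-envelope check of those numbers is sketched below. It rests on assumptions not confirmed in this thread: that each attempt fails independently with the same probability, that the client makes 10 retries, and that the delay before each retry is uniformly random between 0 and `base * 2^retryCount` ms with a 50 ms base (my understanding of the SDK v2 default for DynamoDB). Both function names are hypothetical.

```javascript
// Probability that the first attempt and every retry all fail, assuming
// each attempt fails independently with probability `perAttemptFailure`.
function overallFailureProbability(perAttemptFailure, maxRetries) {
  return Math.pow(perAttemptFailure, maxRetries + 1);
}

// Expected total backoff in ms, assuming the delay before retry i is
// uniformly random in [0, baseMs * 2^i), so its mean is half the upper bound.
function expectedTotalBackoffMs(baseMs, maxRetries) {
  let total = 0;
  for (let i = 0; i < maxRetries; i++) {
    total += (baseMs * Math.pow(2, i)) / 2;
  }
  return total;
}

// 90% of responses failing, 10 retries (assumed values):
console.log(overallFailureProbability(0.9, 10).toFixed(2)); // 0.9^11, about 0.31
console.log(expectedTotalBackoffMs(50, 10) / 1000); // about 25.6 seconds of waiting
```

Under these assumptions roughly a third of calls would exhaust every retry, which falls inside the 10% to 50% range reported earlier, and the waiting time is of the same order as the observed hang of about 20 seconds.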

ajredniwja commented 3 years ago

@mfbx9da4 I'll try to escalate it. DynamoDB already has a higher default retry count (10), so I don't believe bumping it up would help much; the issue might be something else.

mfbx9da4 commented 3 years ago

Thank you very much @ajredniwja 🙂

matthewmonson commented 3 years ago

Hi @mfbx9da4, has this issue been resolved? I'm experiencing the exact same thing.

mfbx9da4 commented 3 years ago

No, still not resolved!

stuartleylandcole commented 3 years ago

@mfbx9da4 unfortunately I'm not able to offer you any help 😞 I just wanted to let you know that we experienced identical behaviour with the Java v2 SDK. Thankfully this has been resolved. See this issue for all the details about investigation and the eventual fix in the SDK.

gustavotemple commented 2 years ago

Hi,

Like here https://github.com/aws/aws-sdk-java-v2/issues/1874, we are trying: "perform a transactional write to two tables"

And the behavior is the same: "About 50% of the time we get the following error. Before throwing the error, the request hangs for about 20s."

SDK version number 2.1053.0

@mfbx9da4, do you have any news?

danobri commented 2 years ago

Just ran into this error for the first time today after tearing down and recreating a stack we have been using for load testing. Only happens on one of the recreated tables, but seems to occur on essentially every request. Any status update on this?

BrianArch96 commented 2 years ago

Is there any update on this?

julescsv commented 2 years ago

I'm facing the same issue; I thought I was crazy. C'mon AWS

BrianArch96 commented 2 years ago

@julescsv So it's not a fix, but there are possible workarounds: rename your table from someTable to someTable-1 or whatever you want. Destroying and recreating the table doesn't work; the table name itself needs to be changed.

Alternatively, spam your table with get requests for a couple of hours to invalidate the cache.

Do you know what caused this to happen in your case, @julescsv? In our instance, we added a TTL to our table whilst running tests against it, and that caused something funky.

julescsv commented 2 years ago

Transactions would throw an Internal server error. I would create a table and load data; the first time around, everything was OK; after I deleted the table and recreated it, Transactions would throw an Internal server error. Renaming the table solved the issue, but I shouldn't have to do that.

The request solver of DynamoDb at AWS may be heavily cached; I hope they can create an API endpoint to allow invalidation for a specific table @alexdebrie

LuizGC commented 1 year ago

Is there any news about this? I am frequently receiving a 500 with a null message.

I am using the AWS SDK for Java v2. It started occurring 3 days ago.

Does anyone know how to get a better error message? A null message alone is not helping me fix the issue.

aBurmeseDev commented 2 months ago

Hey everyone - apologies for the delayed response. After reviewing the comments and linked issues, it appears that there was a known issue with the DynamoDB service where deleting and recreating tables with the same name caused some requests to result in an InternalServerError. However, this issue might have been resolved in a recent version of the service. Could someone please confirm if this is the case? If you are still encountering the same issue with the latest SDK version, kindly provide a minimal reproducible code snippet along with the error trace to assist me in further investigating the matter.

github-actions[bot] commented 2 months ago

This issue has not received a response in 1 week. If you still think there is a problem, please leave a comment to avoid the issue from automatically closing.