aws / aws-sdk-js-v3

Modularized AWS SDK for JavaScript.
Apache License 2.0

S3Client Got Corrupted on a particular container due to TimeoutError causing S3 writes to fail #6344

Open rishi2808-ds opened 1 month ago

rishi2808-ds commented 1 month ago

Describe the bug

The issue arises because when the AWS credentials expire, the AWS SDK makes a call to fetch new credentials and cache them using the memoize method. If this fetch operation fails and results in a TimeoutError, the AWS SDK’s memoize method caches this error. Consequently, subsequent calls retrieve the TimeoutError from the cache instead of attempting to fetch new credentials from AWS.
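To make the failure mode concrete, here is a minimal sketch of how a memoizing wrapper can end up pinning a failure. It is a simplified stand-in written for illustration, not the SDK's actual memoize source; `memoizeNaive`, `demo`, and the flaky provider are hypothetical names.

```javascript
// Simplified stand-in for a memoizing wrapper; NOT the SDK's actual source.
// It stores the first promise it gets, whether that promise resolves or rejects.
function memoizeNaive(provider) {
  let cached;
  return () => {
    if (cached === undefined) {
      cached = provider();
    }
    return cached;
  };
}

async function demo() {
  let calls = 0;

  // Hypothetical provider: times out on the first call, would succeed afterwards.
  const flakyProvider = async () => {
    calls += 1;
    if (calls === 1) {
      const err = new Error("simulated timeout");
      err.name = "TimeoutError";
      throw err;
    }
    return { accessKeyId: "EXAMPLE", secretAccessKey: "EXAMPLE" };
  };

  const getCredentials = memoizeNaive(flakyProvider);
  const outcomes = [];
  for (let i = 0; i < 3; i++) {
    try {
      await getCredentials();
      outcomes.push("ok");
    } catch (e) {
      outcomes.push(e.name);
    }
  }
  return { outcomes, calls };
}
```

Calling `demo()` resolves with three `TimeoutError` outcomes and `calls === 1`: the provider is never invoked again, even though a second attempt would have succeeded.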

To reproduce this issue locally, we removed the credentials from the ~/.aws/credentials file, forcing the SDK to fall back to the fromInstanceMetadata method for obtaining credentials, which mirrors the behaviour in our remote environment.

We then explicitly threw an error within the AWS SDK and observed that while the first attempt to fetch credentials triggered an API call to the Instance Metadata Service, subsequent attempts retrieved the error from the cache instead of making fresh API calls to the Instance Metadata Service.

Below is a screenshot showing the values of the hasResult and result variables in the memoize method, verifying that the TimeoutError is indeed being cached.

image-20240731-194758

Additional logs added.

Screenshot 2024-08-01 at 12 28 34 PM

The first time it is called, we can see the added logs.

Screenshot 2024-07-31 at 9 43 12 PM (1)

Subsequent calls do not show added logs in the AWS SDK, indicating that no new API calls are being made. Instead, we continue to see TimeoutError logs, which means the error is being retrieved from the cache.

Screenshot 2024-07-31 at 7 29 08 PM (1)

SDK version number

@aws-sdk/client-s3@3.11.0

Which JavaScript Runtime is this issue in?

Node.js

Details of the browser/Node.js/ReactNative version

v20.10.0

Reproduction Steps

To reproduce the issue locally:

Explicitly throw a TimeoutError in the httpRequest function located in node_modules/@aws-sdk/credential-provider-imds/dist/cjs/remoteProvider/httpRequest.js. The first time this function is called, it throws the TimeoutError, which then gets cached by memoize. On subsequent calls, the function is not invoked again; instead, the cached TimeoutError is returned.

Observed Behavior

After explicitly throwing an error within the AWS SDK, we observed that the first attempt to fetch credentials triggered an API call to the Instance Metadata Service, while subsequent attempts retrieved the error from the cache instead of making fresh API calls to the Instance Metadata Service.

Expected Behavior

Subsequent calls should query the Instance Metadata Service again to fetch credentials, even when a TimeoutError has been stored in the cache.

Possible Solution

A failed fetch (such as a TimeoutError) should not be served from the cache; subsequent calls should query the Instance Metadata Service again to fetch credentials.
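One way this could look is sketched below. This is a hypothetical wrapper written for illustration, not the SDK's memoize and not a proposed patch against its source; `memoizeRetryOnError` and `demoRetry` are assumed names. The idea is simply to forget a rejected promise so that the next caller triggers a fresh fetch.

```javascript
// Hypothetical alternative wrapper; NOT the SDK's implementation.
// A successful result stays cached, but a rejection clears the cache.
function memoizeRetryOnError(provider) {
  let cached;
  return () => {
    if (cached === undefined) {
      cached = provider().catch((err) => {
        cached = undefined; // forget the failure so the next call retries
        throw err; // still surface the error to the current caller
      });
    }
    return cached;
  };
}

async function demoRetry() {
  let calls = 0;

  // Hypothetical provider: times out on the first call, succeeds afterwards.
  const flakyProvider = async () => {
    calls += 1;
    if (calls === 1) {
      const err = new Error("simulated timeout");
      err.name = "TimeoutError";
      throw err;
    }
    return { accessKeyId: "EXAMPLE", secretAccessKey: "EXAMPLE" };
  };

  const getCredentials = memoizeRetryOnError(flakyProvider);
  const outcomes = [];
  for (let i = 0; i < 3; i++) {
    try {
      await getCredentials();
      outcomes.push("ok");
    } catch (e) {
      outcomes.push(e.name);
    }
  }
  return { outcomes, calls };
}
```

Here `demoRetry()` resolves with outcomes `TimeoutError`, `ok`, `ok` and `calls === 2`: the first failure is reported once, the second call retries, and the third is served from the cache. A real fix would also need to respect credential expiry, which this sketch leaves out.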

Additional Information/Context

No response

giri-sh-irke commented 1 month ago

@rishi2808-ds Thanks for posting this in detail. I am seeing a similar issue in my production environment as well. The error gets cached causing all subsequent requests to fail for the container.

aBurmeseDev commented 1 month ago

Hi @rishi2808-ds - thanks for reaching out and providing the detailed explanation.

To better understand and investigate this issue, it would be helpful if you could provide a minimal reproducible code snippet or example. Having a concise code sample that demonstrates the problem you're encountering will assist me in reproducing and analyzing the issue more effectively on my end.

While the information you've provided so far is valuable, having a minimal reproducible code example will allow me to isolate the problem and potentially uncover any nuances or edge cases related to the credential fetching and caching behavior you've described.

Please feel free to share a simplified version of your code, ensuring that it captures the essence of the issue without any unnecessary complexities. This will streamline the investigation process and enable us to collaborate more efficiently in identifying the root cause and potential solutions.

Best, John

rishi2808-ds commented 1 month ago

To reproduce this issue locally, remove the credentials from the ~/.aws/credentials file, forcing the SDK to fall back to the fromInstanceMetadata method for obtaining credentials, which mirrors the behaviour in the remote environment.

Explicitly throw a TimeoutError in the httpRequest function located in node_modules/@aws-sdk/credential-provider-imds/dist/cjs/remoteProvider/httpRequest.js. The first time this function is called, it throws the TimeoutError, which then gets cached by memoize. On subsequent calls, the function is not invoked again; instead, the cached TimeoutError is returned.

Below are the code changes we made in the httpRequest.js file.


```js
Object.defineProperty(exports, "__esModule", { value: true });
exports.httpRequest = void 0;
const property_provider_1 = require("@aws-sdk/property-provider");
const buffer_1 = require("buffer");
const http_1 = require("http");
// Added for the reproduction: ensures the simulated failure fires only once.
var flag1 = false;
/**
 * @internal
 */
function httpRequest(options) {
    return new Promise((resolve, reject) => {
        // Added for the reproduction: reject the very first call with a simulated timeout.
        // reject() does not stop execution, so the request below is still sent;
        // its outcome is ignored because the promise is already settled.
        if (!flag1) {
            console.log("http----->0")
            flag1 = true
            reject(new Error("TimeoutError1"));
        }
        const req = http_1.request({ method: "GET", ...options });
        // Added for the reproduction: logged whenever a real request is issued.
        console.log("http----->1")
        req.on("error", (err) => {
            reject(Object.assign(new property_provider_1.ProviderError("Unable to connect to instance metadata service"), err));
        });
        req.on("timeout", () => {
            reject(new Error("TimeoutError"));
        });
        req.on("response", (res) => {
            const { statusCode = 400 } = res;
            if (statusCode < 200 || 300 <= statusCode) {
                reject(Object.assign(new property_provider_1.ProviderError("Error response received from instance metadata service"), { statusCode }));
            }
            const chunks = [];
            res.on("data", (chunk) => {
                chunks.push(chunk);
            });
            res.on("end", () => {
                resolve(buffer_1.Buffer.concat(chunks));
            });
        });
        req.end();
    });
}
exports.httpRequest = httpRequest;
```