aws / aws-sdk-js

AWS SDK for JavaScript in the browser and Node.js
https://aws.amazon.com/developer/language/javascript/
Apache License 2.0

IRSA with aws-sdk getting "InvalidToken: The provided token is malformed or otherwise invalid." #4140

Closed james64 closed 1 month ago

james64 commented 2 years ago

Describe the bug

This has already been reported (see #3697, for example), but that issue is closed, so I am opening a new one.

Using the JS aws-sdk with IRSA auth to upload a file to an S3 bucket results in InvalidToken: The provided token is malformed or otherwise invalid.

Running aws s3 cp <file> s3://<bucket> in a pod is successful. Running the same upload via the JS SDK (see reproduction steps) results in the error.

Expected Behavior

Successful file upload.

Current Behavior

Running the reproduction JS script (see below) produces this log:

[AWS sts 200 1.94s 0 retries] assumeRoleWithWebIdentity({
  WebIdentityToken: 'eyJ...wJg',
  RoleArn: 'arn:aws:iam::<acccountNum>:role/<roleName>',
  RoleSessionName: 'token-file-web-identity'
})
[AWS s3 400 1.978s 0 retries] putObject({
  Key: 'res/putobject',
  Body: <Buffer 74 65 73 74>,
  Bucket: '<bucket>'
})
(node:9259) UnhandledPromiseRejectionWarning: InvalidToken: The provided token is malformed or otherwise invalid.
    at Request.extractError (/uloha/node_modules/aws-sdk/lib/services/s3.js:711:35)
    at Request.callListeners (/uloha/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
    at Request.emit (/uloha/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
    at Request.emit (/uloha/node_modules/aws-sdk/lib/request.js:686:14)
    at Request.transition (/uloha/node_modules/aws-sdk/lib/request.js:22:10)
    at AcceptorStateMachine.runTo (/uloha/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /uloha/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/uloha/node_modules/aws-sdk/lib/request.js:38:9)
    at Request.<anonymous> (/uloha/node_modules/aws-sdk/lib/request.js:688:12)
    at Request.callListeners (/uloha/node_modules/aws-sdk/lib/sequential_executor.js:116:18)
(node:9259) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:9259) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Reproduction Steps

In k8s, run a pod with IRSA set up. In the pod, run the test JS script taken from #3697, with "aws-sdk": "^2.1164.0" as a dependency:

const aws = require('aws-sdk');

aws.config.update({
  logger: console,
});

const s3 = new aws.S3({
  region: 'me-south-1',
  params: {
    Bucket: 'oneid-doc-sign-prs'
  },
});

(async function() {
  const response1 = await s3
    .putObject({ Key: 'res/putobject', Body: Buffer.from('test') })
    .promise();
  console.log(response1);
  console.log('done1');

  const response2 = await s3.upload({ Key: 'res/upload', Body: Buffer.from('test') }).promise();
  console.log(response2);
  console.log('done2');
})();

Optionally, run the same upload through the aws-cli to verify it works.
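A side note on the UnhandledPromiseRejectionWarning in the log above: the reproduction script's async IIFE has no catch block. A minimal sketch of handling the rejection explicitly (the failing call is simulated here; it is not a real SDK call):

```javascript
// Stand-in for s3.putObject(...).promise() rejecting with InvalidToken.
async function failingPutObject() {
  const err = new Error('The provided token is malformed or otherwise invalid.');
  err.code = 'InvalidToken';
  throw err;
}

(async function () {
  try {
    await failingPutObject();
  } catch (err) {
    // Surfaces the error cleanly instead of an UnhandledPromiseRejectionWarning.
    console.error(`${err.code}: ${err.message}`);
  }
})();
```

This does not change the underlying failure; it only makes the error visible without the deprecation warnings.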

Possible Solution

No idea to be honest :)

Additional Information/Context

Uploading to the bucket works fine through other methods (for example, the aws-cli).

SDK version used

2.1164.0

Environment details (OS name and version, etc.)

Ubuntu image on an Amazon Linux host OS. Kubernetes 1.21.

james64 commented 1 year ago

Same issue with 2.1216.0 still.

ajredniwja commented 1 year ago

@james64 thanks for opening this issue, and apologies that it fell out of the queue. I am getting a similar error too; I'll investigate more and post my findings.

ajredniwja commented 1 year ago

Running the script with the latest version doesn't error out for me.

{
  Expiration: 'expiry-date="Sun, 23 Oct 2022 00:00:00 GMT", rule-id="YzZhYjc4MmEtYTAzNS00ZGY0LWIwYmItYWZhisdhknmsid"',
  ETag: '"098f6bcd4621d373cade48789283ef3"',
  VersionId: '_WpryRFHXN09GqqtPjDidajojda93'
}
done1
{
  Expiration: 'expiry-date="Sun, 23 Oct 2022 00:00:00 GMT", rule-id="YzZhYjc4MmEtYTAzNS00ZGY0LWIwYmItYewiouweiojJHidhoj"',
  ETag: '"098f6bcd4621d373cade4e832340940294Kj"',
  VersionId: 'AlhdjSKjdkAL9vpBe2235pE6arQPoEN21',
  Location: 'https://bucket.us-west-2.amazonaws.com/res/upload',
  key: 'res/upload',
  Key: 'res/upload',
  Bucket: 'bucket'
}
done2

Can you please share the steps you follow for setting up the credentials?

james64 commented 1 year ago

Thanks for trying this out. Our setup:

  1. Spin up EKS cluster
  2. Create OIDC IAM identity provider with url from EKS cluster
  3. Create policy allowing bucket access
  4. Use this terraform module (in this version) to create an IAM role assumable by the OIDC identity. We add a condition allowing only the service account from the given namespace to assume the role (these must be equal: "oidc.eks.me-south-1.amazonaws.com/id/<clusterid>:sub" = "system:serviceaccount:exampleNs:exampleAccount"). Attach the bucket policy to this role.
  5. In exampleNs, create the exampleAccount service account and annotate it with eks.amazonaws.com/role-arn: <arn_of_oidc_assumable_role>
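For reference, the role trust policy produced by steps 2-4 should look roughly like the following. This is a sketch: the account number, OIDC provider ID, namespace, and service account names are placeholders matching the steps above.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<accountNum>:oidc-provider/oidc.eks.me-south-1.amazonaws.com/id/<clusterid>"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.me-south-1.amazonaws.com/id/<clusterid>:sub": "system:serviceaccount:exampleNs:exampleAccount"
        }
      }
    }
  ]
}
```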

Then we spin up a pod that just runs Ubuntu with a long sleep, using the example service account. In this pod:

$ apt-get update
$ apt-get install awscli npm vim
$ mkdir test && cd test
$ npm init --yes
$ vim package.json # add dependency for "aws-sdk": "^2.1233.0"
$ vim run.js # copy paste reproduction script verbatim
$ npm install 
$ node run.js
... produces same error as stated above ...
  (node:9134) UnhandledPromiseRejectionWarning: InvalidToken: The provided token is malformed or otherwise invalid.

$ aws --region me-south-1 s3 cp package.json s3://oneid-doc-sign-prs/
upload: ./package.json to s3://oneid-doc-sign-prs/package.json # upload successful

$ env | grep AWS | grep -o '^.*=' # to see that no other AWS envs are set
AWS_DEFAULT_REGION=
AWS_REGION=
AWS_ROLE_ARN=
AWS_WEB_IDENTITY_TOKEN_FILE=

An awscli upload immediately afterwards using IRSA credentials worked. The Node example script with the latest version failed. Do you see any difference between our setup and yours?
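The same environment check can be done from inside Node, which shows exactly what the SDK's environment credential provider would see. A sketch (nothing here is SDK-specific):

```javascript
// List AWS-related environment variable names, mirroring the shell one-liner.
// In the v2 default chain, static keys (AWS_ACCESS_KEY_ID /
// AWS_SECRET_ACCESS_KEY) take precedence over the web-identity token file,
// so only the IRSA and region variables should appear here.
const awsEnvNames = Object.keys(process.env)
  .filter((name) => name.startsWith('AWS'))
  .sort();
console.log(awsEnvNames.join('\n'));
```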

james64 commented 1 year ago

@ajredniwja any luck replicating the issue? Maybe you can share a description of your setup so I can help spot the difference.

BMayhew commented 1 year ago

I ran into this issue and solved it by ensuring that the environment variables below were not set (we migrated from using the secret key to OIDC). I had to unset them in our CI pipeline before uploads succeeded.

AWS_SECRET_KEY, AWS_SECRET_ACCESS_KEY

I came to this conclusion based on this doc and the order in which credentials load: https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html
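If unsetting the variables in the pipeline itself is awkward, the same effect can be had in-process before the first client is constructed. A sketch; the variable names are the standard static-credential ones, so adjust the list to whatever your CI actually exports:

```javascript
// Remove static-credential variables so the default provider chain falls
// through to the web-identity (IRSA) token-file provider. This must run
// before any SDK client is created.
const staticCredentialVars = [
  'AWS_ACCESS_KEY_ID',
  'AWS_SECRET_ACCESS_KEY',
  'AWS_SESSION_TOKEN',
];
for (const name of staticCredentialVars) {
  delete process.env[name];
}
```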

james64 commented 1 year ago

@BMayhew thanks for the post :+1: However, we do not have these env vars set. The list of AWS-related env vars we do set is in one of my previous posts.

RanVaknin commented 1 month ago

Hi @james64,

I found this issue while combing through our v2 backlog. It sounds like the AssumeRoleWithWebIdentity call that the SDK makes to STS under the hood, to exchange the OIDC token for a set of credentials, is failing. The SDK's EKS credential provider attempts to read the token from disk. My guess is that it either fails to read the token from the file system, or the token is in a format it does not expect.

I'm not sure why this fails in the SDK but works with the CLI, as it is hard to pinpoint the exact point of failure in this flow.

You can do the following:

  1. Since this issue was reported over a year ago, try pulling the latest version of the SDK; perhaps this has been addressed.
  2. Once the pod is up and running, exec into it and log the contents of the token to see whether it is actually written to disk.
  3. Try to log the internal STS call that the SDK makes under the hood. Unfortunately, I don't think the JS v2 SDK can log that internal call, but you might be able to see it in CloudTrail, or by monitoring the network activity on the pod with a network profiling tool. That would show you the outgoing request and whether the internal STS call is sent correctly.

Finally, I would say consider upgrading to v3. The EKS credential provider is implemented differently there, and the SDK offers much better logging capabilities, allowing you to do more debugging yourself.

FWIW I just tested it with the v2 SDK on my EKS cluster and it works perfectly:

$ kubectl exec --stdin --tty repro -- /bin/bash

bash-5.2# cd repro/
bash-5.2# cat v2.js 
const AWS = require('aws-sdk');

AWS.config.logger = console;
const ssm = new AWS.SSM();

(async () => {
    try {
        const params = {
            Name: 'some-name',
            WithDecryption: true
        };
        const response = await ssm.getParameter(params).promise();
        console.log(response);
    } catch (error) {
        console.log(error);
    }
})();
bash-5.2# node v2.js 
(node:778) NOTE: The AWS SDK for JavaScript (v2) will enter maintenance mode
on September 8, 2024 and reach end-of-support on September 8, 2025.

Please migrate your code to use AWS SDK for JavaScript (v3).
For more information, check blog post at https://a.co/cUPnyil
(Use `node --trace-warnings ...` to show where the warning was created)
[AWS sts 200 0.137s 0 retries] assumeRoleWithWebIdentity({
  WebIdentityToken: '***SensitiveInformation***',
  RoleArn: 'arn:aws:iam::REDACTED:role/REDACTED',
  RoleSessionName: 'token-file-web-identity'
})
[AWS ssm 200 0.191s 0 retries] getParameter({ Name: 'some-name', WithDecryption: true })
{
  Parameter: {
    Name: 'some-name',
    Type: 'String',
    Value: 'some-value',
    Version: 1,
    LastModifiedDate: 2024-07-03T20:13:37.758Z,
    ARN: 'arn:aws:ssm:us-east-1:REDACTED:parameter/some-name',
    DataType: 'text'
  }
}

Let me know how it goes. Ran~

james64 commented 1 month ago

@RanVaknin thanks a lot for digging into this issue. Unfortunately, I am no longer on the project where we encountered this issue. Also, I believe the particular service where this happened has been migrated to a different auth method. I would love to investigate more, but I do not think I can recreate the setup exactly as it was. So let's just close this issue. Thanks again.