aws / aws-sdk-js

AWS SDK for JavaScript in the browser and Node.js
https://aws.amazon.com/developer/language/javascript/
Apache License 2.0

Intermittent "EC2 Metadata roleName request returned error" (EINVAL) on ECS Fargate #3284

Open summera opened 4 years ago

summera commented 4 years ago

Describe the bug

I am running a Node 12.16 app on ECS Fargate. It's performing operations on files in S3 - streaming from a source bucket and uploading to a destination bucket. About 5 hours ago I started to see the following error when uploading to the destination bucket:

    "originalError": {
        "message": "Could not load credentials from any providers",
        "errno": "EINVAL",
        "code": "CredentialsError",
        "syscall": "connect",
        "address": "169.254.169.254",
        "port": 80,
        "time": "2020-05-28T14:32:43.621Z",
        "originalError": {
            "message": "EC2 Metadata roleName request returned error",
            "errno": "EINVAL",
            "code": "EINVAL",
            "syscall": "connect",
            "address": "169.254.169.254",
            "port": 80,
            "time": "2020-05-28T14:32:43.620Z",
            "originalError": {
                "errno": "EINVAL",
                "code": "EINVAL",
                "syscall": "connect",
                "address": "169.254.169.254",
                "port": 80,
                "message": "connect EINVAL 169.254.169.254:80 - Local (0.0.0.0:0)"
            }
        }
    }

It happened for several minutes and then stopped. Then it happened again for a couple of minutes about an hour ago and stopped. So it's intermittent. This seems very similar to what was reported in https://github.com/aws/aws-sdk-js/issues/2534#issuecomment-465308420 and asked on the forum here, but that has received no answer. I'm using a task role that has PUT permissions on the destination bucket. As I said, this is intermittent, so when it's not happening, everything is working as it should. For some reason, there seems to be an issue pulling credentials from the metadata service.

I'm going to update the SDK to the latest version to see if that resolves it, but I didn't see anything in the changelog that would indicate it would. Any guidance would be greatly appreciated. Thanks!
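
For context, here is a minimal sketch of the kind of streaming copy described above; the bucket names and key are placeholders, not taken from the original report:

    const AWS = require('aws-sdk');
    const s3 = new AWS.S3();

    // Stream the object out of the source bucket and upload it to the
    // destination bucket without buffering the whole file in memory.
    const body = s3
      .getObject({ Bucket: 'source-bucket', Key: 'path/to/file' })
      .createReadStream();

    s3.upload({ Bucket: 'destination-bucket', Key: 'path/to/file', Body: body })
      .promise()
      .then(() => console.log('copy complete'))
      .catch((err) => console.error('upload failed', err));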

Is the issue in the browser/Node.js? Node.js

If on Node.js, are you running this on AWS Lambda? No

SDK version number v2.647.0

ajredniwja commented 4 years ago

Hey @summera, thank you for reaching out to us. While this is very hard to reproduce, is it possible to explicitly set your credentials so that the SDK doesn't touch the metadata service, depending on your use case? I understand that shouldn't be the long-term workaround, but I would need something more concrete to show to the service team - ideally something reproducible.

Would you be able to share your logs?
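
For reference, a minimal sketch of the workaround suggested above, i.e. setting credentials explicitly so the SDK never falls back to the instance metadata service (the environment variable names are the standard ones; whether this is acceptable depends on your security requirements, since a task role is normally preferable):

    const AWS = require('aws-sdk');

    // Explicit static credentials; the SDK will not consult the metadata
    // service when credentials are already set on the global config.
    AWS.config.credentials = new AWS.Credentials({
      accessKeyId: process.env.AWS_ACCESS_KEY_ID,
      secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
      sessionToken: process.env.AWS_SESSION_TOKEN, // optional
    });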

summera commented 4 years ago

Hi @ajredniwja. Thank you for the response. After the issue occurred, I updated the SDK to 2.685.0. I also realized that the issue happened during a spike in requests so I scaled up the minimum tasks by one. Since then, I haven't seen the issue occur again. The JSON I included in my first comment (https://github.com/aws/aws-sdk-js/issues/3284#issue-626634869) is coming straight from my logs. Is there something else you were looking to see from the logs?

As for reproducing, I haven't seen this happen since upgrading and scaling up our minimum tasks. However, since this happened during high load, when a lot of requests came in and therefore many parallel uploads to S3 were in flight, I'm wondering if one or more of the following may be possibilities:

  • The metadata service in Fargate failed to respond under high load for one reason or another.
  • The SDK is (or was) not caching credentials retrieved from the metadata service and was therefore hitting it more often than necessary, bombarding it with requests.
  • Some transient issue happened with the Fargate service and has since been resolved.

Do any of the above sound plausible?

ajredniwja commented 4 years ago

Do any of the above sound plausible?

I cannot point you towards any of those with complete certainty because we don't have any concrete evidence.

Can you use the following and collect logs for both cases? That way we can compare and come to some conclusion:

NODE_DEBUG=cluster,net,http,fs,tls,module,timers node app.js

summera commented 4 years ago

I cannot point you towards any of those with complete certainty because we don't have any concrete evidence.

Makes sense, though I was only asking about plausibility. If any of those are not plausible, it makes it easier to focus efforts.

Can you use the following and collect logs for both cases? That way we can compare and come to some conclusion

Which two cases are you referring to exactly?

ajredniwja commented 4 years ago

Which two cases are you referring to exactly?

I was talking about the case where you see the error and the case where you don't, but I think that might be very hard to catch since this is an intermittent error.

summera commented 4 years ago

I was talking about the case where you see the error and the case where you don't, but I think that might be very hard to catch since this is an intermittent error.

Yeah, as I mentioned above, I haven't seen this happen since updating the SDK and increasing the minimum ECS tasks by one, so I don't have any logs to share of it happening again. The fact that it was intermittent and is hard to reproduce is why I was asking what might be plausible, to see if it's worth descending the rabbit hole and spending time to investigate further.

gwdp commented 3 years ago

Hi everyone, I'm having exactly the same issue @summera reported, with almost the same setup. It's very intermittent: we have 10-15 clusters receiving a few thousand requests, and the issue seems to arise about once a week, so it's very rare! I had to set CloudWatch alarms with a log filter to catch those occurrences, so I'm monitoring very closely.

ECS task, Fargate managed, Node.js 13 image built from node:13.10-alpine, task deployed through CF, with a few ENVs set (nothing new). At the code level I'm using aws-sdk 2.701.0, and since my first access is usually to DynamoDB, the issue arises when querying Dynamo.

The weirdest thing is that the issue arises in a task that has been running for quite a long time, and in the middle of a bunch of successful requests. Given that, I would rule out any configuration issue, but not the SDK; however, the clues (for me) point to the ECS metadata service being unavailable for some reason.

One detail is that we use New Relic on some apps, so the stack trace is polluted for debugging purposes.

Any thoughts?

 ckcpb72fh02l401x5470g8ctn-ckcpb72fh02l501x53ikgcj1p [ERROR] [] CredentialsError: Missing credentials in config, if using AWS_CONFIG_FILE, set AWS_SDK_LOAD_CONFIG=1 - Error: connect EINVAL 169.254.169.254:80 - Local (0.0.0.0:0)
    at internalConnect (net.js:921:16)
    at defaultTriggerAsyncIdScope (internal/async_hooks.js:313:12)
    at net.js:1011:9
    at Shim.applySegment (/usr/src/httpd/node_modules/newrelic/lib/shim/shim.js:1430:20)
    at wrapper (/usr/src/httpd/node_modules/newrelic/lib/shim/shim.js:2092:17)
    at processTicksAndRejections (internal/process/task_queues.js:79:11)

yashbhavsar007 commented 3 years ago

Hi, I am following this doc (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-identity-documents.html) to select the region dynamically in AWS. When I tried to test the code on AWS ECS Fargate, it gave me the error below:

 { Error: connect EINVAL 169.254.169.254:80 - Local (0.0.0.0:0) 
    at internalConnect (net.js:882:16)        
    at defaultTriggerAsyncIdScope (internal/async_hooks.js:294:19)         
    at defaultTriggerAsyncIdScope (net.js:972:9)            
    at process._tickCallback (internal/process/next_tick.js:61:11)        
    errno: 'EINVAL',            
    code: 'EINVAL',            
    syscall: 'connect',            
    address: '169.254.169.254',            
    port: 80 
    }

However, it runs perfectly as an ECS EC2 task. I use "aws-sdk": "^2.701.0". It's JS code in a Docker container. Any solution appreciated.
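
One possible workaround (a sketch only, not an official recommendation): on Fargate the EC2 instance identity document is not available, but platform version 1.4.0 sets ECS_CONTAINER_METADATA_URI_V4, and the task metadata it serves contains the task ARN, from which the region can be derived. The helper below is hypothetical:

    const http = require('http');

    // Derive the region from the ECS task metadata endpoint v4 (Fargate).
    function getRegionFromTaskMetadata(callback) {
      const base = process.env.ECS_CONTAINER_METADATA_URI_V4;
      if (!base) return callback(new Error('ECS task metadata v4 not available'));

      http.get(`${base}/task`, (res) => {
        let data = '';
        res.on('data', (chunk) => (data += chunk));
        res.on('end', () => {
          // TaskARN looks like arn:aws:ecs:<region>:<account>:task/...
          const region = JSON.parse(data).TaskARN.split(':')[3];
          callback(null, region);
        });
      }).on('error', callback);
    }

    getRegionFromTaskMetadata((err, region) => console.log(err || region));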

gwdp commented 3 years ago

A few occurrences this week. @ajredniwja, do you believe it is better to open an internal ticket for this? Getting worried.

rescio commented 3 years ago

Yes. Let's open a ticket. We will need to subscribe to dev support on their prod account. I think you can do this using your role; otherwise use the root account. Can you do this, please?


ms10398 commented 3 years ago

Getting the same issue, @ajredniwja.

samsullivan commented 3 years ago

Same issue here; I definitely think it has something to do with ECS Fargate, although it does work on some of my S3 put object requests. I tried to disable this request with AWS_EC2_METADATA_DISABLED, but the error still happens; now it is:

CredentialsError: Missing credentials in config, if using AWS_CONFIG_FILE, set AWS_SDK_LOAD_CONFIG=1

I don't use AWS_* env vars for credentials, since the ECS Fargate task has access to S3 via my task's IAM role.


Using AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY env vars to use an IAM User works, but I should be able to rely on the IAM role built into the ECS task.
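
One thing worth trying (a sketch, assuming the task role populates AWS_CONTAINER_CREDENTIALS_RELATIVE_URI as it should on Fargate): point the SDK explicitly at the ECS container credentials provider and give it generous timeouts and retries, so it never falls back to the EC2 IMDS address; the numbers below are illustrative:

    const AWS = require('aws-sdk');

    // Use the ECS container credentials endpoint (169.254.170.2) exposed to
    // tasks with an IAM task role, instead of the EC2 instance metadata service.
    AWS.config.credentials = new AWS.ECSCredentials({
      httpOptions: { timeout: 5000 }, // 5s per attempt
      maxRetries: 10,                 // retry transient metadata failures
    });

    const s3 = new AWS.S3();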

devlargs commented 3 years ago

I keep getting this issue even though my configuration is correct.

(screenshot of the error)

joshm91 commented 3 years ago

Same issue here. Node.js running on Fargate, SDK version 2.745.0.

Missing credentials in config, if using AWS_CONFIG_FILE, set AWS_SDK_LOAD_CONFIG=1

snash1338 commented 3 years ago

Seeing this exact issue as well. IAM role needs to be fixed

bgardella commented 3 years ago

Bump. We are seeing this too. ECS/Fargate and node.

    Error: ENOENT: no such file or directory, open '/root/.aws/config'
        at Object.openSync (fs.js:440:3)
        at /usr/src/app/node_modules/dd-trace/packages/dd-trace/src/tracer.js:91:53
        at /usr/src/app/node_modules/dd-trace/packages/dd-trace/src/tracer.js:43:56
        at Scope._activate (/usr/src/app/node_modules/dd-trace/packages/dd-trace/src/scope/async_hooks.js:51:14)
        at Scope.activate (/usr/src/app/node_modules/dd-trace/packages/dd-trace/src/scope/base.js:12:19)
        at DatadogTracer.trace (/usr/src/app/node_modules/dd-trace/packages/dd-trace/src/tracer.js:43:35)
        at Object.openSync (/usr/src/app/node_modules/dd-trace/packages/dd-trace/src/tracer.js:91:23)
        at Object.readFileSync (fs.js:342:35)
        at /usr/src/app/node_modules/dd-trace/packages/dd-trace/src/tracer.js:91:53
        at /usr/src/app/node_modules/dd-trace/packages/dd-trace/src/tracer.js:43:56
        at Scope._activate (/usr/src/app/node_modules/dd-trace/packages/dd-trace/src/scope/async_hooks.js:51:14)
        at Scope.activate (/usr/src/app/node_modules/dd-trace/packages/dd-trace/src/scope/base.js:12:19)
        at DatadogTracer.trace (/usr/src/app/node_modules/dd-trace/packages/dd-trace/src/tracer.js:43:35)
        at Object.readFileSync (/usr/src/app/node_modules/dd-trace/packages/dd-trace/src/tracer.js:91:23)
        at Object.readFileSync (/usr/src/app/node_modules/aws-sdk/lib/util.js:95:26)
        at IniLoader.parseFile (/usr/src/app/node_modules/aws-sdk/lib/shared-ini/ini-loader.js:6:47)
        at IniLoader.loadFrom (/usr/src/app/node_modules/aws-sdk/lib/shared-ini/ini-loader.js:56:30)
        at isEndpointDiscoveryApplicable (/usr/src/app/node_modules/aws-sdk/lib/discover_endpoint.js:299:58)
        at Request.discoverEndpoint (/usr/src/app/node_modules/aws-sdk/lib/discover_endpoint.js:328:8)
        at Request.callListeners (/usr/src/app/node_modules/aws-sdk/lib/sequential_executor.js:102:18)
        at Request.emit (/usr/src/app/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
        at Request.emit (/usr/src/app/node_modules/aws-sdk/lib/request.js:683:14)
        at Request.transition (/usr/src/app/node_modules/aws-sdk/lib/request.js:22:10)
        at AcceptorStateMachine.runTo (/usr/src/app/node_modules/aws-sdk/lib/state_machine.js:14:12)
        at /usr/src/app/node_modules/aws-sdk/lib/state_machine.js:26:10
        at Request.<anonymous> (/usr/src/app/node_modules/aws-sdk/lib/request.js:38:9)
        at Request.<anonymous> (/usr/src/app/node_modules/aws-sdk/lib/request.js:685:12)
        at Request.callListeners (/usr/src/app/node_modules/aws-sdk/lib/sequential_executor.js:116:18)

antonpirker commented 3 years ago

I had the same problem. It cost me quite a headache because I had this running on AWS Fargate, and debugging is not that easy there.

The error means the JavaScript SDK cannot find the AWS credentials. If nothing is configured, the SDK tries to load the credentials from different places. Here you can see the order in which the SDK tries to load the credentials: https://docs.aws.amazon.com/sdk-for-javascript/v2/developer-guide/setting-credentials-node.html

My error was quite embarrassing, I just had a typo in my environment variables. My variable was AWS_ACCESSS_KEY_ID instead of AWS_ACCESS_KEY_ID. (Quite hard to see the difference, right?)

So probably double-check the names of your environment variables (or config files).
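
If it is unclear which provider in that chain the SDK ends up using, a small check like the following can help (a sketch; the logged constructor name, e.g. EnvironmentCredentials or ECSCredentials, indicates the winning provider):

    const AWS = require('aws-sdk');

    // Force credential resolution and report where the credentials came from.
    AWS.config.getCredentials((err) => {
      if (err) {
        console.error('Credential resolution failed:', err.message);
      } else {
        console.log('Resolved credentials via:', AWS.config.credentials.constructor.name);
      }
    });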

samsullivan commented 3 years ago

@antonpirker you're supposed to be able to pass an IAM role to a Task's containers in ECS, meaning you should be able to use the Node SDK w/o relying on access/secret IAM keys.

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-iam-roles.html

haruharuharuby commented 3 years ago

I encountered the same error and have been trying to fix it. My guess: is the ENI temporarily (or consistently) down? I focused on the 169.254.x.x IP address in that error. In my case, when the error happened, other AWS API calls (not only the S3 put) showed the same behavior. I'll try to confirm this assumption.

ajredniwja commented 3 years ago

Hey everyone, if there is a reproducible case can you please share it? An internal ticket was opened for this, but no reproducible case was provided. It seems to happen under high memory/CPU usage; retrying the request is worth considering.
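
For anyone who wants to try the retry route, a minimal sketch of tuning the SDK v2 retry behaviour (the numbers are placeholders, not recommendations):

    const AWS = require('aws-sdk');

    // Raise the retry count and backoff so a transient credentials/metadata
    // hiccup is retried instead of surfacing immediately.
    AWS.config.update({
      maxRetries: 5,
      retryDelayOptions: { base: 300 }, // exponential backoff base, in ms
    });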

y04nqt commented 3 years ago

(screenshot of the error)

I can either enable or disable my AWS_CONFIG_FILE with the same result. I'm also using AWS.config.update() to update my credentials every time my Lambda runs, so I have credentials in the recommended credentials file and I'm also explicitly updating them on the fly to values that worked last week. I'm trying to trigger a Lambda from my invocation Lambda: in short, PHP sends a cURL request to invoke the Lambda, and the invoker then triggers a cron Lambda to run instantly. I'm attempting to run all of this locally; it worked in the past, but I haven't found what made it work given the current issue I'm encountering. I wouldn't consider it intermittent, but something happens that determines whether AWS can load the credentials properly. I think I got lucky with some specific unknown action, rather than it magically getting the credentials or not. @ajredniwja I can hop on a call and we can do debugging together if necessary.

Update: the Lambda system does work in the AWS test environment; this issue only occurs for me locally.

y04nqt commented 3 years ago

I also took another route, trying SQS/SNS locally. I got all the streams and connection points tied together using the AWS CLI.

(screenshot of the error)

@ajredniwja I'm able to reproduce this in at least two different ways now.

samsullivan commented 3 years ago

FYI, it may not be a reasonable solution for everyone, but I confirmed that ECS Fargate works just fine using the v3 AWS Node SDK, which reached General Availability on 12/15: https://aws.amazon.com/blogs/developer/modular-aws-sdk-for-javascript-is-now-generally-available/ https://github.com/aws/aws-sdk-js-v3
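
For illustration, a minimal v3 sketch (bucket, key and region are placeholders); the modular v3 client resolves credentials from the container credential provider on Fargate without extra configuration:

    const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');

    const client = new S3Client({ region: 'us-east-1' });

    client
      .send(new PutObjectCommand({ Bucket: 'destination-bucket', Key: 'hello.txt', Body: 'hello' }))
      .then(() => console.log('uploaded'))
      .catch((err) => console.error('upload failed', err));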

y04nqt commented 3 years ago

I'll follow up with the fix that worked for me:

I needed to explicitly run aws configure in the Docker image. Even though my container had all of the /.aws/ contents copied over, that wasn't enough for the aws-sdk to pick it up 'magically'. I suggest making sure the environment where you run your function has a profile configured explicitly through the AWS CLI. This resolved the issue for both HTTP and SNS/SQS.

jsantias commented 3 years ago

I use environment variables to pass in the AWS keys, and following the naming convention from the docs solved the problem for me. The SDK will automatically detect and load the environment variables:

Reference: https://docs.aws.amazon.com/sdk-for-javascript/v2/developer-guide/loading-node-credentials-environment.html

Docker Image: node:14.15.4-buster aws-sdk: 2.789.0

DenisBY commented 3 years ago

Still having this issue with 2.876.0. Is there a way to install aws-sdk v3 via npm?

UPDATE: I fixed it by setting task_role_arn in aws_ecs_task_definition.

mzl-md commented 3 years ago

We upgraded from NodeJS 12 to 14 and had a successful run after that. We cannot say whether this is just coincidental or whether it is due to the new NodeJS version.

UPDATE: The problem appeared again, so NodeJS 14 is not the solution. 😞

chiderlin commented 2 years ago
(screenshot of the error)

I run my code fine on my computer, but get this error when I'm using EC2.

Docker Node version: node:14.15.4-buster, aws-sdk: 2.940.0 - still not working for me.


UPDATE: I got it working! I'm using docker-compose, so I tried setting volumes in my docker-compose.yml file, and it works.

        volumes:
            - /home/ubuntu/.aws:/root/.aws

-> The path before the colon is on the host (outside the container), and the path after it is inside the container, so inside the container the SDK will find ~/.aws/credentials. Hope it also works for you.

Zachary-Love commented 2 years ago
(quoting @chiderlin's comment above)

Hey, just wanted to say that your AWS creds are visible in your image. I recommend revoking them :)

Also, I'm having the same issue as you. Only I'm running an EKS cluster on Fargate and am getting this issue with my pods. I don't run into this issue on an EC2 Node Group though.

*** Update

In my case, we were using Terraform to provision everything in AWS. We use Fargate and IRSA to give our containers permission. What ended up being the issue was that when you create an EKS cluster and an Identity Provider, Terraform will not populate the thumbprint list for the identity provider. We ended up having to populate it ourselves with a TLS certificate.

If you create everything through the AWS management console the thumbprint list is populated automatically for you.

So basically if you have the same error as me, check the thumbprint list of the identity provider.

Hope this helps.

jun0tpyrc commented 2 years ago

Seeing this occasionally in some tasks too.

younky-yang commented 2 years ago

I had a similar issue to the one below when running Directus on Fargate:

    { Error: connect EINVAL 169.254.169.254:80 - Local (0.0.0.0:0)
        at internalConnect (net.js:882:16)
        at defaultTriggerAsyncIdScope (internal/async_hooks.js:294:19)
        at defaultTriggerAsyncIdScope (net.js:972:9)
        at process._tickCallback (internal/process/next_tick.js:61:11)
      errno: 'EINVAL',
      code: 'EINVAL',
      syscall: 'connect',
      address: '169.254.169.254',
      port: 80 }

Puneeth-n commented 2 years ago

Hi, we have been facing these issues too for two weeks. It randomly starts when we try to emit SNS events. Instead of looking for credentials at the ECS metadata endpoint, the SDK is looking at the EC2 metadata endpoint, whose role only has permissions to pull Docker images.

aws-sdk@2.967.0

Unhandled promise WrappedPromise [Promise] {
  <rejected> Error [AuthorizationError]: User: arn:aws:sts::1234567890:assumed-role/ct-backend-role-prod/i-08411e43f4641408f is not authorized to perform: SNS:Publish on resource: arn:aws:sns:eu-west-1:1234567890:offer-prod
      at Request.extractError (/home/ct/node_modules/aws-sdk/lib/protocol/query.js:50:29)
      at Request.callListeners (/home/ct/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
      at Request.emit (/home/ct/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
      at Request.emit (/home/ct/node_modules/aws-sdk/lib/request.js:688:14)
      at Request.transition (/home/ct/node_modules/aws-sdk/lib/request.js:22:10)
      at AcceptorStateMachine.runTo (/home/ct/node_modules/aws-sdk/lib/state_machine.js:14:12)
      at /home/ct/node_modules/aws-sdk/lib/state_machine.js:26:10
      at Request.<anonymous> (/home/ct/node_modules/aws-sdk/lib/request.js:38:9)
      at Request.<anonymous> (/home/ct/node_modules/aws-sdk/lib/request.js:690:12)
      at Request.callListeners (/home/ct/node_modules/aws-sdk/lib/sequential_executor.js:116:18)
      at Request.emit (/home/ct/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
      at Request.emit (/home/ct/node_modules/aws-sdk/lib/request.js:688:14)
      at Request.transition (/home/ct/node_modules/aws-sdk/lib/request.js:22:10)
      at AcceptorStateMachine.runTo (/home/ct/node_modules/aws-sdk/lib/state_machine.js:14:12)
      at /home/ct/node_modules/aws-sdk/lib/state_machine.js:26:10
      at Request.<anonymous> (/home/ct/node_modules/aws-sdk/lib/request.js:38:9) {
    code: 'AuthorizationError',
    time: 2021-10-06T14:03:43.286Z,
    requestId: 'd816af47-bd0f-51ea-aa8b-f69013f00c53',
    statusCode: 403,
    retryable: false,
    retryDelay: 96.86854539298675
  },
  __asl_wrapper: [Function (anonymous)]
}

JakubJakubowski8 commented 2 years ago

Did someone resolve this issue? I'm facing the same thing.

mzl-md commented 2 years ago

We figured it's some kind of timeout between client instance creation and usage of SQS. As a workaround we call getQueueAttributes() after creating the SQS instance, and that seems to fix the problem in our case.
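
A sketch of that warm-up pattern, with the queue URL as a placeholder: issue a cheap call right after constructing the client so credential resolution happens before the first real message is sent:

    const AWS = require('aws-sdk');

    async function createWarmedSqsClient(queueUrl) {
      const sqs = new AWS.SQS();
      // Cheap call that forces credential resolution up front.
      await sqs.getQueueAttributes({ QueueUrl: queueUrl, AttributeNames: ['QueueArn'] }).promise();
      return sqs;
    }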

ruchisharma189 commented 2 years ago

Facing the same issue. Could somebody help here?

gwdp commented 2 years ago

@ruchisharma189 In my experience/case, the intermittent issue was caused by very high throughput against the metadata API. The metadata API is used by the SDK to retrieve the execution role credentials at every service initialization, so a few hundred v2 SDK service initializations (e.g. new AWS.S3({...})) can cause this - mainly on Fargate; it happens less frequently on EC2-backed ECS.

Optimizing the SDK service initialization (caching clients) and, later on, migrating to v3 (where credentials are loaded once and then consumed by the services) made the problem disappear from our systems entirely.
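
A minimal sketch of the caching approach described above (the module layout is an assumption): construct each service client once per process and reuse it, instead of calling new AWS.S3({...}) on every request, since each construction can trigger another credentials lookup:

    const AWS = require('aws-sdk');

    let cachedS3;
    function getS3() {
      if (!cachedS3) {
        cachedS3 = new AWS.S3();
      }
      return cachedS3;
    }

    module.exports = { getS3 };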

Hope it helps :)

sheyDev commented 1 month ago

Same here

(node:49) MetadataLookupWarning: received unexpected error = request to http://169.254.169.254/computeMetadata/v1/instance failed, reason: connect EINVAL 169.254.169.254:80 - Local (0.0.0.0:0) code = EINVAL

Running containers on AWS Fargate. What does this mean? Any info is appreciated.