aws / aws-xray-sdk-node

The official AWS X-Ray SDK for Node.js.
Apache License 2.0

Having Error: connect ECONNREFUSED #217

Closed kwongkz closed 4 years ago

kwongkz commented 4 years ago

Does anyone have an idea why this happens? I'm using a Serverless Lambda function with X-Ray tracing turned on.

WARN Error: connect ECONNREFUSED 169.254.79.2:2000 at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1107:14)

willarmiros commented 4 years ago

Hi @kwongkz, This doesn't sound like an error associated with X-Ray. Please provide a code snippet to reproduce this error as well as logs or another indication that X-Ray is causing this so I can further assist you.

kwongkz commented 4 years ago

Hi @willarmiros,

I guess I found out the issue & solution already.

I did reference this issue to get the idea #143

So the solution for me was to add the environment variable AWS_NODEJS_CONNECTION_REUSE_ENABLED=1; you can refer to this Lambda optimization tip.
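For context, AWS_NODEJS_CONNECTION_REUSE_ENABLED asks the AWS SDK for JavaScript to reuse HTTP connections via keep-alive. The snippet below is only a rough sketch of that idea using Node's built-in https module. The helper names connectionReuseRequested and buildAgent are hypothetical, the check against the value '1' is taken from the comment above, and none of this is the SDK's actual implementation.

```javascript
const https = require('https');

// Hypothetical helper: treats the value '1' as "enabled", as in the comment above.
function connectionReuseRequested(env = process.env) {
  return env.AWS_NODEJS_CONNECTION_REUSE_ENABLED === '1';
}

// Hypothetical helper: builds an HTTPS agent whose keep-alive setting follows
// the environment variable. A keep-alive agent reuses sockets across requests
// instead of opening a new TCP connection each time.
function buildAgent(env = process.env) {
  return new https.Agent({ keepAlive: connectionReuseRequested(env) });
}
```

Such an agent would then be handed to whichever HTTP client makes the outgoing calls. Note that connection reuse only affects outgoing HTTP(S) requests; it does not change whether a TCP connection to a given address is permitted.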

I hope this can be added to the documentation so it's clearer for other people.

Thanks for the assist.

petermorlion commented 4 years ago

I'm having the same issue.

Searching the internet for that IP address leads me to believe it's the IP address of the X-Ray daemon (see here). It should be a link-local address, because it falls in the 169.254.0.0/16 range. So I think it is somewhere in the AWS network, though I'm no network specialist.
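To make the address check concrete, here is a minimal sketch that splits the host:port pair from the log line and tests whether the host is in the IPv4 link-local range 169.254.0.0/16. The function names splitHostPort and isLinkLocalIPv4 are made up for illustration and are not part of the X-Ray SDK.

```javascript
// Hypothetical helper: splits "host:port" (IPv4 form only) into its parts.
function splitHostPort(address) {
  const idx = address.lastIndexOf(':');
  if (idx === -1) {
    throw new Error('expected an address of the form host:port');
  }
  return { host: address.slice(0, idx), port: Number(address.slice(idx + 1)) };
}

// Hypothetical helper: true when the dotted-quad address is in 169.254.0.0/16.
function isLinkLocalIPv4(ip) {
  const parts = ip.split('.');
  if (parts.length !== 4) return false;
  // Every octet must be 1-3 digits and at most 255.
  if (!parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255)) return false;
  return parts[0] === '169' && parts[1] === '254';
}

// The address reported in the error message at the top of this thread.
const { host, port } = splitHostPort('169.254.79.2:2000');
console.log(host, port, isLinkLocalIPv4(host));
```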

I've tried adding the AWS_NODEJS_CONNECTION_REUSE_ENABLED environment variable and set it to 1. But I'm still getting the issue. I will further investigate and update if I find anything.

awssandra commented 4 years ago

Hi petermorlion,

Is this affecting you on Lambda as well? The Daemon should be automatically configured in Lambda, no need to include it or set it up.

petermorlion commented 4 years ago

Hi awssandra,

Yes, this is on AWS Lambda. I see it with several Lambdas, all of which use the AWS X-Ray Express package. Strangely, I don't seem to have the issue when using only the core package, nor when using AWS X-Ray Express with NestJS (both in Lambdas), though those Lambdas are executed less often. I'll see if I can write a minimal Lambda and run a load test on it.

davidcheal commented 4 years ago

I am also seeing this error in X-Ray traces. [image: xray-error] I am using the Express package on Node 12.x.

petermorlion commented 4 years ago

I've been able to reproduce this on Node 10.x with this piece of code:

const AWSXRay = require('aws-xray-sdk-core');
const xrayExpress = require('aws-xray-sdk-express');
const express = require('express');
const serverlessHttp = require("serverless-http");

module.exports.handler = async function(event, context) {
    const app = getApp();
    const slsHttp = serverlessHttp(app);
    const result = await slsHttp(event, context);
    return result;
}

function getApp() {
    const app = express()

    app.use(xrayExpress.openSegment('PMO-xray-error-test'));

    app.get('/', function (req, res) {
      res.send('Hello World')
    })

    app.use(xrayExpress.closeSegment());

    return app;
}

I just invoked the API Gateway several times from the AWS Console, so this is not a heavy load test, i.e. no concurrent requests. As you can see, some invocations have no issue, but others do: [image]

After that, other requests work fine again. So there's no real pattern I can deduce.

Things I've tried but that didn't make a difference:

awssandra commented 4 years ago

Sorry for the delayed response!

I'm thinking there's a disconnect between the custom Lambda code and the Express middleware. Each has its own expected workflow for the daemon and SDK behavior. We'll take a deep dive into this.

willarmiros commented 4 years ago

Hi @petermorlion, I am investigating this issue with the Lambda team. Please sit tight for any updates!

petermorlion commented 4 years ago

@willarmiros I don't mean to put pressure on you, but I'm curious if there is any progress on this?

willarmiros commented 4 years ago

Hi @petermorlion,

After some further inspection, it appears the root cause is in our service connector here. It comes from a poller that runs in the background to retrieve sampling rules from X-Ray's service back end roughly every 5 minutes. (Speaking of patterns: you should see the error about every 5 minutes if you're consistently making requests uninterrupted by cold starts.) These requests are attempting to communicate directly with the daemon, which is not possible in Lambda environments.

I changed the way we make these requests in #255 to no longer be lazy, and that actually appears to have made the errors appear instantly upon invocation. I'm going to make a PR to disable these requests for now in Lambda environments, since we don't support sampling configuration in Lambda yet.

avin-kavish commented 4 years ago

> These requests are attempting to communicate directly with the daemon, which is not possible in Lambda environments.

@willarmiros Why isn't this possible? Is the X-Ray service on a private network?

Also, when you disable them, how will you be checking for it? This code is essentially express code and express code is not aware that it is being run in a lambda environment.

In the meantime, will I be able to stop the api call with,

AWSXRay.middleware.disableCentralizedSampling();

?

willarmiros commented 4 years ago

Hi @avin-kavish, Sorry, that wasn't entirely correct. It is possible to communicate with the daemon in Lambda environments, but only over UDP connections (see segment_emitter). The problematic requests that we're making use TCP under the hood. I believe that the Lambda service has some tight iptables configurations prohibiting these.
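As a rough way to observe the TCP side of this yourself, the sketch below tries to open a TCP connection with Node's built-in net module and reports the error code instead of throwing. The function name probeTcp and the one-second default timeout are assumptions made for illustration; this is not code from the SDK.

```javascript
const net = require('net');

// Hypothetical helper: resolves with { ok: true } if a TCP connection can be
// opened, or { ok: false, code } (for example 'ECONNREFUSED') if it cannot.
function probeTcp(host, port, timeoutMs = 1000) {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    socket.setTimeout(timeoutMs);
    socket.once('connect', () => {
      socket.destroy();
      resolve({ ok: true });
    });
    socket.once('timeout', () => {
      socket.destroy();
      resolve({ ok: false, code: 'ETIMEDOUT' });
    });
    socket.once('error', (err) => {
      resolve({ ok: false, code: err.code });
    });
  });
}

// Example: probe the address from the error message at the top of this thread.
// probeTcp('169.254.79.2', 2000).then((result) => console.log(result));
```

UDP is connectionless, so a datagram send (as the segment emitter does) has no handshake that can be refused in this way, which is consistent with only the TCP-based requests surfacing ECONNREFUSED.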

I will check for Lambda environments using the LAMBDA_TASK_ROOT environment variable, which is how we make the check elsewhere in the SDK. The disableCentralizedSampling() call should prevent these errors; that's a great callout. However, to minimize the burden on other customers, I'll still just disable sampling by default in Lambda until we get better support for it.
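Until a release with that fix is available, the workaround discussed here could look roughly like the following. It is a sketch only: the helper names isLambdaEnvironment and disableSamplingInLambda are hypothetical, and the xray argument is assumed to be the object returned by require('aws-xray-sdk-core'), which exposes the middleware.disableCentralizedSampling() call mentioned above.

```javascript
// Hypothetical helper: detects Lambda via the LAMBDA_TASK_ROOT variable,
// the same signal described in the comment above.
function isLambdaEnvironment(env = process.env) {
  return typeof env.LAMBDA_TASK_ROOT === 'string' && env.LAMBDA_TASK_ROOT.length > 0;
}

// Hypothetical helper: turns off centralized sampling only when running in
// Lambda. Returns true if sampling was disabled, false otherwise.
function disableSamplingInLambda(xray, env = process.env) {
  if (!isLambdaEnvironment(env)) {
    return false;
  }
  xray.middleware.disableCentralizedSampling();
  return true;
}

// Usage (assumes aws-xray-sdk-core is installed):
// const AWSXRay = require('aws-xray-sdk-core');
// disableSamplingInLambda(AWSXRay);
```

Calling this once at module load, before the Express middleware is registered, keeps the check out of the request path.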

willarmiros commented 4 years ago

This fix was released in v3.0.0-alpha.2.