aws / aws-xray-sdk-node

The official AWS X-Ray SDK for Node.js.
Apache License 2.0
271 stars 155 forks source link

Node version of aws-xray-sdk-node loses trace when asynchronously invoking another lambda function #477

Closed blimmer closed 2 years ago

blimmer commented 2 years ago

Background

Consider a fairly simple application containing:

In both lambdas, the captureAWS method from aws-xray-sdk is used as recommended by this documentation.

Observed Behavior

The triggering lambda was run from the console.

As you can see, the last segment is a 202 HTTP status code, meaning that the async invocation of the test lambda succeeded.

However, the triggered lambda produced no traces at all:

Even though it was definitely invoked:

Expected Behavior

Because these two lambdas are triggered by the same "logical request", I expected the X-Ray tracing segment generated by the triggering lambda to be passed as metadata to the async invocation of the triggered lambda.

At the very least, even if the traces were not linked, I expected a trace to be produced by the triggered lambda.

Python Behavior

I'm pretty confident that this is an issue with the Node version of the aws-xray-sdk package because everything works as expected when using the Python version of aws-xray-sdk.

You'll see that the X-Ray trace includes the asynchronous invocation in the same trace, which is the expected behavior for Node, as well.

Reproduction Repository

I created a sample aws-cdk repository to demonstrate this issue. Please reference the README on https://github.com/blimmer/xray-async-invocation to deploy the problematic Node lambdas.

This repo also has a branch to deploy Python versions of these lambdas (triggering.py & triggered.py) to show it working properly when using Python: https://github.com/blimmer/xray-async-invocation/tree/use-python . Follow this section of the README to deploy the Python versions of the lambdas.

Other Things I've Tried

I tried some of the various other capture methods (capturePromises & captureHTTPsGlobal(require("http"))) with the Node SDK and they didn't fix the problem.

jcarver989 commented 2 years ago

If you run @blimmer's example code and add the following to both Lambda functions:

console.log(`Trace ID: ${process.env["_X_AMZN_TRACE_ID"]}`)

Then we can see that:

  1. The first Lambda (invoked via "test" button in console) has a "Sampled=1" tracing id, e.g. Root=1-61a7045c-079046695b7fe0fb189338c4;Parent=6464c3a27a7a348e;Sampled=1
  2. The second Lambda (invoked asynchronously by the 1st Lambda) has a "Sampled=0" tracing id, e.g.: Root=1-61a7045e-fbbd717121ef450f63b65376;Parent=762ce5a82bd1b4de;Sampled=0

The Sampled=0 the 2nd Lambda's _X_AMZN_TRACE_ID appears to be the culprit as afaiu that effectively disables x-ray tracing. Furthermore, it seems that for a given call chain the trace id that shows up in the second Lambda (async invoke) does not share the same root segment id (Root=xxx) as the 1st Lambda.

jcarver989 commented 2 years ago

Confirmed the root cause. If you:

  1. Invoke a Lambda via the console that has Sampled=1 in its trace id + instrumented downstream calls with captureAWS.
  2. Have that Lambda asynchronously invoke another Lambda

Then when the async Lambda runs it gets a trace id with Sampled=0 which disables pushing tracing data to X-Ray. Here's the output of the async Lambda's logs with AWS_XRAY_DEBUG_MODE set to true.

2021-12-01T15:24:04.855Z    undefined   DEBUG   Starting the AWS X-Ray SDK in automatic mode (default).
2021-12-01T15:24:04.856Z    undefined   DEBUG   AWS_XRAY_CONTEXT_MISSING is set. Configured context missing strategy to LOG_ERROR.
2021-12-01T15:24:04.862Z    undefined   DEBUG   Subsegment streaming threshold set to: 0
2021-12-01T15:24:04.862Z    undefined   INFO    Disabling centralized sampling in Lambda environment.
START RequestId: 50c21b8c-a136-42cb-8c7b-a7303091374d Version: $LATEST
2021-12-01T15:24:05.137Z    50c21b8c-a136-42cb-8c7b-a7303091374d    INFO    This is the triggered lambda
2021-12-01T15:24:05.148Z    50c21b8c-a136-42cb-8c7b-a7303091374d    INFO    {     "invokingFunctionRequestId": "a554f4f3-fa0e-4f97-b467-80de44fe8cb5" }
2021-12-01T15:24:05.148Z    50c21b8c-a136-42cb-8c7b-a7303091374d    INFO    Trace ID: Root=1-61a79394-d838411846ec1c5f5a43937c;Parent=27ef63f91ef7bfef;Sampled=0
2021-12-01T15:24:05.148Z    50c21b8c-a136-42cb-8c7b-a7303091374d    INFO    Listing buckets to generate a segment...
2021-12-01T15:24:05.291Z    50c21b8c-a136-42cb-8c7b-a7303091374d    DEBUG   Lambda trace data found: Root=1-61a79394-d838411846ec1c5f5a43937c;Parent=27ef63f91ef7bfef;Sampled=0
2021-12-01T15:24:05.329Z    50c21b8c-a136-42cb-8c7b-a7303091374d    DEBUG   Segment started: {     "root": "1-61a79394-d838411846ec1c5f5a43937c",     "parent": "27ef63f91ef7bfef",     "sampled": "0" }
2021-12-01T15:24:06.090Z    50c21b8c-a136-42cb-8c7b-a7303091374d    DEBUG   Ignoring flush on subsegment fa120c0150f16b3d. Associated segment is marked as not sampled.
blimmer commented 2 years ago

Huh, interesting. Is the aws-xray-sdk for Node somehow setting Sampled=0? I wonder why this is different in Python. If you've got the time, you might add the AWS_XRAY_DEBUG_MODE to the use-python branch of the example repository to check it out. I'm happy to merge PRs to my sample repo to make debugging easier for others, as well @jcarver989 😄

willarmiros commented 2 years ago

Hi @blimmer and @jcarver989,

Thank you for this investigation, this is very strange indeed! The SDK's code for running in lambda mode doesn't ever set the SAMPLED state, so I suspect this is a result of some different behavior in the Node.js Lambda runtime (which would also explain why this isn't seen in Python). I will reach out the Lambda team and keep you posted.

MrNilsdotter commented 2 years ago

Hi.

For some additional information; I am in the same place as you @blimmer, and was interested in you example. Was able to replicate your issue, and made an attempt toupgrade to aws-sdk v3.

This solved the issue, and generates the same trace as Python.

The new setups are:

import { Lambda } from "@aws-sdk/client-lambda";
import { S3 } from "@aws-sdk/client-s3";
import { captureAWSv3Client } from "aws-xray-sdk";

const lambda = new Lambda({});
const s3 = new S3({});
captureAWSv3Client(lambda);
captureAWSv3Client(s3);

const handler = async (event:any, context:any) => {
  console.info("This is the triggering lambda");
  console.info(JSON.stringify(event));

  console.info("Listing buckets to generate a segment...");
  const { Buckets } = await s3.listBuckets({});
  console.info(`Found ${Buckets?.length || 0} S3 buckets.`);

  console.info("Async invoking the lambda...");
  await lambda
    .invoke({
      FunctionName: process.env.TRIGGERED_FUNCTION_NAME!,
      Payload: Buffer.from(JSON.stringify({
        invokingFunctionRequestId: context.awsRequestId,
      })),
      InvocationType: 'Event' 
    });
  console.info("Invoked lambda. Ending!");
};

export { handler };

and

import { captureAWSv3Client } from "aws-xray-sdk";
import { S3 } from "@aws-sdk/client-s3";

const s3 = new S3({});
captureAWSv3Client(s3);

const handler = async (event:any) => {
  console.info("This is the triggered lambda");
  console.info(JSON.stringify(event));

  console.info("Listing buckets to generate a segment...");
  const { Buckets } = await s3.listBuckets({});
  console.info(`Found ${Buckets?.length || 0} S3 buckets.`);
};

export { handler };

In package.json

"@aws-sdk/client-lambda": "^3.76.0",
"@aws-sdk/client-s3": "^3.76.0",

while uninstalling aws-sdk

The conversion to v3 might be a rough and out of scope, but hope this helpst with identifying the issue.

willarmiros commented 2 years ago

Hello,

After investigating with the Lambda team, it turns out this issue is the result of using the InvokeAsync API for invoking the Lambda function. Using InvokeAsync is deprecated in favor of Invoke with InvocationType set to 'Event', the deprecated API does not guarantee the same behavior as using the recommended API: https://docs.aws.amazon.com/lambda/latest/dg/API_InvokeAsync.html

We've confirmed this issue is fixed when using the Invoke API.