aws / aws-xray-sdk-node

The official AWS X-Ray SDK for Node.js.
Apache License 2.0
271 stars 155 forks source link

XRAY does not work when an app initialises AWS clients that are already initialised in imported libraries #315

Open xtheshumanator opened 4 years ago

xtheshumanator commented 4 years ago

Hello,

I'm seeing a strange issue with XRAY. Apologies if the title isn't clear. I'll describe the issue below:

Setup

import AWSXRay from 'aws-xray-sdk-core';
import AWSNoXray from 'aws-sdk';

const AWS = AWSXRay.captureAWS(AWSNoXray);
const ssmClient = new AWS.SSM();
//do ssm lookup using client.

The libraries and framework are bundled using webpack

Problem

I see XRAY traces lambda firstFunction but not for secondFunction.

Workaround that works

The current workaround is not ideal but here it is:

  1. Expose all exported functions from all libraries in the framework. The app then imports these functions from the framework. So in the app, I go from:

    import {someFn} from 'my-library'

    ...to:

    import {someFn} from 'my-framework'

    Basically, any fn in these libraries that uses AWS clients should be imported via the framework and not directly from the library.

  2. Any AWS client that the app requires should also be imported from the framework. So I had to create another library called my-aws-clients which basically has this:

    import AWSXRay from 'aws-xray-sdk-core';
    import AWSNoXray from 'aws-sdk';
    const AWS = AWSXRay.captureAWS(AWSNoXray);
    export const dynamoClient = new AWS.DynamoDB.DocumentClient({
    service: new AWS.DynamoDB()
    });
    export const snsClient = new AWS.SNS();
    export const smsClient = new AWS.SSM();

    Then I exposed these in the framework so my app can import these AWS clients from the framework.

  3. All of the AWS clients in the libraries were updated to use the clients from my-aws-clients.

Why is the workaround not ideal

I have libraries that do tasks which shouldn't be imported/exposed via the framework; basically it's not their place. For one particular library, I had to copy the entire code into the app so that the AWS interactions of this library happen via the AWS clients that are coming from the framework. This particular library is meant to be used by others apps/teams that have no relation to the framework so I cannot have the dependency on the framework.

What I tried that did not work

  1. I tried to get the libs, framework and app to import from my-aws-clients. But it looks like if I do that, XRAY is re-initialised and the rootId becomes null.
  2. I created a lambda handler function for secondFunction in the app to replace the existing one from the framework. In this function, I did a simple Dynamo query (so no event emitters, etc) to see if there is any problem with event emitters. But this also did not work.

Pls let me know if you require more info.

Thank you!

willarmiros commented 4 years ago

Hi @theshumanator, Sorry for the delay in my response. So it sounds like the captureAWS API is not working when you import the AWS SDK, because some other libraries you depend on also use the AWS SDK and are importing it first. My questions/suggestions are:

  1. Import the X-Ray SDK then call captureAWS(require('aws-sdk')); as the first lines of your Lambda. You should do this instrumentation before importing any other libraries.
  2. Use ES5 syntax for importing rather than ES6. We intercept the module when it is loaded into the package cache, and there is different behavior with the import syntax. So do the capturing as described in the official docs.
  3. If this doesn't work, could you verify you've enabled Active Tracing Mode in the secondFunction, and if so, enable the AWS_XRAY_DEBUG_MODE environment variable and paste in your logs from secondFunction?
xtheshumanator commented 4 years ago

Hi @willarmiros - thanks for your reply. Re your questions/suggestions:

  1. I will try this. I didn't realise the position of the imports will make a diff.
  2. OK
  3. Yep it's enabled. When I had added that debug environment variable, I recall seeing rootId being set to null. I don't have the exact details.

I'm going to try your suggestions and let you know how it goes.

xtheshumanator commented 4 years ago

Hi @willarmiros - unfortunately the suggested changes did not work. I'm including the logs from enabling AWS_XRAY_DEBUG_MODE on the secondFunction.

Thanks.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   timestamp   |                                                                                   message                                                                                    |
|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1594897864843 | START RequestId: de091992-ead2-4784-99ea-afb0f26f1b27 Version: $LATEST                                                                                                       |
| 1594897864844 | 2020-07-16T11:11:04.843Z undefined DEBUG Starting the AWS X-Ray SDK in automatic mode (default).                                                                             |
| 1594897864846 | 2020-07-16T11:11:04.843Z undefined DEBUG AWS_XRAY_CONTEXT_MISSING is set. Configured context missing strategy to LOG_ERROR.                                                  |
| 1594897864846 | 2020-07-16T11:11:04.846Z undefined DEBUG Subsegment streaming threshold set to: 0                                                                                            |
| 1594897865450 | 2020-07-16T11:11:04.846Z undefined INFO Disabling centralized sampling in Lambda environment.                                                                                |
| 1594897865450 | 2020-07-16T11:11:05.439Z undefined DEBUG Starting the AWS X-Ray SDK in automatic mode (default).                                                                             |
| 1594897865452 | 2020-07-16T11:11:05.450Z undefined DEBUG AWS_XRAY_CONTEXT_MISSING is set. Configured context missing strategy to LOG_ERROR.                                                  |
| 1594897865452 | 2020-07-16T11:11:05.452Z undefined DEBUG Subsegment streaming threshold set to: 0                                                                                            |
| 1594897865516 | 2020-07-16T11:11:05.452Z undefined INFO Disabling centralized sampling in Lambda environment.                                                                                |
| 1594897865516 | 2020-07-16T11:11:05.516Z undefined DEBUG Starting the AWS X-Ray SDK in automatic mode (default).                                                                             |
| 1594897865518 | 2020-07-16T11:11:05.516Z undefined DEBUG AWS_XRAY_CONTEXT_MISSING is set. Configured context missing strategy to LOG_ERROR.                                                  |
| 1594897865518 | 2020-07-16T11:11:05.517Z undefined DEBUG Subsegment streaming threshold set to: 0                                                                                            |
| 1594897865563 | 2020-07-16T11:11:05.518Z undefined INFO Disabling centralized sampling in Lambda environment.                                                                                |
| 1594897865563 | 2020-07-16T11:11:05.563Z undefined DEBUG Starting the AWS X-Ray SDK in automatic mode (default).                                                                             |
| 1594897865564 | 2020-07-16T11:11:05.563Z undefined DEBUG AWS_XRAY_CONTEXT_MISSING is set. Configured context missing strategy to LOG_ERROR.                                                  |
| 1594897865564 | 2020-07-16T11:11:05.564Z undefined DEBUG Subsegment streaming threshold set to: 0                                                                                            |
| 1594897865699 | 2020-07-16T11:11:05.564Z undefined INFO Disabling centralized sampling in Lambda environment.                                                                                |
| 1594897865699 | 2020-07-16T11:11:05.699Z undefined DEBUG Starting the AWS X-Ray SDK in automatic mode (default).                                                                             |
| 1594897865701 | 2020-07-16T11:11:05.699Z undefined DEBUG AWS_XRAY_CONTEXT_MISSING is set. Configured context missing strategy to LOG_ERROR.                                                  |
| 1594897865701 | 2020-07-16T11:11:05.700Z undefined DEBUG Subsegment streaming threshold set to: 0                                                                                            |
| 1594897865701 | 2020-07-16T11:11:05.700Z undefined INFO Disabling centralized sampling in Lambda environment.                                                                                |
| 1594897873010 | 2020-07-16T11:11:13.009Z de091992-ead2-4784-99ea-afb0f26f1b27 DEBUG Ignoring flush on subsegment 950f0a340b322068. Associated segment is missing a trace ID.                 |
| 1594897873011 | 2020-07-16T11:11:13.011Z de091992-ead2-4784-99ea-afb0f26f1b27 INFO {}                                                                                                        |
| 1594897873012 | END RequestId: de091992-ead2-4784-99ea-afb0f26f1b27                                                                                                                          |
| 1594897873013 | REPORT RequestId: de091992-ead2-4784-99ea-afb0f26f1b27 Duration: 7263.85 ms Billed Duration: 7300 ms Memory Size: 512 MB Max Memory Used: 492 MB Init Duration: 1125.06 ms   |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
willarmiros commented 4 years ago

Hi @theshumanator, Have you checked the "Enable Active Tracing" box in your Lambda function configuration? Enabling that is required for the X-Ray SDK to work in Lambda environments.

xtheshumanator commented 4 years ago

Hi @willarmiros - yep it is enabled for all lambdas.

willarmiros commented 4 years ago

Interesting, normally as the last line of logging output for X-Ray enabled Lambdas (after the REPORT line) you'll see something like:

XRAY TraceId: 1-5f11c90f-dcfc745b11de5cc91b061357   SegmentId: 78409cda57500928 Sampled: true

Can you confirm you're seeing this? I'm particularly interested by the Associated segment is missing a trace ID. message.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs in next 7 days. Thank you for your contributions.

willarmiros commented 3 years ago

Reopening as this issue is still being experienced.

frosas commented 3 years ago

We're experiencing a similar issue, if not the same, in the version 2.5 of this SDK.

According to both @theshumanator logs posted above and ours, the SDK is initialised multiple times (note that "Starting the AWS X-Ray SDK in automatic mode"), which causes previous internal states to be lost. In both cases, this seems to occur when multiple libraries that use this SDK are used simultaneously and the dependencies to this SDK can't be flattened (eg, because their versions don't match, or because they are symlinked via Lerna).

I'm not familiar with the implementation details but the cause may be related to a couple of details:

Not sure how feasible it would be to support this use case. The alternative for us would be to expose this SDK as a peer dependency across all of its upstream dependents to ensure a single copy is ever loaded, but that's far from ideal.

EDIT: Fixed possible causes

willarmiros commented 3 years ago

Multiple copies of this library use the same namespace name. I have the impression this still would be an issue after updating to version 3.

@frosas I don't think that several different versions of the X-Ray SDK would cause this, primarily because there are only 4 versions where "Disabling Centralized Sampling in Lambda" is logged and it's printed out 5 times, so I don't think each unique version could cause it. Several libraries depending on the SDK and them not getting flattened sounds more likely.

My follow-up questions are does this repeated initialization logging happen on every invocation, or just on cold starts? And what impact are you observing, are you also not seeing Lambda functions traced? The initialization should be idempotent, but of course there would be problems if it overwrote custom settings of yours.

frosas commented 3 years ago

Multiple copies of this library use the same namespace name. I have the impression this still would be an issue after updating to version 3.

@frosas I don't think that several different versions of the X-Ray SDK would cause this, primarily because there are only 4 versions where "Disabling Centralized Sampling in Lambda" is logged and it's printed out 5 times, so I don't think each unique version could cause it. Several libraries depending on the SDK and them not getting flattened sounds more likely.

Yes. That's why I was referring to multiple "copies", not versions 🙂

My follow-up questions are does this repeated initialization logging happen on every invocation, or just on cold starts? And what impact are you observing, are you also not seeing Lambda functions traced? The initialization should be idempotent, but of course there would be problems if it overwrote custom settings of yours.

As far I remember, the repeated initialization happened only on cold starts.

The impact we observed was segments not being recorded under the correct trace, or not being recorded at all. Our logs also showed the X-Ray request headers not being correctly formed, eg {"name": "X-Amzn-Trace-Id","value": "Root=null;Parent=21f2ee0de8052f62;Sampled=0"} (given our current configuration and traffic, all the requests should be sampled)

If it helps, I can see at least a couple of places where the initialization code is not idempotent:

// packages/core/lib/context_utils.js

cls.createNamespace(NAMESPACE);
logger.getLogger().debug('Starting the AWS X-Ray SDK in automatic mode (default).');

if (process.env.AWS_XRAY_CONTEXT_MISSING) {
  contextUtils.setContextMissingStrategy(process.env.AWS_XRAY_CONTEXT_MISSING);
  contextOverride = true;
} else {
  contextUtils.contextMissingStrategy.contextMissing = contextUtils.CONTEXT_MISSING_STRATEGY.RUNTIME_ERROR.contextMissing;
  logger.getLogger().debug('Using default context missing strategy: RUNTIME_ERROR');
}
// packages/core/lib/aws-xray.js

(function() {
  var data = {
    runtime: (process.release && process.release.name) ? process.release.name : UNKNOWN,
    runtime_version: process.version,
    version: process.env.npm_package_version || UNKNOWN,
    name: process.env.npm_package_name || UNKNOWN
  };

  var sdkData = {
    sdk: 'X-Ray for Node.js',
    sdk_version: (module.exports && module.exports.version) ? module.exports.version : UNKNOWN,
    package: (module.exports && module.exports.name) ? module.exports.name : UNKNOWN,
  };

  segmentUtils.setSDKData(sdkData);
  segmentUtils.setServiceData(data);

  if (process.env.LAMBDA_TASK_ROOT)
    LambdaEnv.init();
})();

Btw, we just rewrote the X-Ray integration so that the library is loaded just once (and then its single instance is injected where needed), and the issue went away.

willarmiros commented 3 years ago

@frosas thank you for the info! I think we can do a couple changes to make the initialization idempotent, namely:

  1. Checking for an existing CLS namespace before initializing the context, so that it's initialized at most once.
  2. Similarly, checking if the Lambda facade segment is already initialized, so we only do the Lambda-specific initialization once.

I have not yet been able to reliably reproduce this issue, but given that your issue was resolved when the library was loaded only once, I think the above two items should fix whatever the underlying bug re-initialization was causing. If you'd like to take a stab at a PR for I'd be happy to review, otherwise I'll add it to our backlog.

frosas commented 3 years ago

We considered that exact same approach, where we would use the SDK via a wrapper including the logic you detailed.

We ended up dismissing that idea, though. By permitting multiple copies of the SDK, there was the possibility of ending up with multiple versions, which would confuse us as it wouldn't be clear which actual version is running.

Other SDK users would find that behavior acceptable though.

Alternatively, you could warn when the SDK init logic detects the presence of the SDK already being initialized. That way, at least user would be able to troubleshoot the issue more easily.