aws / aws-xray-sdk-node

The official AWS X-Ray SDK for Node.js.
Apache License 2.0
271 stars 155 forks source link

X-Ray segment longer than lambda function execution time #468

Open RealityCtrl opened 3 years ago

RealityCtrl commented 3 years ago

We have a serverless typescript lambda function which makes API calls to other APIs. We are using the aws-xray-sdk to instrument downstream API calls that are made with axios. Currently using version 3.3.3 of the SDK.

We are seeing the execution time of those downstream API calls as substantially higher that the actual lambda function execution time and the downstream APIs does not see a lambda execution time corresponding to this execution time.

In the example below the API call is listed as 1.06 seconds execution time but the function only executed for 275ms.

What could be causing this discrepancy?

Screenshot 2021-10-11 at 09 47 56

Cloudwatch log message also show 269 ms execution time.

REPORT RequestId: 1bd3866d-a4ff-4d12-9e42-8fddc98853a4  Duration: 268.37 ms Billed Duration: 269 ms Memory Size: 512 MB Max Memory Used: 138 MB 
XRAY TraceId: 1-6163f603-7b02f3fd233c022a431a11d1   SegmentId: 6c8548f51f835b8b Sampled: true   

Code snippets for the implementation of tracing.

import express from 'express';
import AWSXRay from 'aws-xray-sdk';
import { AxiosRequestConfig } from 'axios';

import Redis from '@libs/cache/cache';
import InstrumentedCache from '@libs/cache/instrumentedCache';
import serverlessExpress from '@vendia/serverless-express';

const http = require('http');
const https = require('https');

AWSXRay.captureHTTPsGlobal(http, false);
AWSXRay.captureHTTPsGlobal(https, false);
AWSXRay.capturePromise();

const axiosConfig: AxiosRequestConfig = {
  httpAgent: new http.Agent(),
  httpsAgent: new https.Agent(),
};
app.use(AWSXRay.express.openSegment('mc-moderation-api'));
app.use(connectToCache(instrumentedRedis));
app.use(disconnectFromCache(instrumentedRedis));
app.use('/', router);
app.disable('x-powered-by');
app.use(AWSXRay.express.closeSegment());
app.use(unknownRouteHandler);
app.use(unexpectedErrorHandler);
srprash commented 3 years ago

Hi @RealityCtrl I'm not very familiar with axios but it seems like it's an async HTPP client. I'm guessing what's happening here is that your async request is taking probably taking 1.06s to complete but in the meantime the execution moves forward and the lambda segment is closed before that async request completes and the corresponding subsegment is emitted later on. But I'm also confused by the fact that even if the HTTP request is taking that long, once the lambda finishes execution in 269ms, this subsegment shouldn't be emitted after the function terminates.

I'll let @willarmiros take a stab at this one since he may have more idea.

RealityCtrl commented 3 years ago

You are correct the axios.get is asynchronous, I can't share the code but we do await the result of the API call. For these calls the API call is completing to our knowledge and the downstream system does not have a response time corresponding to the longer time. We also see the same thing with using captureAsyncFunc for instrumenting a connection to redis elasticache and see times on that longer than the configured max function execution time of 15s.

e.g. this one apparently took 17 seconds to disconnect from elasticache but the function completed in 612ms. Screenshot 2021-10-12 at 09 56 12 e.g. this one apparently took 22 seconds to disconnect from elasticache but the function completed in 887ms. Screenshot 2021-10-12 at 09 48 31 Screenshot 2021-10-12 at 09 48 40

willarmiros commented 3 years ago

Hi @RealityCtrl,

The only explanation I can think of is that somehow the Lambda function is completing its invocation and being frozen before the subsegments are closed, causing them to be closed during the next invocation and reflect a time longer than the function invocation. It will be very difficult to investigate further without code reproducing the issue, and I know these production Lambda issues tend to be very difficult to reproduce.

I would also recommend trying out the AWS Distro for OpenTelemetry JavaScript Lambda layer, it can auto-instrument your downstream requests & send trace data to X-Ray without modifying your code. Maybe the issue won't appear with that.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs in next 7 days. Thank you for your contributions.

Ecno92 commented 6 months ago

I can report that I am experiencing the same issues in on of our environments. Is there any plan to look further into this issue? We are using aws-xray-skd 3.6.0.

I will check out if we can change to OpenTelemetry, but I still would expect that an issue like this deserves a fix in a library like this.