Add a convenient way to begin a new trace in a lambda handler

Grundlefleck commented 3 years ago

I recently encountered a problem where individual traces grew too large and reached their quota. I discovered the trace spanned many services, in particular a Step Function with a Map state that invoked a Lambda. The Map state could have up to several hundred input elements, meaning several hundred lambda invocations, each with their many subsegments, from instrumented aws-sdk clients + captureHTTPs.

I decided it fit our usage of X-Ray that the Step Function's trace included the Lambda invocation, but none of its segments or subsegments. After many failed attempts at tweaking Lambda and Step Function configuration, I found an approach via the Lambda code itself. To begin a new trace, I added code as below, to wrap the Lambda handler:

Node 12.x, aws-xray-sdk-core 3.1.0.

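import { Segment } from "aws-xray-sdk-core";
// Note: TraceID is imported from an internal SDK path, which may change between versions.
import TraceID from "aws-xray-sdk-core/lib/segments/attributes/trace_id";
import * as AWSXRay from "aws-xray-sdk";

// Wraps a Lambda handler so that each invocation starts a new X-Ray trace.
export function startNewXrayTraceMiddleware() {
  return (handler) => {
    return async (event, context) => {
      // Generate a fresh trace ID and overwrite the trace header that Lambda passes in.
      const newTraceId = new TraceID().toString();
      process.env._X_AMZN_TRACE_ID = `Root=${newTraceId};Parent=;Sampled=1`;
      // Create a root segment (no parent) in the new trace and make it the current segment.
      const segment = new Segment("myNewSegmentName", newTraceId, null);
      AWSXRay.setSegment(segment);
      try {
        return await handler(event, context);
      } finally {
        // Close the segment whether the handler succeeds or throws.
        segment.close();
      }
    };
  };
}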

This appears to have the desired effect: the Step Function's trace includes the Lambda and its result, but nothing downstream, and the newly created trace contains segments for all the downstream calls.

Would it be possible to have similar functionality available out-of-the-box? It's not many lines of code, but it feels brittle, and it was very difficult to discover.

This is assuming that there are no nasty side effects or edge-cases I'm unaware of, that would cause the X-Ray/Lambda/SDK teams to recommend avoiding this approach.
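
One brittle part of the snippet above is the import of TraceID from the SDK's internal lib/ path. As a minimal sketch, assuming the documented X-Ray trace ID format (version 1, then 8 hex digits of epoch seconds, then 24 random hex digits), the ID and header string could be built without that import. The helper names newTraceId and newTraceHeader are hypothetical and not part of the SDK, and the header layout simply mirrors the string used in the snippet above.

```typescript
import { randomBytes } from "crypto";

// Hypothetical helper (not an SDK API): builds an ID in the documented format
// "1-<8 hex digits of epoch seconds>-<24 random hex digits>".
function newTraceId(now: Date = new Date()): string {
  const epochSeconds = Math.floor(now.getTime() / 1000);
  const timePart = epochSeconds.toString(16).padStart(8, "0");
  const randomPart = randomBytes(12).toString("hex"); // 12 bytes -> 24 hex chars
  return `1-${timePart}-${randomPart}`;
}

// Hypothetical helper: formats the same header string the workaround writes
// to _X_AMZN_TRACE_ID (empty Parent, sampling forced on).
function newTraceHeader(traceId: string): string {
  return `Root=${traceId};Parent=;Sampled=1`;
}
```

Whether the SDK accepts every ID of this shape in all versions is an assumption; the sketch only removes the dependency on an internal module path.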

srprash commented 3 years ago

Hi @Grundlefleck, we typically do not recommend creating a new segment in the Lambda function, since the Lambda runtime creates one automatically by default when you enable "Active tracing" on Lambda. Creating multiple Lambda segments may produce inconsistent traces, and you may see extra segment nodes on the service map. Since you're creating a new trace id for each lambda invocation, how are these function invocation segments connected to the upstream StepFunction segment? Are you not seeing any broken-traces with this logic? Another thing I want to understand is that if you don't want any subsegments created within the lambda function, is it possible to just not instrument the http and aws-sdk clients?

I am trying to understand the use case here, and I think what would really help is if you could provide an example service graph and trace map for the end-to-end trace that you want to achieve. Then we can discuss how to incorporate it in the X-Ray SDK.

Thanks!

Grundlefleck commented 3 years ago

Hey, thanks for reaching out. I'll include screenshots and snippets from our test environment. It might be a bit noisy; if it needs to be pared back to be understandable, I'll try for an SSCCE.

> Since you're creating a new trace id for each lambda invocation, how are these function invocation segments connected to the upstream StepFunction segment?

The StepFunction's segment contains 1x "Invocation" + 1x "Overhead" segment for every Lambda that is invoked:

Here's an example trace map:

(Apologies that it's a subset; it's hard to get a usable screenshot of the whole map.) It shows all the "single calls" of the StepFunction (which is very useful) but a "black box" for the many Lambda invocations in the Map state (132 in this example), each of which has its own trace ID.

Each Lambda's trace looks something like this:

Each trace has calls to various AWS services, but also an HTTP call to a service in another AWS account belonging to our organization. We propagate the trace headers to that service, which then propagates them in its own calls. This has proved very useful for coordinating on debugging, but it demonstrates how a single trace starting with the StepFunction exceeds its quota. Splitting the trace at the Lambda lets us get an overview of the entire StepFunction and still drill down to inter-account requests, since the shared traces are smaller.

I have not included a Service Map because it's a bit... overwhelming.

> Are you not seeing any broken-traces with this logic?

So far I haven't, but I wouldn't go as far as to say definitively that none exist. Based on sampling a few dozen as I made the change, I didn't find anything unexpected or clearly incorrect.

> Another thing I want to understand is that if you don't want any subsegments created within the lambda function, is it possible to just not instrument the http and aws-sdk clients?

Herein lies the problem. I do want those subsegments, but I can't have them in the same trace because it grows too big. On separate occasions we have found value in having subsegments from both the StepFunction and the Lambda. Our ideal would be a single, usable trace with everything in it, but of course there have to be limits somewhere.

Here's how I saw the trade-offs:

- If I make no change, I have zero usable traces.
- If I turn off instrumentation in the Lambda, I have one usable trace (the StepFunction).
- If I declare a new trace in the Lambda, I have 1 + N usable traces (the StepFunction plus one for each Lambda).

Given an individual trace was too big and I wanted all segments available, it became an exercise in deciding how to split up one big trace into many. Since the Lambda invocation in the Map state is the only part that was unbounded (or at least, bounded by the client's request) it seemed natural to split it up there. The StepFunction represents a single request by the client, that we want to treat as an atomic unit, so I was hesitant to split that up, or rearchitect solely to create usable X-Ray traces.

Although, I see the argument that this is a property of the Map state. It could be looping over any task type, and a direct call to DynamoDB, SQS, etc. wouldn't have the same option of resetting the trace.

I hope this is useful information and clears up some of the mystery. Please let me know if not, I am happy to try again.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs in next 7 days. Thank you for your contributions.

Grundlefleck commented 3 years ago

> I am trying to understand the use case

I found myself reluctantly using the startNewXrayTraceMiddleware workaround again, this time when consuming an SNS topic from another account. The upstream trace's Sampled=0 decision was propagated into my account, and I couldn't override it. The owners of the other account didn't want to change their sampling rate, and I wanted more traces. The only workaround we found, after several pairing sessions with AWS support, was to declare a new trace in my account.
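
To make the Sampled=0 case concrete, here is a minimal sketch of how a wrapper might inspect the incoming trace header before deciding to start a new trace. It assumes the header has the usual "Root=...;Parent=...;Sampled=..." shape; parseTraceHeader and upstreamDeclinedSampling are hypothetical helpers written for illustration, not SDK functions, and the example IDs are placeholders.

```typescript
// Fields of interest in an X-Ray trace header such as
// "Root=1-58406520-a006649127e371903a2de979;Parent=53995c3f42cd8ad8;Sampled=0".
interface TraceHeader {
  root?: string;
  parent?: string;
  sampled?: string;
}

// Hypothetical helper: splits the header on ";" and "=" into its fields.
function parseTraceHeader(header: string): TraceHeader {
  const fields: TraceHeader = {};
  for (const part of header.split(";")) {
    const idx = part.indexOf("=");
    if (idx === -1) continue; // skip malformed parts without a key/value pair
    const key = part.slice(0, idx).trim().toLowerCase();
    const value = part.slice(idx + 1).trim();
    if (key === "root") fields.root = value;
    else if (key === "parent") fields.parent = value;
    else if (key === "sampled") fields.sampled = value;
  }
  return fields;
}

// Hypothetical policy: true only when a header is present and the upstream
// service explicitly decided not to sample.
function upstreamDeclinedSampling(header: string | undefined): boolean {
  return header !== undefined && parseTraceHeader(header).sampled === "0";
}
```

A wrapper could call upstreamDeclinedSampling with the value of process.env._X_AMZN_TRACE_ID and apply the new-trace workaround only when it returns true, leaving sampled upstream traces connected.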

aws / aws-xray-sdk-node

Add a convenient way to begin a new trace in a lambda handler #393