Serverless Distributed Tracing: Trace Extractor/Propagation for batched events

lucashfreitas commented 1 year ago

I am working with a serverless event-driven architecture that uses EventBridge, SQS, and Lambda:

1. A Lambda function (wrapped by the datadog-cdk construct) pushes a message to EventBridge.
2. EventBridge has SQS queues as targets and forwards the messages to them.
3. A Lambda function (wrapped by the datadog-cdk construct) consumes the messages from the SQS queue and sends them to the event bus again, and we move back to step 1.

Our goal is to enable end-to-end traces for this architecture.

1. We wrapped all lambdas (publishers and consumers) with `datadog-cdk` construct but this produced multiple disconnected traces:

Following this documentation https://docs.datadoghq.com/serverless/distributed_tracing/serverless_trace_propagation/?tab=nodejs, I would expect that the trace propagation happens automatically as mentioned here:

Tracing many AWS Managed services (listed here) is supported out-of-the-box and does not require following the steps outlined on this page.

But the traces are not being associated/propagated, and I am seeing multiple disconnected traces. I am not sure whether this happens because EventBridge invokes the Lambda asynchronously, so maybe we really need to "manually" extract the trace context and pass it through the _datadog field in the event bus.
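For readers following along, here is a rough sketch of what passing the context "manually" through a `_datadog` field could look like on the publisher side. The function name `buildTracedEntry` is hypothetical, and the tracer is passed in as a parameter; it is assumed to be the dd-trace tracer, whose `scope().active()` and `inject(span, "text_map", carrier)` calls are used here. Treat it as an illustration under those assumptions, not as the library's documented approach.

```javascript
// Hypothetical helper: build an EventBridge PutEvents entry whose detail
// carries the active trace context under `_datadog`.
// `tracer` is assumed to be the dd-trace tracer (e.g. require("dd-trace")).
function buildTracedEntry(tracer, { source, detailType, detail, eventBusName }) {
  const carrier = {};
  const span = tracer.scope().active(); // null when no span is active
  if (span) {
    // Writes the propagation headers (x-datadog-* keys) into `carrier`.
    tracer.inject(span, "text_map", carrier);
  }
  return {
    Source: source,
    DetailType: detailType,
    EventBusName: eventBusName,
    // `_datadog` is the field name used in this thread; the consumer must
    // read the same key when it extracts the context.
    Detail: JSON.stringify({ ...detail, _datadog: carrier }),
  };
}
```

The returned object would go into the `Entries` array of a PutEvents request. If no span is active, `_datadog` is an empty object and the consumer has nothing to extract.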

2. We have implemented a manual trace extractor propagation following datadog documentation:

Following the docs/tutorial at https://docs.datadoghq.com/serverless/distributed_tracing/serverless_trace_propagation/?tab=nodejs, we implemented manual trace propagation and managed to connect the traces, but we are now facing another issue: how to handle/propagate traces for batched events in Lambda functions.

All the examples/docs for trace extraction, and even the handler wrapper provided by this library, expect a single trace context to be returned per Lambda invocation.

import { datadog } from "datadog-lambda-js";

const lambdaHandler = (event, context) => {
  // my lambda handler
};

export const handler = datadog(lambdaHandler, {
  traceExtractor: (event, context) => {
    // datadog expects a single trace context to be returned here.
  },
});

If we instead export the extractor from a file and set DD_TRACE_EXTRACTOR, we still return a single object.

The issue is that our Lambda function actually handles a batch of events coming from an SQS queue (10+), and each of those events might have a different trace context. We are not sure how to handle this using this library, or whether we should use the dd-trace library directly to create a trace for each event in the batch and send it to Datadog.

Can someone confirm whether this is possible with this library, or whether we really need to use dd-trace to manually create and send the traces to Datadog?

Thanks

astuyve commented 1 year ago

Hi @lucashfreitas - thank you for your detailed note.

For EventBridge, as of today we only support Lambda as a direct target to automatically decode and pass trace context. I think we could explore expanding that to SQS/SNS as well as other services we traditionally support, so please feel free to reach out to your account manager to open a feature request.

What I would suggest is doing what you've already done - which brings us to your second point.

Today, Datadog APM doesn't support merging multiple upstream trace contexts. So you'll need to pick one of the SQS messages and use its context as your upstream trace context for the rest of your function execution.
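A minimal sketch of the "pick one message" approach, under two assumptions that are not confirmed in this thread: each SQS record body is the forwarded EventBridge event as JSON, and the publisher stored the Datadog headers under `detail._datadog`. The function name `extractFirstRecordContext` is hypothetical, and the returned shape follows the custom extractor example in the trace propagation docs linked above, so it should be checked against the library version in use.

```javascript
// Hypothetical extractor: use the first SQS record that carries a trace
// context as the upstream context for the whole invocation.
// Assumes record.body is an EventBridge event whose `detail._datadog`
// holds the Datadog propagation headers (written by your own publisher).
function extractFirstRecordContext(event) {
  for (const record of event.Records ?? []) {
    let headers;
    try {
      headers = JSON.parse(record.body)?.detail?._datadog;
    } catch {
      continue; // body is not JSON, try the next record
    }
    const traceID = headers?.["x-datadog-trace-id"];
    const parentID = headers?.["x-datadog-parent-id"];
    if (!traceID || !parentID) continue; // no usable context in this record

    // Defaulting the priority to 1 when the header is absent is an assumption.
    const sampleMode = parseInt(headers["x-datadog-sampling-priority"] ?? "1", 10);
    return { traceID, parentID, sampleMode, source: "event" };
  }
  return undefined; // no record carried a context
}
```

It could be passed as `traceExtractor: (event) => extractFirstRecordContext(event)` in the `datadog(...)` wrapper options shown in the original post. The contexts of the remaining records are dropped, which is exactly the limitation described above.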

Please feel free to reach out with any additional questions.

Thank you!

lucashfreitas commented 1 year ago

Hey @astuyve, thanks for answering that quickly.

Today, Datadog APM doesn't support merging multiple upstream trace contexts. So you'll need to pick one of the SQS messages and use its context as your upstream trace context for the rest of your function execution.

We are trying to do that, but some things are still not clear, e.g. how to define a custom extractor for multiple events inside a Lambda. Currently, the extractor function has a 1-to-1 relationship with the Lambda handler function, as in the example, but how would we extract traces inside a for loop? For example, if a Lambda handler receives 10 events as its payload, we would need to get the trace context 10 times and send 10 additional traces of the Lambda function execution.

We are opening a ticket with datadog to track this.

Thank you

lucashfreitas commented 3 months ago

@astuyve does datadog-lambda-js have any updates for propagating traces using batched events?

bendrucker commented 2 weeks ago

Span links are the tracing primitive for handling workflows like this. A given span can only have one parent trace context, composed of a trace and span ID.

https://docs.datadoghq.com/tracing/trace_collection/span_links/

If you invoke a Lambda, the natural parent is the Lambda invocation span. The span for processing each message can't both be a child of the causing span (PutEvents) and the Invoke span for the Lambda.

Taking EventBridge out of the equation for simplicity, without batches, you can draw a hierarchy from sqs:SendMessage -> lambda:Invoke -> your event handler. Since this package creates the Lambda span, it needs to provide a hook (extractor) to set its trace parent.

But if you receive batches of n>1, this no longer works. You could create a new span for each message and associate it to the trace context in the message, but then it's no longer associated with the Lambda trace.

My recommendation (as another user) would be:

Create a span for each message you're handling in a loop, let it inherit the Lambda invocation trace context as normal.
For each message, extract the trace context and use it to add a span link.

That way you get your linear execution timeline for the batch job (1) but can still have bidirectional references to and from the causing event that enqueued the message (2).

https://datadoghq.dev/dd-trace-js/interfaces/export_.Span.html#addLink.addLink-1
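The two steps above could be sketched roughly as follows. This is an illustration, not code from this package: `processBatch` and `readCarrier` are hypothetical names, the tracer is passed in and assumed to be the dd-trace tracer, and the `span.addLink(context, attributes)` form is assumed to match the overload linked above (check it against your dd-trace version). Where the carrier lives in the message (`detail._datadog` here) and the attribute key are also assumptions.

```javascript
// `tracer` is assumed to be the dd-trace tracer (e.g. require("dd-trace")).
// Assumed calls: tracer.trace(name, fn), tracer.extract("text_map", carrier),
// span.addLink(context, attributes).
async function processBatch(tracer, records, handleMessage) {
  for (const record of records) {
    // (1) One span per message; it inherits the active Lambda invocation span.
    await tracer.trace("sqs.message.process", async (span) => {
      // (2) Link the span to the context that enqueued the message, if any.
      const carrier = readCarrier(record);
      const upstream = carrier ? tracer.extract("text_map", carrier) : null;
      if (upstream) {
        // The attribute key is illustrative only.
        span.addLink(upstream, { "messaging.message_id": record.messageId });
      }
      await handleMessage(record);
    });
  }
}

// Hypothetical helper: where the carrier lives depends on your publisher.
function readCarrier(record) {
  try {
    return JSON.parse(record.body)?.detail?._datadog ?? null;
  } catch {
    return null; // body is not JSON, so there is nothing to link
  }
}
```

Inside a handler this might be called as `await processBatch(tracer, event.Records, handleOne)`. Messages without a readable context still get their own span, just without a link.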

apiarian-datadog commented 2 weeks ago

Hi folks! Thanks @bendrucker for the additional details! Span Links work great when we can propagate the Trace and Span IDs. The good news is that we are also working on automatic span linking for situations where the context cannot be propagated. Our plan is to provide the UI and tracer API enhancements soon and work to enable it for various use cases, starting with S3 objects, but SQS and Dynamo are definitely on the radar.

DataDog / datadog-lambda-js