DataDog / datadog-lambda-js

The Datadog AWS Lambda Library for Node
Apache License 2.0
113 stars 35 forks source link

Tracing context extraction from SNS-SQS Events does not work with Raw Message Delivery #579

Open mahwy opened 1 month ago

mahwy commented 1 month ago

Expected Behavior

Tracing context injected to MessageAttributes._datadog.BinaryValue by dd-trace-js should be extracted.

Actual Behavior

TraceContextExtractor.getTraceEventExtractor() is unable to extract tracing context from SNS-SQS Events with Raw Message Delivery

Steps to Reproduce the Problem

  1. Publish a SNS message with raw message delivery enabled
  2. Push the SNS message to a SQS topic
  3. Subscribe the SQS topic in a AWS lambda function

Root Causes

When publishing a SNS message, dd-trace-js injects the tracing context under MessageAttributes._datadog -> BinaryValue

messageAttributes._datadog = {
  DataType: 'Binary',
  BinaryValue: ddInfo
}

Problem 1

Since an event from raw message delivery is missing Type and TopicArn, isSNSSQSEvent returns false and isSQSEvent returns true. So it picks SQSEventTraceExtractor, which only checks stringValue.

Problem 2

Manually setting Type and TopicArn in the event body would force [SNSSQSEventTraceExtractor] (https://github.com/DataDog/datadog-lambda-js/blob/main/src/trace/context/extractors/sns-sqs.ts) to be picked, which supports both String and Binary data type. But the extractor is reading from MessageAttributes._datadog.Value instead of MessageAttributes._datadog.BinaryValue so no context is extracted.

Specifications

astuyve commented 1 month ago

Hey @mahwy! Thanks for the note.

We intentionally don't support Raw Message Delivery because, although we can pass trace context, we rely on the SNS and SQS metadata to provide the timestamps used for when the message was first sent to SNS and again when SNS delivered it to SQS. Without that, we're not able to determine when the message was originally published, only that it showed up to SQS at the time in the SQS wrapper.

Instead we've asked people to disable raw message delivery if they want to trace through both SNS and SQS. We could certainly enable trace context extraction if that's something critical for you, but it won't create a coherent trace the way it does without raw message delivery.

Can you help me understand why raw message delivery is critical to your workflow?

Thanks! AJ

mahwy commented 1 month ago

@astuyve I'm receiving trace context even with raw message delivery. (Screenshot attached)

Image messageAttributes._datadog was injected by dd-trace-js

I think the issue can be fixed by letting the SQSEventTraceExtractor read binaryValue in addition to stringValue

astuyve commented 1 month ago

Hi @mahwy yes trace context is there but that is not enough. We need the timestamps for tracing to fully work, see: https://github.com/DataDog/datadog-lambda-js/blob/main/src/trace/span-inferrer.ts#L295

which isn't present when raw message delivery is enabled.

Can you help me understand why it's required for your use case?

mahwy commented 1 month ago

We have existing AWS Lambda functions subscribing a SQS topic. Events are first published to a SNS topic and the SQS topic is subscribing the SNS topic. These SQS subscriber lambda functions are unable extract the trace contexts from the events creating two separate traces for each request chain. As a workaround, I've added some code to inject messageAttributes._datadog with StringValue when publishing messages to the SNS topic, which fixes the problem.

const params: SNS.Types.PublishInput = {...};
  const activeSpan = tracer.scope()?.active();
  if (activeSpan) {
    const ddInfo = {};
    tracer.inject(activeSpan, 'text_map', ddInfo);
    if (Object.keys(ddInfo).length !== 0) {
      if (!params.MessageAttributes) {
        params.MessageAttributes = {};
      }
      params.MessageAttributes._datadog = {
        DataType: 'String',
        StringValue: JSON.stringify(ddInfo),
      };
    }
  }
 await sns.publish(params)
onyxraven commented 1 month ago

I just ran across this same issue. We have a number of instances where we desire raw message delivery for the ease of use and efficiency (especially if we have messages that near the message size limit), plus the ability to use message filtering.

The timestamps in the message are only the SQS ones, and the SNS ones are left out? is that the main issue? I wonder if theres a way we could 'opt in' to still being able to pull these in?

The side problem is I tried the workaround proposed by @mahwy, but the auto instrumentation injection is overriding it I think, and making it still a binaryValue. Is there a way I can opt out of that conditionally on a call? Or another way to work around this?

mahwy commented 1 month ago

@onyxraven right, it doesn't work with dd-trace auto-instrumentation. I had to exclude aws-sdk from esbuild external config. You might be able to disable aws-sdk instrumentation via DD_TRACE_DISABLED_PLUGINS env var.

astuyve commented 1 month ago

Hi @onyxraven, thanks for the note

I just ran across this same issue. We have a number of instances where we desire raw message delivery for the ease of use and efficiency (especially if we have messages that near the message size limit), plus the ability to use message filtering.

As far as I know the SNS message attributes are fixed length and quite small, and you still can use SNS based message filtering, but I may be mistaken! Please correct me if I am.

We can add support to extract trace context from raw message delivery, but again the spans won't line up timing wise (because the metadata from SNS is missing – we're blind to it), and the connection would be from an SDK call to SNS -> [gap] -> SQS.

If that's something which would work for you all, I can help prioritize this effort.

onyxraven commented 1 month ago

IMO, thats a reasonable thing to opt into if possible? a config flag or env var we can set?

astuyve commented 1 month ago

Yeah I think it's totally reasonable. I think we'll try to make it automatic and then when customers ask why SNS is missing we can point them to this discussion and explain that raw message deliver needs to be disabled to fully trace SNS. But that seems like a fine tradeoff to unblock you folks!