Allow distributed tracing for non-API Gateway trigger type

lageorgem commented 1 year ago

Is your feature request related to a problem? Please describe. Recently, behaviour was added to support span links whenever the traceparent was present in the SQS message attribute. This feature is nice, however in my use case, I have multiple microservices communicating between each other via SQS.

The AWS API Gateway trigger type can receive the traceparent and tracestate via HTTP headers injected by the Agent on the upstream service. This allows services communicating via HTTP calls to maintain a clear linear execution between microservices while also keeping all logs in one place.

My use case is that I do something similar, but the inter-microservice communication is done via SQS, as it provides a cheaper alternative to API Gateway for internal communication, while keeping it unexposed to the internet.

Describe the solution you'd like I believe an option that allows the traceparent and tracestate MessageAttributes to be used when the agent starts the transaction via the Lambda instrumentation would be useful.

The benefits would include having the entire chained execution in one APM trace, so that I can view all logs in one place and not have to navigate through linked spans, as well as proper Service Maps built for multiple microservices that are part of the same execution.

Describe alternatives you've considered Previously the workaround for this was to have a function before the lambda instrumentation that adds the traceparent from MessageAttributes into a headers object. This behaviour is no longer possible since changes to the agent have decoupled transaction set-up specific logic from the different trigger types. Attempting to fake an API gateway trigger will result in bad transaction information.

trentm commented 1 year ago

Hi, @lageorgem.

Previously the workaround for this was ...

This was back when one had to call apm.lambda(...) manually? Or something more recent?

Recently, behaviour was added to support span links whenever the traceparent was present in the SQS message attribute. This feature is nice, however...

Understood. Ideally navigating traces when there are span links in play would be as featureful and natural as when a distributed trace only involves direct single-trace single-parent relationships. But that isn't the case in our UI right now.

I believe an option that allows the traceparent and tracestate MessageAttributes to be used when the agent starts the transaction via the Lambda instrumentation would be useful.

We (the APM agent team at Elastic) did talk about doing this somewhat recently -- special casing the case when the Lambda function event object holds a single SQS (or SNS) message. We ultimately decided not to special-case that because it would mean different tracing behaviour based on timing. If many SQS events were enqueued rapidly enough, then dispatched Lambda invocations could/would have multiple events. Should a traceparent only be used for the first SQS event?

https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html

By default, Lambda polls up to 10 messages in your queue at once and sends that batch to your function.

Do you configure your SQS Lambda triggers to only have a maximum "batch size" of one?

lageorgem commented 1 year ago

Hello, @trentm

This was back when one had to call apm.lambda(...) manually? Or something more recent?

This was done in v3.35.0, where the possibility of using MessageAttributes trace data at transaction start was removed.

Do you configure your SQS Lambda triggers to only have a maximum "batch size" of one?

Yes, in my use case, there are 4 lambdas executing in a pipeline-like manner. One message per execution, an execution ends by sending an SQS message for the next one. This is why it's important to have linearity in the traces.

I'd also like to have the service maps generated correctly, and this prevents me from doing so, since source maps do not understand one service is triggered by the SQS message from another

trentm commented 1 year ago

For this to work for you the way you would like would require:

Us re-adding Lambda instrumentation functionality to specially handle events that have a single SQS record. If adding this, we'd make it opt-in. I'll discuss this with the team.
Me finishing https://github.com/elastic/apm-agent-nodejs/pull/3044 (currently in progress) to get trace-context propagated in the outgoing SQS messages.
Also possibly https://github.com/elastic/apm-agent-nodejs/issues/2957 if you are using version 3 of the AWS client (i.e. @aws-sdk/client-sqs.

lageorgem commented 1 year ago

Understood. I am wondering whether it is possible to replace the agent.startTransaction function with a middleman function that calls the original one with the traceparent and tracestate, as previously.

It's worth noting I am currently using this hacky piece of code to append trace-context in SQS MessageAttributes manually. This seems to enable span links with SQS.

export const appendTraces = <T>(
    obj: T & Partial<WithTraces>
): obj is T & WithTraces => {
    obj.traceparent = typedAPM.currentTransaction?.traceparent
    obj.tracestate =
        typedAPM.currentTransaction?._context?.tracestate?.toString()

    return true
}

From what I see in the lambda instrumentation, the function elasticApmAwsLambda receives the agent as parameter, so I am not entirely certain I actually have access to that object anymore since using the Agent Lambda layer. Can you suggest any hack for this?

trentm commented 1 year ago

I am wondering whether it is possible to replace the agent.startTransaction function with a middleman function that calls the original one with the traceparent and tracestate, as previously.

Heh, you want to monkey-patch/instrument the instrumentor. :) I took a quick stab at this. Because agent.startTransaction(...) is executed before your instrumented Lambda handler function, there isn't a way to pass in the current event value with the SQS message attributes to the monkey-patched startTransaction().

However, another -- also hacky -- way is to change the trace-context information on the Transaction object after it is created. This is generally unsupported, but I think may work.

const apm = require('elastic-apm-node')
const TraceContext = require('elastic-apm-node/lib/tracecontext')

function getStringMessageAttribute(attrs, name) {
  const keys = Object.keys(attrs)
  const nameLower = name.toLowerCase()
  for (let i = 0; i < keys.length; i++) {
    const k = keys[i]
    if (attrs[k].dataType !== 'String') {
      continue
    }
    const kLower = k.toLowerCase()
    if (kLower === nameLower) {
      return attrs[k].stringValue
    }
  }
}

exports.handler = async function myPlayHandler (event, context) {
  const traceparent = getStringMessageAttribute(event.Records[0].messageAttributes, 'traceparent')
  const tracestate = getStringMessageAttribute(event.Records[0].messageAttributes, 'tracestate')
  if (traceparent) {
    // Hackily replace the current transaction's internal attribute holding its
    // trace-context. This isn't meant to change after Transaction creation,
    // but might work for now.
    apm.currentTransaction._context = TraceContext.startOrResume(traceparent, apm._conf, tracestate)
  }

  // ...
  return {
    statusCode: 200,
    body: 'hi from myPlayHandler'
  }
}

Perhaps this is a hacky workaround you could use for now.

I did not test this at all thoroughly. I'd be interested to know if it works for you.

lageorgem commented 1 year ago

Hello again,

I have tested it out, and it works in linking the services correctly, but it seems the subsequent spans inside the services don't show up anymore. Do you know why that may be?

trentm commented 1 year ago

@lageorgem As a simple test to see if a child span created after the hacking of apm.currentTransaction._context worked, I tried this:

const apm = require('elastic-apm-node')
const TraceContext = require('elastic-apm-node/lib/tracecontext')
const {setTimeout: setTimeoutP} = require('timers/promises');

function getStringMessageAttribute(attrs, name) {
  const keys = Object.keys(attrs)
  const nameLower = name.toLowerCase()
  for (let i = 0; i < keys.length; i++) {
    const k = keys[i]
    if (attrs[k].dataType !== 'String') {
      continue
    }
    const kLower = k.toLowerCase()
    if (kLower === nameLower) {
      return attrs[k].stringValue
    }
  }
}

exports.handler = async function myPlayHandler (event, context) {
  const traceparent = getStringMessageAttribute(event.Records[0].messageAttributes, 'traceparent')
  const tracestate = getStringMessageAttribute(event.Records[0].messageAttributes, 'tracestate')
  if (traceparent) {
    // Hackily replace the current transaction's internal attribute holding its
    // trace-context. This isn't meant to change after Transaction creation,
    // but might work for now.
    apm.currentTransaction._context = TraceContext.startOrResume(traceparent, apm._conf, tracestate)
  }

  var span = apm.startSpan('mySpan')  // <----  Adding a span here.
  await setTimeoutP(200)
  span.end()

  return {
    statusCode: 200,
    body: 'hi from myPlayHandler'
  }
}

Then I called that lambda function via an SQS trigger with an SQS message which included this message attribute:

  "traceparent": {
    "DataType": "String",
    "StringValue": "00-12345678901234567890123456789012-1234567890123456-01"
  },

The result was a trace that includes mySpan as a child of the transaction:

and the transaction has the expected parent.id and trace.id from the sent traceparent.

Granted this is a simple single-transaction test.

Can we dig into what the subsequent spans inside the services means in your case? What instrumentation is creating these spans? Manual instrumentation via the Elastic APM agent API? Automatic APM agent instrumentation? If so, instrumentation of which library?

Are you doing the hacking of apm.currentTransaction._context immediately at the start of your handler function before any other processing is going on?

lageorgem commented 1 year ago

Hello,

I've realised I did some APM actions before doing the snippet you provided. I am now executing your code at the very start of the function and it works properly.

Should I keep this thread open?

trentm commented 1 year ago

@lageorgem Yes, please, for the time being let's keep this open. This is the motivating use case for discussion on a supported way to handle trace-propagation based on the first message in a received batch.

elastic / apm-agent-nodejs

Allow distributed tracing for non-API Gateway trigger type #3052