Spec adding span links for instrumentation of messaging systems

trentm commented 2 years ago

Specify that span links should be used on "messaging" spans and on "messaging" transactions that cover a batch of messages, when messages include trace context.
Update Lambda instrumentation for SQS/SNS to no longer special-case a batch with a single message.
Document which message metadata mechanism to use for propagating trace-context for each messaging system.

Closes: #606

Notes for reviewers

This adds links to "span-links.md" that will be added in https://github.com/elastic/apm/pull/600

This is a new spec or a bigger enhancement

May the instrumentation collect sensitive information, such as secrets or PII (ex. in headers)?
- [ ] Yes
- [ ] Add a section to the spec how agents should apply sanitization (such as sanitize_field_names)
- [ ] No
- [ ] Why?
- [x] n/a
[x] Link to meta issue: #605
[x] Create PR as draft
[x] Approval by at least one other agent
[x] Mark as Ready for Review (automatically requests reviews from all agents and PM via CODEOWNERS)
- Remove PM from reviewers if impact on product is negligible
- Remove agents from reviewers if the change is not relevant for them
[x] Approved by at least 2 agents + PM (if relevant)
[x] Merge after 7 days passed without objections
[x] Create implementation issues (ideally add a milestone) - had already been done
[x] Create a status table and add it to the meta issue - had already been done

apmmachine commented 2 years ago

:green_heart: Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS

Expand to view the summary

#### Build stats * Start Time: 2022-04-29T04:16:00.081+0000 * Duration: 3 min 21 sec

trentm commented 2 years ago

[@eyalkoren]

Any spec dealing with messaging should emphasize the fact that whenever there's a chance, if the auto-instrumentation can capture the single-message handling, that is highly preferable.

For example, the Java SQS event records accessor returns a java.util.List<SQSEvent.SQSMessage>. Disregarding the required effort (because I didn't investigate that yet), we may be able to wrap this list and use its iterator to create a transaction per-message, like we do for Kafka.

I didn't change the existing language in the messaging spec that describes a preference for having a transaction-per-message when possible:

When active or passive message reception results in receiving a batch of messages, a messaging transaction SHOULD be started and ended for the processing of each message in the batch, if possible. ...
I did effectively remove mention of having a transaction per message (and possibly continuing a trace, rather than a span link) in the Lambda instrumentation spec.

For SQS triggers, this is because (a) the lambda invocation is already a transaction and (b) AFAIK there is no way for a Lambda invocation to know if the SQS trigger is explicitly set to a batch size of 1 -- I don't think a Lambda function gets any context about that. So I think at best this could be configurable for users: "yes my SQS trigger is batchSize=1, or it is batchSize >1 but I want the special casing to get nice distributed traces".

For SNS, my limited understanding (inferring from AWS docs and some googling) is that there will always be just one message in the Lambda trigger. However my impression also was that SNS is less "important" to us than the SQS instrumentation/customer-usage so I wasn't sure about having the spec depend on that. I may have misinterpreted this "SNS is less important".

In addition, a span link should be added to message receiving spans, if the receive/read operation returns a message that has a trace context attached to it.

Yes, that is covered here: https://github.com/elastic/apm/pull/616/files#diff-46364b78a8e8440114bb8b3bfec19c7088b1ecef27f80eeeb3b37f908f96d3cbR71-R92

eyalkoren commented 2 years ago

For SQS triggers, this is because (a) the lambda invocation is already a transaction...

Sorry for my ignorance, my experience with Lambda is only reviewing and testing the Java agent implementation - why is that? Why can't we use the lambda invocation as the instrumentation trigger, but create multiple transactions? How is that different from having a Kafka client that periodically reads batches of records and we create a transaction for each record-handling rather than the record-read operation?

For SNS, my limited understanding (inferring from AWS docs and some googling) is that there will always be just one message in the Lambda trigger

You know more than I do, but if that's the then it definitely worth it's dedicated section in my opinion. Again, span links are great, but for now they are not a replacement for trace propagation. It may be the case if we learn how to use them to create service maps and draw a complete trace.

I am insisting because IIUC, the current SQS/SNS effectively says that such environments will not support full traces and service maps. If I am not correct - great. Otherwise, let's just agree that we are OK with that. I understand it is impossible or not worth the effort in some cases, but the spec is for all cases. BTW, it may not worth the effort in Java as well, I still didn't really look into that...

trentm commented 2 years ago

SQS

(If it helps for reference, here are the AWS docs for SQS-triggering of Lambda functions: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html)

[@eyalkoren]

Why can't we use the [SQS-triggered] lambda invocation as the instrumentation trigger, but create multiple transactions?

My turn for ignorance. My understanding is that for a Java lambda handler you'll have:

public class ProcessSQSMessageBatch implements RequestHandler<SQSEvent, SQSBatchResponse> {
    @Override
    public SQSBatchResponse handleRequest(SQSEvent sqsEvent, Context context) {
...

where there is a sqsEvent.getRecords() to get the individual messages. Could the Java agent instrument that getRecords() and "create a transaction for each record-handling"? If so, then I should update this spec to discuss doing that.

In Node.js-land, there is just an event object that has a event.records field that is a plain JS Array instance. There is no way to instrument handling of individual elements of that array. We had briefly discussed the possibility of docs and/or utilities in the APM agents for users to manually instrument their code to get trace propagation. (An earlier example of that is https://www.elastic.co/guide/en/apm/agent/nodejs/current/message-queues.html#message-queues-distributed-tracing which was added when SQS support was first added to the Node.js APM agent.)

SNS

For SNS, my limited understanding (inferring from AWS docs and some googling) is that there will always be just one message in the Lambda trigger

My main concern here is that I'm not 100% sure I'm correct here. I think I'm right. My understanding is from https://stackoverflow.com/a/33692863/14444044 and this entry in https://aws.amazon.com/sns/faqs/

Q: Will a notification contain more than one message?

No, all notification messages will contain a single published message.

@felixbarny @AlexanderWert Do you have an opinion on my adding back special-case handling of SNS Lambda triggers with a single message, based on the understanding (possibly wrong) that they always only ever have a single message. This would allow us to have full traces for SNS.

eyalkoren commented 2 years ago

Could the Java agent instrument that getRecords() and "create a transaction for each record-handling"?

That is exactly the theory 😅 . Since it returns a java.util.List, we should be able to do exactly what we do with the Kafka client - wrapping the returned list to trace message iteration.

In Node.js-land, there is just an event object that has a event.records field that is a plain JS Array instance.

I realize that it's way more complex in Node, I had this discussion with @astorm before. My proposal is not to enforce that through the spec, but to:

make it clear that this is the preferable way wherever the effort is "reasonable" (don't ask me to quantify that)
propose ways to achieve that otherwise (e.g. document how to use the public API or framework APIs like interceptors or the like if there are such)

felixbarny commented 2 years ago

Again, span links are great, but for now they are not a replacement for trace propagation. It may be the case if we learn how to use them to create service maps and draw a complete trace.

We don't do it today, but there's nothing stopping us from considering span links in the service map. Span links will also be integrated in the trace waterfall so that you can navigate from the producer of a message to the trace that's batch consuming the message.

I don't think we should base general instrumentation decisions on current deficiencies of how we currently handle span links in the UI.

Having said that, I do agree that continuing the same trace is preferable where it's easy to achieve. For onMessage-style callbacks that have a single message in them it's a no-brainer. For more complex things like a list of messages that could be individually traced via an instrumented iterator, I'm torn. While it's great in most cases, there can be issues when some assumptions on how the list is consumed don't hold true. For example, if the list is iterated over multiple times or in different threads.

Sorry for my ignorance, my experience with Lambda is only reviewing and testing the Java agent implementation - why is that? Why can't we use the lambda invocation as the instrumentation trigger, but create multiple transactions? How is that different from having a Kafka client that periodically reads batches of records and we create a transaction for each record-handling rather than the record-read operation?

For Lambda specifically, I think there should be exactly one transaction per lambda invocation. Things like the cold start rate also rely on this. So while it's possible do that in theory, I don't think we should do it.

I am insisting because IIUC, the current SQS/SNS effectively says that such environments will not support full traces and service maps. If I am not correct - great. Otherwise, let's just agree that we are OK with that.

Given the improvements via span links, I'm ok with that. I realize that span links are not 100% on par with a continued trace but we can get pretty close.

@felixbarny @AlexanderWert Do you have an opinion on my adding back special-case handling of SNS Lambda triggers with a single message, based on the understanding (possibly wrong) that they always only ever have a single message. This would allow us to have full traces for SNS.

Another tricky case. As the API gives as a list or an array of SNS messages, it feels like relying on an implementation detail. Can we start with span links to make the implementation analogous to SQS and adjust later on, based on feedback?

Most of the things discussed here seem like something we can always change later on. I don't see having the perfect answer right now as critical. I'd rather the the initial spec in soonish and discuss enhancements separately.

@eyalkoren given the additional context, which are the points you'd want to have resolved before merging this and what can we discuss afterwards?

eyalkoren commented 2 years ago

I realize that span links are not 100% on par with a continued trace but we can get pretty close.

To me, supporting a technology with or without distributed tracing and service maps is not close at all. I would even say that distributed tracing is the single most important feature when comparing APM to other monitoring tools. If you think we can support distributed traces and service maps through span links with reasonable effort, I propose to prioritize that.

@eyalkoren given the additional context, which are the points you'd want to have resolved before merging this and what can we discuss afterwards?

I don't feel there's anything left to discuss, please go ahead and merge.

trentm commented 2 years ago

Thanks very much for the discussion, Eyal, Felix.

Can we start with span links to make the implementation analogous to SQS and adjust later on, based on feedback?

Sounds good.

AlexanderWert commented 2 years ago

@trentm are there any blockers for this to be merged?

trentm commented 2 years ago

@AlexanderWert No blockers. I ran out of time yesterday. :) Merging this now.

elastic / apm