aspecto-io / opentelemetry-ext-js

js extensions for the open-telemetry project
https://www.aspecto.io
Apache License 2.0

Question: aws-sdk with sqs-consumer #182

Closed arperrin closed 3 years ago

arperrin commented 3 years ago

I have 2 services, a producer and consumer communicating over sqs.

I am using https://github.com/aspecto-io/opentelemetry-ext-js/tree/master/packages/instrumentation-aws-sdk together with https://www.npmjs.com/package/sqs-consumer. I am not loading instrumentation-aws-sdk within sqs-consumer itself; I’m loading it in the consumer service (which uses sqs-consumer).

Spans are being exported, so the aws-sdk instrumentation is clearly at least partly working; otherwise I would not expect to see any spans at all. But I am losing context propagation. The producer is correctly placing traceparent in the message attributes.

I’m posting here as a sanity check:

  1. Is this expected behavior?
  2. I cannot be the first to encounter this, as sqs-consumer is a popular library. Has anyone else using this combination of libraries already solved this problem? I am asking around before I go and write an sqs-consumer otel instrumentation library.

Version info: "aws-sdk": "^2.699.0", "sqs-consumer": "^5.4.0", "opentelemetry-instrumentation-aws-sdk": "^0.24.2",

blumamir commented 3 years ago

Hi @arperrin, I just checked with the sqs-consumer package and things seem to be working as expected. What do you mean by "But I am losing context propagation"? Are you expecting the message processing spans to have the same trace id as the publishing span? If so, that is not how OpenTelemetry works. You can read more about it in the semantic convention for messaging systems (https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/messaging.md#messaging-systems). Specifically, look at the example for batch receive (https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/messaging.md#batch-receiving), which is how SQS operates: each receiveMessage call to AWS actually receives a batch of messages, most likely each with a different trace context.

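Instead, each processing span creates a link (https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#specifying-links) that references the relevant "sendMessage" span.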
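As a rough illustration of what "create a link" means here, the sketch below parses a W3C traceparent value out of an SQS message's attributes and turns it into a link-shaped object. It is plain JavaScript with no dependencies. `parseTraceparent` and `linkFromSqsMessage` are hypothetical helper names, not functions exported by instrumentation-aws-sdk, the sample IDs are made up, and only the four-field version of the header is handled.

```javascript
// Sketch only: hypothetical helpers, not part of instrumentation-aws-sdk.
// W3C traceparent layout: version-traceId-parentSpanId-flags (lowercase hex).
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(value) {
  const match = TRACEPARENT_RE.exec(value || '');
  if (!match) return undefined;
  const [, version, traceId, spanId, flags] = match;
  // Version "ff" and all-zero ids are invalid per the W3C Trace Context spec.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, traceFlags: parseInt(flags, 16) };
}

// Builds an object shaped like a span link ({ context: spanContext })
// from the message attributes of a received SQS message.
function linkFromSqsMessage(message) {
  const attr = message.MessageAttributes && message.MessageAttributes.traceparent;
  const spanContext = parseTraceparent(attr && attr.StringValue);
  return spanContext ? { context: spanContext } : undefined;
}

// Example message in the shape aws-sdk v2 returns from receiveMessage (values invented).
const message = {
  Body: 'hello',
  MessageAttributes: {
    traceparent: {
      DataType: 'String',
      StringValue: '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
    },
  },
};

const link = linkFromSqsMessage(message);
console.log(link);
```

The point of the sketch is that the producer's trace id and span id survive on the consumer side, but as a reference attached to the processing span rather than as its parent.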

If it's not working for your app, please post a short snippet of the relevant part of your code here, along with the spans that are created. Thanks, Amir

arperrin commented 3 years ago

I am expecting the receive span to have the same parentId, similar to amqp context propagation. I have services A, B, and C. A sends a message to B via amqp. B posts a message to sqs, C receives it.

If I query the spans in a telemetry platform (in my case, Honeycomb) I can view the lifecycle of a request from A to B, but then I lose context to C. If I search for C by name, I can see spans, but they have no parent. This makes it more difficult for me to keep track of the trace from A to C.

If the consumer does not use the parent Id, what is the point of the traceparent placed in the message attributes by the producer? Is that what ends up in the link?

What I am attempting to do is to add span attributes to A, B, and C, to correlate the state of the system at each step where certain conditions are met (e.g. query all traces where Service B shows a value of "123" for attribute X, and Service C shows a value of "null" for attribute Y). This is something that is possible, according to conversations with Honeycomb, Lightstep, and Datadog. I have yet to see this actually working on any of those platforms, however.
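A minimal sketch of how such a query could be expressed over exported span data when the consumer spans reference the producer through links rather than a shared trace id. This is plain JavaScript over invented span objects; the span shape, the `correlate` helper, and the attribute names X and Y are assumptions for illustration and do not reflect how Honeycomb, Lightstep, or Datadog implement it.

```javascript
// Sketch: correlate spans across traces by following span links.
// Returns downstream spans that match `matchesDownstream` and link to any
// trace containing a span that matches `matchesUpstream`.
function correlate(spans, matchesUpstream, matchesDownstream) {
  const upstreamTraceIds = new Set(spans.filter(matchesUpstream).map((s) => s.traceId));
  return spans.filter(
    (s) =>
      matchesDownstream(s) &&
      (s.links || []).some((l) => upstreamTraceIds.has(l.context.traceId))
  );
}

// Invented sample data: two traces from service B, two from service C.
const spans = [
  { service: 'B', traceId: 't1', spanId: 'b1', attributes: { X: '123' } },
  { service: 'B', traceId: 't2', spanId: 'b2', attributes: { X: '456' } },
  { service: 'C', traceId: 't3', spanId: 'c1', attributes: { Y: null },
    links: [{ context: { traceId: 't1', spanId: 'b1' } }] },
  { service: 'C', traceId: 't3', spanId: 'c2', attributes: { Y: null },
    links: [{ context: { traceId: 't2', spanId: 'b2' } }] },
  { service: 'C', traceId: 't4', spanId: 'c3', attributes: { Y: 'set' },
    links: [{ context: { traceId: 't1', spanId: 'b1' } }] },
];

const hits = correlate(
  spans,
  (s) => s.service === 'B' && s.attributes.X === '123',
  (s) => s.service === 'C' && s.attributes.Y === null
);
console.log(hits.map((s) => s.spanId));
```

Whether a given backend can run this kind of join on links is a separate question; the sketch only shows that the information needed for it is present in the spans.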

blumamir commented 3 years ago

I understand your expectation, and at first sight the current implementation, which follows the OpenTelemetry specification, might indeed look weird.

The difference between amqp and SQS is the interface for consuming messages. The amqplib package API and protocol implementation do not support batch consume, which makes life much easier. For SQS, however, the receive mechanism is long polling, and messages arrive in batches of up to 10. This makes everything much more complex. If we were to connect the SQS message processing span to the publish span, we would break the link to the RPC call to AWS, and we would not be able to express any operations executed at the batch level. While that could improve some traces (like the use case you stated above), it would make things harder for users with different needs.
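To make the batch problem concrete, here is a toy model in plain JavaScript. It uses no OpenTelemetry packages; the span objects are plain data, and `modelBatch` and all IDs are invented for illustration. It assumes, as described above, that each processing span stays under the receive RPC span and carries a link to its own producer.

```javascript
// Toy model; these are plain objects, not real OpenTelemetry spans.
let nextId = 0;
const newSpanId = () => (++nextId).toString(16).padStart(16, '0');

// One receiveMessage call yields one receive span plus one processing span
// per message. Every processing span is a child of the receive span and
// links back to the span context its message was produced under.
function modelBatch(receiveTraceId, messages) {
  const receiveSpan = {
    name: 'receive',
    traceId: receiveTraceId,
    spanId: newSpanId(),
    parentSpanId: undefined,
  };
  const processSpans = messages.map((m) => ({
    name: 'process',
    traceId: receiveTraceId,
    spanId: newSpanId(),
    parentSpanId: receiveSpan.spanId,
    links: [{ context: { traceId: m.producerTraceId, spanId: m.producerSpanId } }],
  }));
  return { receiveSpan, processSpans };
}

// Two messages in the same batch, published under two different traces.
const batch = [
  { producerTraceId: 'a'.repeat(32), producerSpanId: '1'.repeat(16) },
  { producerTraceId: 'b'.repeat(32), producerSpanId: '2'.repeat(16) },
];

const { receiveSpan, processSpans } = modelBatch('c'.repeat(32), batch);
const linkedTraces = new Set(processSpans.map((s) => s.links[0].context.traceId));
console.log(linkedTraces.size); // number of distinct producer traces in this batch
```

Because a single batch can reference several producer traces, the receive span cannot take any one of them as its parent; links let each message keep its own reference.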

There is a new OpenTelemetry SIG that discusses messaging systems semantics. You can read more about it here, and I'll be happy to see you there. At these meetings, we will discuss the various options and their tradeoffs, both for end users with different expectations and for instrumentation implementations and processing pipelines.

"If the consumer does not use the parent Id, what is the point of the traceparent placed in the message attributes by the producer? Is that what ends up in the link?"

Yes, the context is extracted and added to the link.

"This is something that is possible, according to conversations with Honeycomb, Lightstep, and Datadog"

I am not aware of how these platforms suggest doing it, but I will be happy to learn. Feel free to share your findings here.