edx / edx-arch-experiments

A plugin to include applications under development by the architecture team at edx
GNU Affero General Public License v3.0

Ensure Kafka Event Bus APM monitoring #658

Closed: robrap closed this issue 1 month ago

robrap commented 4 months ago

This ticket is for determining what we want and need from Kafka APM monitoring, and implementing or spinning off appropriate tickets.

Tasks:

Notes:

robrap commented 4 months ago
robrap commented 3 months ago
robrap commented 2 months ago

@dianakhuang: ~~This comment should probably be a new separate ticket, but adding it here to start. I noticed that the error in logs for "failed to send, dropping 1 traces to intake at unix:///var/run/datadog/apm.socket/v0.5/traces after 3 retries" seems to be hitting our kafka consumers. It may be hitting some other workers, but not sure if we just have inconsistent naming. I'm wondering if this has anything to do with the long-running infinite loop, and if we need to clean up the trace, like we clean up the db connection, etc.? I'm adding this here while you are thinking about this, but as I noted, it might need a separate ticket and separate DD support ticket.~~

UPDATE: This has been moved to a new ticket: https://github.com/edx/edx-arch-experiments/issues/736

robrap commented 2 months ago

@dianakhuang:

  1. I moved most of the service naming questions to other tickets.
  2. However, one question for this ticket is whether the new spans you will be creating should be root spans, or child spans of the operation_name:kafka.consume spans that are probably already available as the current span.
  3. I updated the proposed operation name to consumer.consume (to go with the existing kafka.consume) in the PR description.

UPDATE: Added point 3 as well.
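To make the root-vs-child question in point 2 concrete, here is a library-free sketch (the `Span` class below is a toy stand-in, not ddtrace's API): a child span joins the parent's trace by sharing its trace_id and recording the parent's span_id, while a root span starts a fresh trace.

```python
import itertools

_ids = itertools.count(1)  # simple deterministic ID source for the demo


class Span:
    """Toy stand-in for an APM span (not ddtrace's API)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.span_id = next(_ids)
        if parent is None:
            # Root span: starts a brand-new trace.
            self.trace_id = next(_ids)
            self.parent_id = None
        else:
            # Child span: joins the parent's trace.
            self.trace_id = parent.trace_id
            self.parent_id = parent.span_id


# The span the Kafka integration already creates:
kafka_consume = Span("kafka.consume")

# Option A: consumer.consume as a child of kafka.consume (same trace)
child = Span("consumer.consume", parent=kafka_consume)
assert child.trace_id == kafka_consume.trace_id

# Option B: consumer.consume as a root span (separate trace)
root = Span("consumer.consume")
assert root.trace_id != kafka_consume.trace_id
```

Making it a child keeps the consume work visible inside the existing Kafka trace; making it a root splits each message's processing into its own trace.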

robrap commented 2 months ago

What we want:

Ideas:

Questions:

robrap commented 2 months ago

Note: We may want to retain 100% of spans with the newly defined operation_name. We'll see.
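Retaining 100% of those spans would typically be a Datadog retention filter rather than application code. A sketch of what that filter might look like (the exact query syntax and filter name are assumptions to confirm against Datadog's retention filter docs):

```
# Datadog retention filter (configured in the Datadog UI or API) -- sketch
name:           keep-consumer-consume-spans
query:          operation_name:consumer.consume
retention rate: 100%
```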

timmc-edx commented 1 month ago

Datadog Support confirms that there is no automatic support for connecting the producer's trace to the spans that come out of the consumer's work. However, we can implement this ourselves if we need it:

Confirming that the functionality difference you've described between NR and DD currently does not exist for us OOTB, and would require some custom code to implement. One of our engineering folks provided this example, using the ddtrace propagator class, and using a manual span to house any post-message processing:

```python
from ddtrace import tracer
from ddtrace.propagation.http import HTTPPropagator as Propagator

msg = consumer.poll()

ctx = None
if msg is not None and msg.headers():
    # Kafka header values arrive as bytes; decode them so the
    # propagator gets the str -> str mapping it expects.
    headers = {k: v.decode("utf-8") for k, v in msg.headers() if v is not None}
    # Extract the distributed context from the message headers
    ctx = Propagator.extract(headers)

with tracer.start_span(
    name="kafka-message-processing",  # or whatever name they want for the manual span
    service="their service name",  # match their main service name
    child_of=ctx if ctx is not None else tracer.context_provider.active(),
    activate=True,
):
    # do any db or other operations that you want included in the distributed context
    db.execute()
```

One important note here: You'll want to ensure for both producer and consumer services, the following environment variable has been set: DD_KAFKA_PROPAGATION_ENABLED=true. Using this, the trace should include both producer and consumer spans as well as later operation spans.

(It would probably be more appropriate for us to use Span Links but those are only available via the OpenTelemetry integration.)
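To make the header handoff above concrete, here is a library-free sketch of what the propagator does with Kafka message headers. confluent-kafka represents headers as a list of (key, bytes) tuples, and the `x-datadog-*` names are Datadog's standard propagation headers; the helper function names themselves are invented for illustration.

```python
def inject_trace_headers(trace_id: int, span_id: int) -> list:
    """Producer side: encode the current trace context into Kafka headers.

    Mirrors what HTTPPropagator.inject does; Kafka header values must be bytes.
    """
    return [
        ("x-datadog-trace-id", str(trace_id).encode("utf-8")),
        ("x-datadog-parent-id", str(span_id).encode("utf-8")),
    ]


def decode_trace_headers(headers: list) -> dict:
    """Consumer side: turn (key, bytes) Kafka headers back into the
    str -> str mapping that HTTPPropagator.extract expects."""
    return {k: v.decode("utf-8") for k, v in headers if v is not None}


# Round trip: producer -> Kafka message -> consumer
headers = inject_trace_headers(trace_id=12345, span_id=67890)
ctx = decode_trace_headers(headers)
assert ctx["x-datadog-trace-id"] == "12345"
assert ctx["x-datadog-parent-id"] == "67890"
```

The decode step matters in practice: passing raw bytes values to the extractor is an easy way to silently fail to join the producer's trace.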

timmc-edx commented 1 month ago

^ Converted that distributed tracing info to its own ticket: https://github.com/edx/edx-arch-experiments/issues/758

robrap commented 1 month ago

Review and possibly update the following docs: