Ensure Kafka Event Bus APM monitoring

robrap commented 4 months ago

This ticket is for determining what we want and need from Kafka APM monitoring, and implementing or spinning off appropriate tickets.

Tasks:

[X] Open DD Support ticket for missing functionality. Maybe point to New Relic instrumentation code that does this?
[x] Add missing data to DD. This may take manual span instrumentation, rather than waiting on DD support ticket.
[x] Complete Kafka related monitors (for new data) and runbooks for platform-arch-bom-event-bus-safety-net. (Log based monitors exist, but are imperfect.)
[x] Implement dashboard: Event Bus Kafka overview
[x] @robrap: Move naming convention questions that are not needed as part of this ticket to a new ticket. (See https://github.com/edx/edx-arch-experiments/issues/740)

Notes:

Do we need to come up with an operation_name value in place of django.request (for example). Something like consumer.consume (to go with kafka.consume)?
In New Relic, there are Transactions of type Message for Kafka.
- See edx-prod-discovery example transactions.
- Example transaction name: OtherTransaction/Message/Kafka/Topic/Named/prod-course-catalog-info-changed
- Provides actual message processing time details, with Trace details.
- So far, I've been unable to find this type of information in Datadog.
Here is a doc for ddtrace Kafka integration.
- It claims it will work automatically, which may only create the service:kafka operation_name:kafka.consume spans that have limited information.
- DD_KAFKA_SERVICE could be used to change service:kafka from its default.
- The question of remapping this service as been moved to https://github.com/edx/edx-arch-experiments/issues/737.
Private ticket for Datadog Support regarding orphaned spans in Kafka consumers: https://help.datadoghq.com/hc/en-us/requests/1789792

robrap commented 4 months ago

I create a Slack topic with DD here: https://twou.slack.com/archives/C06QEAJHLC9/p1716589577574909?thread_ts=1716585225.853719&cid=C06QEAJHLC9 (in the private external channel).
Maybe we'll create a support ticket.

robrap commented 3 months ago

Is function_trace implemented for DD monitoring in edx-django-utils? See https://github.com/openedx/event-bus-kafka/blob/main/edx_event_bus_kafka/internal/consumer.py
Do we still need something that starts the parent spans?

robrap commented 2 months ago

@dianakhuang: ~~This comment should probably be a new separate ticket, but adding it here to start. I noticed that the error in logs for "failed to send, dropping 1 traces to intake at unix:///var/run/datadog/apm.socket/v0.5/traces after 3 retries" seems to be hitting our kafka consumers. It may be hitting some other workers, but not sure if we just have inconsistent naming. I'm wondering if this has anything to do with the long-running infinite loop, and if we need to clean up the trace, like we clean up the db connection, etc.? I'm adding this here while you are thinking about this, but as I noted, it might need a separate ticket and separate DD support ticket.~~

UPDATE: This has been moved to a new ticket: https://github.com/edx/edx-arch-experiments/issues/736

robrap commented 2 months ago

@dianakhuang:

I moved most of the service naming questions to other tickets.
However, one question for this ticket is whether the new spans you will be creating would be root spans, or if they should be child spans of the operation_name:kafka.consume spans, that are probably already available as the current span.
I updated the proposed operation name to consumer.consume (to go with the existing kafka.consume) in the PR description.

UPDATE: Added point 3 as well.

robrap commented 2 months ago

What we want:

Processing time of a message
Spans for mysql, cache, etc. that happen during the message processing
Span tags with topic, etc. in the root span.

Ideas:

Find an example kafka consume span (@kafka.received_message:True) that seems like it should be making requests or mysql spans, and ask DD support why these spans don't appear in the trace.
Check DD trace code for what it does in Kafka.

Questions:

Does the consume span close before getting the full processing time?

robrap commented 2 months ago

Note: We may want to retain 100% of spans with the newly defined operation_name. We'll see.

timmc-edx commented 1 month ago

Datadog Support confirms that there is no automatic support for connecting the producer's trace to the spans that come out of the consumer's work. However, we can implement this ourselves if we need it:

Confirming that the functionality difference you've described between NR and DD currently does not exist for us OOTB, and would require some custom code to implement. One of our engineering folks provided this example, using the ddtrace propagator class, and using a manual span to house any post-message processing:

from ddtrace import tracer, config
from ddtrace.propagation.http import HTTPPropagator as Propagator

msg = consumer.poll()

ctx = None
if msg is not None and msg.headers():
    # Extract the distributed context from message headers
    ctx = Propagator.extract(dict(msg.headers()))
with tracer.start_span(
    name="kafka-message-processing", # or whatever name they want from the manual span
    service="their service name", # match their main service name
    child_of=ctx if ctx is not None else tracer.context_provider.active(),
    activate=True
):
    # do any db or other operations that you want included in the distributed context
    db.execute()

One important note here: You'll want to ensure for both producer and consumer services, the following environment variable has been set: DD_KAFKA_PROPAGATION_ENABLED=true. Using this, the trace should include both producer and consumer spans as well as later operation spans.

(It would probably be more appropriate for us to use Span Links but those are only available via the OpenTelemetry integration.)

timmc-edx commented 1 month ago

^ Converted that distributed tracing info to its own ticket: https://github.com/edx/edx-arch-experiments/issues/758

robrap commented 1 month ago

Review and possibly update the following docs:

edx / edx-arch-experiments

Ensure Kafka Event Bus APM monitoring #658