Closed: robrap closed this issue 1 month ago.
Should `function_trace` be implemented for DD monitoring in edx-django-utils? See https://github.com/openedx/event-bus-kafka/blob/main/edx_event_bus_kafka/internal/consumer.py

@dianakhuang: ~~This comment should probably be a new separate ticket, but adding it here to start. I noticed that the "failed to send, dropping 1 traces to intake at unix:///var/run/datadog/apm.socket/v0.5/traces after 3 retries" error in the logs seems to be hitting our Kafka consumers. It may be hitting some other workers as well, but I'm not sure whether we just have inconsistent naming. I'm wondering if this has anything to do with the long-running infinite loop, and if we need to clean up the trace the same way we clean up the db connection, etc. I'm adding this here while you are thinking about this, but as noted, it might need a separate ticket and a separate DD support ticket.~~

UPDATE: This has been moved to a new ticket: https://github.com/edx/edx-arch-experiments/issues/736
@dianakhuang: The `operation_name:kafka.consume` spans are probably already available as the current span. Added `consumer.consume` (to go with the existing `kafka.consume`) in the PR description.

UPDATE: Added point 3 as well.
What we want:

Ideas:

- Find a span (e.g. one with `@kafka.received_message:True`) that seems like it should be making requests or mysql spans, and ask DD support why these spans don't appear in the trace.

Questions:

Note: We may want to retain 100% of spans with the newly defined `operation_name`. We'll see.
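For the idea above, a span search along these lines might surface candidate traces (a sketch; the exact Datadog query syntax is an assumption, and the tags are the ones mentioned in this ticket):

```
service:kafka operation_name:kafka.consume @kafka.received_message:True
```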
Datadog Support confirms that there is no automatic support for connecting the producer's trace to the spans that come out of the consumer's work. However, we can implement this ourselves if we need it:
Confirming that the functionality difference you've described between NR and DD currently does not exist for us OOTB, and would require some custom code to implement. One of our engineering folks provided this example, using the ddtrace propagator class, and using a manual span to house any post-message processing:
```python
from ddtrace import tracer
from ddtrace.propagation.http import HTTPPropagator as Propagator

msg = consumer.poll()

ctx = None
if msg is not None and msg.headers():
    # Extract the distributed context from the message headers
    ctx = Propagator.extract(dict(msg.headers()))

with tracer.start_span(
    name="kafka-message-processing",  # or whatever name they want for the manual span
    service="their service name",  # match their main service name
    child_of=ctx if ctx is not None else tracer.context_provider.active(),
    activate=True,
):
    # do any db or other operations that you want included in the distributed context
    db.execute()
```
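One caveat with the support example: confluent-kafka's `Message.headers()` returns a list of `(key, value)` tuples whose values are bytes, while the propagator expects string values, so normalizing first may be safer (a sketch; `headers_to_dict` is a hypothetical helper name):

```python
def headers_to_dict(headers):
    """Normalize Kafka message headers to a str -> str dict.

    confluent-kafka returns headers as a list of (str, bytes) tuples,
    while the ddtrace propagator expects string values.
    """
    return {
        key: value.decode("utf-8") if isinstance(value, bytes) else value
        for key, value in (headers or [])
    }
```

With this, the extraction line would become `ctx = Propagator.extract(headers_to_dict(msg.headers()))`.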
One important note here: you'll want to ensure that, for both producer and consumer services, the following environment variable has been set: `DD_KAFKA_PROPAGATION_ENABLED=true`. Using this, the trace should include both producer and consumer spans as well as later operation spans.
(It would probably be more appropriate for us to use Span Links but those are only available via the OpenTelemetry integration.)
^ Converted that distributed tracing info to its own ticket: https://github.com/edx/edx-arch-experiments/issues/758
This ticket is for determining what we want and need from Kafka APM monitoring, and implementing or spinning off appropriate tickets.

Tasks:

- Review and possibly update the following docs:

Notes:
- A custom `operation_name` value in place of `django.request` (for example). Something like `consumer.consume` (to go with `kafka.consume`)?
- In New Relic, consumer transactions were named like `OtherTransaction/Message/Kafka/Topic/Named/prod-course-catalog-info-changed`.
- In Datadog, we get `service:kafka` `operation_name:kafka.consume` spans that have limited information. `DD_KAFKA_SERVICE` could be used to change `service:kafka` from its default.
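Pulling the environment variables from this ticket together, the deployment config for the consumer (and, for propagation, the producer) might look like this (a sketch; `my-service-kafka` is a placeholder value):

```
DD_KAFKA_PROPAGATION_ENABLED=true   # required on both producer and consumer services
DD_KAFKA_SERVICE=my-service-kafka   # optional: overrides the default service:kafka
```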