edx / edx-arch-experiments

A plugin to include applications under development by the architecture team at edx
GNU Affero General Public License v3.0

Disabling NR APM causes trace concatenation in Datadog #692

Closed by timmc-edx 1 month ago

timmc-edx commented 5 months ago

Ultimately, this ticket is for disabling New Relic APM across edxapp. We ran into trace-related issues in DD when we first attempted to disable NR APM. We later caused the same issue in Edge simply by disabling NR Tracing.

This bug has been observed in edxapp (LMS and CMS), enterprise-catalog, and registrar. It can be identified by searching for spans matching operation_name:django.request -@_top_level:*.
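To make the search query concrete, here is a minimal Python sketch of the same filter applied to spans that have already been exported. The dict shape (`operation_name`, `attributes`, `span_id`) is an assumption for illustration only, not the format of any Datadog API; the logic simply mirrors "operation is django.request and the _top_level attribute is absent".

```python
from typing import Any


def is_suspect_span(span: dict[str, Any]) -> bool:
    """Mirror the query `operation_name:django.request -@_top_level:*`.

    A span is suspect when it is a Django request span that carries no
    `_top_level` attribute at all. The dict layout is hypothetical.
    """
    return (
        span.get("operation_name") == "django.request"
        and "_top_level" not in span.get("attributes", {})
    )


def find_suspect_spans(spans: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return only the spans that match the suspect-span filter."""
    return [span for span in spans if is_suspect_span(span)]


# Hypothetical sample data: one healthy request span, one affected
# request span, and one unrelated span.
spans = [
    {"span_id": "a", "operation_name": "django.request", "attributes": {"_top_level": 1}},
    {"span_id": "b", "operation_name": "django.request", "attributes": {}},
    {"span_id": "c", "operation_name": "mysql.query", "attributes": {}},
]

suspect_ids = [span["span_id"] for span in find_suspect_spans(spans)]
print(suspect_ids)  # ['b']
```

Only span "b" matches: it is a django.request span without `_top_level`, which is the signature of a request that was parented to another trace instead of being recorded as a service entry span.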

Acceptance criteria

Things we have already tried

These should be checked off once they have already been either reverted or made permanent:

When we disabled NR APM in edxapp on June 6 we observed two anomalies with traces:

However, we believe the actual traffic was unchanged. This is corroborated by the Django hit metrics remaining steady, as seen in the Service Catalog. We cannot find any relevant code or config changes that would have been deployed around that time.

Our current understanding is that the majority of Django web requests that are traced are not recorded as service entry spans, but are instead parented to a different trace. This causes several problems:


We can also reproduce this issue by setting "Tracing type: None" in the application settings in NR (usually set to Distributed Tracing).

Links

robrap commented 5 months ago

Other notes:

robrap commented 5 months ago

Additional thoughts, questions, ideas:

robrap commented 5 months ago

Currently, we're investigating whether using a NR Free Tier account for edxapp is enough to get DD traces working.

Another possibility is to get tracing (or APM) disabled everywhere in Edge. That would cover every Edge service where spans were found in the last day:

prod-edge-edxapp-cms 9.28 M
prod-edge-analytics-api 2.35 M
prod-edge-notes 2.3 M
prod-edge-edxapp-workers-lms 755 k
prod-edge-forum 675 k
prod-edge-edxapp-workers-cms 188 k

robrap commented 5 months ago

[idea] We might want 3 modes for our hacked NR agent:

  1. Send no data to NR, but add tracing info (fixes DD traces).
  2. Send no data to NR, and fake a bad account id so we don't even take a performance hit (breaks DD traces).
  3. Send data to NR (costs money, but if we want to verify anything temporarily, we can).
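
As a rough sketch of how these three modes could be modeled, the snippet below encodes each mode and its expected side effects. All names (`AgentMode`, `ModeBehavior`, `BEHAVIORS`) are hypothetical and do not exist in the NR agent. The `dd_traces_ok` value for mode 3 is an assumption based on DD traces working before NR APM was disabled; the other values come from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class AgentMode(Enum):
    """Hypothetical modes for the patched NR agent, numbered as in the list."""

    TRACE_INFO_ONLY = 1  # no data to NR, tracing info still added
    FULLY_DISABLED = 2   # no data to NR, fake bad account id
    REPORTING = 3        # data sent to NR as usual


@dataclass(frozen=True)
class ModeBehavior:
    sends_data_to_nr: bool  # whether the mode incurs NR cost
    dd_traces_ok: bool      # whether DD traces are expected to be intact


BEHAVIORS = {
    AgentMode.TRACE_INFO_ONLY: ModeBehavior(sends_data_to_nr=False, dd_traces_ok=True),
    AgentMode.FULLY_DISABLED: ModeBehavior(sends_data_to_nr=False, dd_traces_ok=False),
    # Assumption: normal reporting keeps DD traces intact, as before the change.
    AgentMode.REPORTING: ModeBehavior(sends_data_to_nr=True, dd_traces_ok=True),
}


def free_modes_with_working_traces() -> list[AgentMode]:
    """Modes that send nothing to NR and still keep DD traces intact."""
    return [
        mode
        for mode, behavior in BEHAVIORS.items()
        if not behavior.sends_data_to_nr and behavior.dd_traces_ok
    ]


print(free_modes_with_working_traces())
```

Under these assumptions, only mode 1 avoids NR cost while keeping DD traces usable, which is why it is the interesting one for this ticket.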

timmc-edx commented 1 month ago

We'll leave the datadog_diagnostics cleanup task for about a week, to be done on 2024-10-14 or later. Moving to blocked until then.