edx / edx-arch-experiments

A plugin to include applications under development by the architecture team at edx
GNU Affero General Public License v3.0

Disabling NR APM causes trace concatenation in Datadog #692

Open timmc-edx opened 1 week ago

timmc-edx commented 1 week ago

Ultimately, this ticket is for disabling New Relic APM across edxapp. We ran into trace-related issues in DD when we first attempted to disable NR APM. We later caused the same issue in Edge simply by disabling NR Tracing.

Acceptance criteria

Details

When we disabled NR APM in edxapp on June 6 we observed two anomalies with traces:

However, we believe the actual traffic was unchanged. This is corroborated by the Django hit metrics remaining steady, as seen in the Service Catalog. We cannot find any relevant code or config changes that would have been deployed around that time.

Our current understanding is that the majority of Django web requests that are traced are not recorded as service entry spans, but are instead parented to a different trace. This causes several problems:


We can also reproduce this issue by setting "Tracing type: None" in the NR application settings (it is usually set to Distributed Tracing).

Links

robrap commented 1 week ago

Other notes:

robrap commented 1 week ago

Additional thoughts, questions, ideas:

robrap commented 5 days ago

Currently, we're investigating whether using a NR Free Tier account for edxapp is enough to get DD traces working.

Other possibilities include getting tracing (or APM) disabled everywhere in Edge. That would cover the following services, where Spans were found in the last day:

prod-edge-edxapp-cms 9.28 M
prod-edge-analytics-api 2.35 M
prod-edge-notes 2.3 M
prod-edge-edxapp-workers-lms 755 k
prod-edge-forum 675 k
prod-edge-edxapp-workers-cms 188 k

robrap commented 3 days ago

[idea] We might want 3 modes for our hacked NR agent:

  1. Send no data to NR, but add tracing info (fixes DD traces).
  2. Send no data to NR, and fake a bad account id so we don't even take a performance hit (breaks DD traces).
  3. Send data to NR (costs money, but if we want to verify anything temporarily, we can).
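
To make the three modes concrete, here is a rough Python sketch of how they might be modeled as a single setting. Everything in it is hypothetical: the names `AgentMode`, `AgentBehavior`, and `behavior_for` are made up for illustration and are not part of the NR agent or of any existing code. It also assumes that mode 3 keeps adding tracing info the way the agent does today, which the list above does not state.

```python
from dataclasses import dataclass
from enum import Enum


class AgentMode(Enum):
    """Hypothetical modes for the hacked NR agent (names are illustrative)."""
    TRACE_ONLY = 1  # mode 1: no data to NR, tracing info still added
    DISABLED = 2    # mode 2: no data to NR, fake bad account id
    FULL = 3        # mode 3: data is sent to NR


@dataclass(frozen=True)
class AgentBehavior:
    sends_data_to_nr: bool
    adds_tracing_info: bool      # per the list above, DD traces depend on this
    uses_fake_account_id: bool


# Mapping of each mode to the behavior described in the list above.
# ASSUMPTION: mode 3 adds tracing info, as the unmodified agent does today.
BEHAVIORS = {
    AgentMode.TRACE_ONLY: AgentBehavior(False, True, False),
    AgentMode.DISABLED: AgentBehavior(False, False, True),
    AgentMode.FULL: AgentBehavior(True, True, False),
}


def behavior_for(mode_name: str) -> AgentBehavior:
    """Look up behavior from a setting value such as 'trace_only'."""
    return BEHAVIORS[AgentMode[mode_name.upper()]]
```

A single enum-valued setting would keep invalid combinations (for example, sending data while using a fake account id) unrepresentable, which seems preferable to three independent booleans.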