Closed (timmc-edx closed 1 month ago)
Other notes:
6f17017-5957
Additional thoughts, questions, ideas:
Currently, we're investigating whether using a NR Free Tier account for edxapp is enough to get DD traces working.
Another possibility is to disable tracing (or APM) everywhere in Edge. That would cover the following services, where spans were found in the last day:
prod-edge-edxapp-cms 9.28 M
prod-edge-analytics-api 2.35 M
prod-edge-notes 2.3 M
prod-edge-edxapp-workers-lms 755 k
prod-edge-forum 675 k
prod-edge-edxapp-workers-cms 188 k
[idea] We might want 3 modes for our hacked NR agent:
We'll leave the datadog_diagnostics cleanup task for about a week and do the cleanup on 2024-10-14 or later. Moving to blocked until then.
Ultimately, this ticket is for disabling New Relic APM across edxapp. We ran into trace-related issues in DD when first attempting to disable NR APM. We later caused the same issue in Edge simply by disabling NR Tracing.
This bug has been observed in edxapp (LMS and CMS), enterprise-catalog, and registrar. It can be identified by searching for spans matching the following query:
operation_name:django.request -@_top_level:*
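Read literally, the query selects django.request spans that carry no _top_level attribute, that is, request spans that were not recorded as service entry spans. A minimal Python sketch of the same filter follows. It is applied to span records represented as plain dictionaries; the dictionary layout, the sample data, and the function name find_anomalous_request_spans are assumptions for illustration, not a Datadog export format or API.

```python
def find_anomalous_request_spans(spans):
    """Return django.request spans that lack a _top_level attribute.

    Mirrors the search `operation_name:django.request -@_top_level:*`.
    The span layout (dicts with "operation_name" and "attributes") is
    hypothetical and only serves this illustration.
    """
    return [
        span
        for span in spans
        if span.get("operation_name") == "django.request"
        and "_top_level" not in span.get("attributes", {})
    ]


# Hypothetical sample data: only span 2 is a request span without _top_level.
spans = [
    {"span_id": 1, "operation_name": "django.request", "attributes": {"_top_level": 1}},
    {"span_id": 2, "operation_name": "django.request", "attributes": {}},
    {"span_id": 3, "operation_name": "celery.run", "attributes": {}},
]
anomalous_ids = [s["span_id"] for s in find_anomalous_request_spans(spans)]
print(anomalous_ids)  # [2]
```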
Acceptance criteria
Things we have already tried
These should be checked off once they have already been either reverted or made permanent:
DD_DJANGO_INSTRUMENT_MIDDLEWARE to reduce the noise when debugging huge traces.
DD_TRACE_HEADER_TAGS to debug tracing headers
operation_name:django.request on All Spans since service entry spans were unreliable.
EDXAPP_NEWRELIC_LICENSE_TEST_FREE) and removed from AWS secrets manager
DD_TRACE_PROPAGATION_STYLE_EXTRACT=none
service:edx-edxapp-* dirname:"/edx/var/log/supervisor" "[edx_arch_experiments.datadog_diagnostics.middleware]"
EXTRA_MIDDLEWARE_CLASSES Django setting) [stage and prod, edge]
DATADOG_DIAGNOSTICS_) -- if it's just DATADOG_DIAGNOSTICS_ENABLE it can be merged in any order, as it's just controlling noisy logs we don't have any more. [prod LMS was only instance]
datadog.diagnostics.) -- merge in any order, as these turn features on
DD_TRACE_CELERY_ENABLED=false, because some of the request spans in anomalous traces have missing parent spans that were celery-related.
DATADOG_DIAGNOSTICS_CELERY_LOG_SIGNALS (using edx-arch-experiments 4.3.0)
EDXAPP_DDTRACE_PIP_SPEC) that closes celery spans using a fallback
EDXAPP_DDTRACE_PIP_SPEC
Details
When we disabled NR APM in edxapp on June 6 we observed two anomalies with traces:
service:edx-edxapp-lms env:prod
dropped precipitously by 2-3x.
However, we believe the actual traffic was unchanged. This is corroborated by the Django hit metrics remaining steady, as seen in the Service Catalog. We cannot find any relevant code or config changes that would have been deployed around that time.
Our current understanding is that the majority of Django web requests that are traced are not recorded as service entry spans, but are instead parented to a different trace. This causes several problems:
We can also reproduce this issue by setting "Tracing type: None" in the application settings in NR (usually set to Distributed Tracing).
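The parenting problem described above can be illustrated with a simplified model. The sketch below assumes that a span counts as a service entry span when it has no parent or when its parent belongs to a different service. This rule is an approximation for illustration and may not match Datadog's exact implementation; the Span class, the is_service_entry function, and the sample spans are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical, minimal span record used only for this illustration."""
    name: str
    service: str
    parent: Optional["Span"] = None


def is_service_entry(span: Span) -> bool:
    """Assumed rule: no parent, or the parent belongs to another service."""
    return span.parent is None or span.parent.service != span.service


# Expected case: the web request is the first span of its service.
healthy = Span(name="django.request", service="edx-edxapp-lms")

# Anomalous case (assumed shape): the request span is parented to another
# span of the same service, so it is no longer counted as an entry span.
other = Span(name="celery.run", service="edx-edxapp-lms")
anomalous = Span(name="django.request", service="edx-edxapp-lms", parent=other)

print(is_service_entry(healthy))    # True
print(is_service_entry(anomalous))  # False
```

Under this model, any view that counts only service entry spans would undercount requests while Django hit metrics stay steady, which is consistent with the drop described above.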
Links