Open robrap opened 4 months ago
Noting some examples of mistagged log files and the configs that captured them:
/edx/var/log/supervisor/lms_high_mem_1-stderr.log on host:i-0212cb4c9d98835fa is tagged with both service:edx-edxapp-lms-workers and service:edx-edxapp-cms-workers
path: "{{ COMMON_LOG_DIR }}/supervisor/lms_*.log" and service: edx-edxapp-lms-workers
service:edx-edxapp-cms-workers
/edx/var/log/lms/edx.log on host:i-0c0bf18165c20e4e6 is tagged with both service:edx-edxapp-cms-workers and service:edx-edxapp-lms
path: "{{ COMMON_LOG_DIR }}/lms/*" and service: edx-edxapp-lms
service:edx-edxapp-cms-workers is again a host tag
Could this just be another effect of https://github.com/edx/edx-arch-experiments/issues/724?
I'm not sure if the mistagging is related, but the duplicate log issue is old. Can we concentrate on resolving that issue first? The tagging issue may need to be its own ticket.
I've been having difficulty figuring out if there is duplicate logging. But I think I have an example of that now:
/edx/var/log/lms/edx.log tagged env:prod service:edx-edxapp-cms-workers service:edx-edxapp-lms:
Jul 17 19:13:12 ip-10-2-71-115 [service_variant=lms][celery.app.trace][env:prod-edx-edxapp] INFO [ip-10-2-71-115 2425] [user None] [ip None] [trace.py:128] - Task lms.djangoapps.gating.tasks.task_evaluate_subsection_completion_milestones[...UUID...] succeeded in 0.02717464299985295s: None
/edx/var/log/supervisor/lms_default_9-stderr.log tagged env:prod service:edx-edxapp-cms-workers service:edx-edxapp-lms-workers:
2024-07-17 19:13:12,544 INFO 2425 [celery.app.trace] [user None] [ip None] trace.py:128 - Task lms.djangoapps.gating.tasks.task_evaluate_subsection_completion_milestones[...UUID...] succeeded in 0.02717464299985295s: None
So here we have the same event duplicated between the service log (edx.log) and the supervisor logs.
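To make the comparison of the two example lines concrete, here is a rough Python sketch that reduces each format to a common key (time of day, level, PID, logger, source location, message) and checks that the keys match. The regexes are inferred only from the two sample lines in this comment, so treat them as assumptions about the formats rather than the actual logging configuration. The names event_key, SYSLOG_RE and SUPERVISOR_RE are made up for illustration.

```python
import re

# Pattern guessed from the /edx/var/log/lms/edx.log sample line (syslog style).
SYSLOG_RE = re.compile(
    r"^\w{3} +\d+ (?P<time>\d{2}:\d{2}:\d{2}) \S+ "
    r"\[service_variant=\w+\]\[(?P<logger>[^\]]+)\]\[env:[^\]]+\] "
    r"(?P<level>\w+) \[\S+ (?P<pid>\d+)\] \[user [^\]]*\] \[ip [^\]]*\] "
    r"\[(?P<loc>[^\]]+)\] - (?P<msg>.*)$"
)

# Pattern guessed from the supervisor lms_default_9-stderr.log sample line.
SUPERVISOR_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} (?P<time>\d{2}:\d{2}:\d{2}),\d+ "
    r"(?P<level>\w+) (?P<pid>\d+) \[(?P<logger>[^\]]+)\] "
    r"\[user [^\]]*\] \[ip [^\]]*\] (?P<loc>\S+) - (?P<msg>.*)$"
)

FIELDS = ("time", "level", "pid", "logger", "loc", "msg")


def event_key(line):
    """Return a format-independent tuple for a log line, or None if unparsed."""
    for pattern in (SYSLOG_RE, SUPERVISOR_RE):
        match = pattern.match(line)
        if match:
            return tuple(match.group(name) for name in FIELDS)
    return None


# The two sample lines quoted above, copied verbatim.
MESSAGE = (
    "Task lms.djangoapps.gating.tasks."
    "task_evaluate_subsection_completion_milestones[...UUID...] "
    "succeeded in 0.02717464299985295s: None"
)
edx_log_line = (
    "Jul 17 19:13:12 ip-10-2-71-115 "
    "[service_variant=lms][celery.app.trace][env:prod-edx-edxapp] INFO "
    "[ip-10-2-71-115 2425] [user None] [ip None] [trace.py:128] - " + MESSAGE
)
supervisor_line = (
    "2024-07-17 19:13:12,544 INFO 2425 [celery.app.trace] "
    "[user None] [ip None] trace.py:128 - " + MESSAGE
)

same_event = event_key(edx_log_line) == event_key(supervisor_line)
```

The key deliberately drops the date and the milliseconds, since only one of the two formats carries each of them. Matching on second-level time, PID and message should be enough for spot checks like this one, but it is not a guaranteed unique identifier.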
A search for "Task lms.djangoapps.gating.tasks.task_evaluate_subsection_completion_milestones" "succeeded" faceted by dirname and filename shows about equal numbers of results for /edx/var/log/supervisor/lms_default_<N>-stderr.log and /edx/var/log/lms/edx.log. However, looking at
(dirname:/edx/var/log/supervisor filename:lms_default_*-stderr.log) OR (dirname:/edx/var/log/lms filename:edx.log*) celery.app.trace
for various periods shows varying ratios of results, with very roughly 1.5x as many in the supervisor logs. I'm really not clear on what's getting logged in there and why there's a discrepancy in the counts for this subset of logs.
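If it helps to reproduce the ratio outside the Datadog UI, here is a small Python sketch that buckets exported log records by the same dirname/filename patterns as the query above and computes supervisor-to-edx.log counts. The record shape (dicts with dirname and filename keys) and the names source_of and supervisor_ratio are assumptions for illustration. The sample records are invented placeholders, not real counts.

```python
from collections import Counter
from fnmatch import fnmatchcase


def source_of(dirname, filename):
    """Bucket a record using the same patterns as the search query."""
    if dirname == "/edx/var/log/supervisor" and fnmatchcase(
        filename, "lms_default_*-stderr.log"
    ):
        return "supervisor"
    if dirname == "/edx/var/log/lms" and fnmatchcase(filename, "edx.log*"):
        return "edx.log"
    return None  # not part of this comparison


def supervisor_ratio(records):
    """Return count(supervisor) / count(edx.log), or None if no edx.log hits."""
    counts = Counter(source_of(r["dirname"], r["filename"]) for r in records)
    if counts["edx.log"] == 0:
        return None
    return counts["supervisor"] / counts["edx.log"]


# Invented placeholder records, only to show the shape of the input.
sample_records = [
    {"dirname": "/edx/var/log/supervisor", "filename": "lms_default_9-stderr.log"},
    {"dirname": "/edx/var/log/supervisor", "filename": "lms_default_3-stderr.log"},
    {"dirname": "/edx/var/log/supervisor", "filename": "lms_default_1-stderr.log"},
    {"dirname": "/edx/var/log/lms", "filename": "edx.log"},
    {"dirname": "/edx/var/log/lms", "filename": "edx.log.1"},
    {"dirname": "/edx/var/log/supervisor", "filename": "lms_high_mem_1-stderr.log"},
]

ratio = supervisor_ratio(sample_records)
```

Running this per time window on a real export would show whether the ratio is stable or drifts, which might narrow down what extra content ends up in the supervisor logs.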
From discussion:
Ticket title and description updated to match findings.
The edxapp logs seem to be duplicated between edx.log and the supervisor logs in Datadog. This causes confusion and increases costs.
AC:
service:edx-edxapp-lms in DD.
See example log message.
Also see https://2u-internal.atlassian.net/browse/GSRE-1543?focusedCommentId=4921284 for some related problems.