robrap opened 1 month ago
@timmc-edx: I wanted to bring your attention to Alex's note from Slack:
> Edit: ah right, it's in the edx-edxapp-lms-workers APM. But why is the `apply_async` operation associated with the worker? Isn't the webapp the thing that calls `apply_async()` and does the redis operation to put the task and its args/kwargs on the queue?
This might be additional info about things that are off on the spans you will be looking into. If not, you can ignore as far as this ticket is concerned, other than to report back your findings eventually. Thanks.
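For context, a minimal sketch of where the two span types come from (module, broker URL, and function names below are made up, not from edx-platform): the producer side of `apply_async()` is what shows up as a `celery.apply` span, and task execution on the worker shows up as `celery.run`, which is why `celery.apply` was expected on the webapp rather than on the workers.

```python
# Sketch only: app name, broker URL, task, and caller are illustrative,
# not taken from edx-platform.
from celery import Celery

app = Celery("sketch", broker="redis://localhost:6379/0")

@app.task
def transmit_data(user_id):
    # Executed on a worker process; with tracing enabled this is the
    # `celery.run` span, on the worker service.
    return user_id

def enqueue(user_id):
    # The enqueue itself is the `celery.apply` span. It lands on whichever
    # service calls apply_async(): the webapp for a web request, or the
    # worker service when one task kicks off another.
    transmit_data.apply_async(args=[user_id])
```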
On prod LMS for the past 2 days, all `celery.apply` resource names that at least sometimes are top-level [query], in descending order by frequency:
[A] integrated_channels.integrated_channel.tasks.transmit_single_subsection_learner_data
[S] openedx.core.djangoapps.credentials.tasks.v1.tasks.send_grade_to_credentials
[S] integrated_channels.integrated_channel.tasks.transmit_single_learner_data
[A] openedx.core.djangoapps.programs.tasks.award_program_certificates
[S] openassessment.workflow.tasks.update_workflow_for_submission_task
[S] lms.djangoapps.gating.tasks.task_evaluate_subsection_completion_milestones
[S] openedx.core.djangoapps.programs.tasks.award_course_certificate
[S] openassessment.workflow.tasks.update_workflows_for_ora_block_task
[S] lms.djangoapps.certificates.tasks.generate_certificate
[A] openassessment.workflow.tasks.update_workflows_for_course_task
[S] openedx.core.djangoapps.notifications.tasks.send_notifications
[A] lms.djangoapps.grades.tasks.compute_grades_for_course_v2
[S] lms.djangoapps.grades.tasks.recalculate_subsection_grade_v3
[S] lms.djangoapps.bulk_email.tasks.send_course_email
[A] openedx.core.djangoapps.programs.tasks.revoke_program_certificates
Key: [A] = always top-level; [S] = sometimes top-level, sometimes child
Filtering on the most common top-level celery.apply over what I hope is a representative smaller time window in prod (no spikes):
`env:prod resource_name:integrated_channels.integrated_channel.tasks.transmit_single_subsection_learner_data`

- `@_top_level:*` (local root)
  - `operation_name:celery.run service:edx-edxapp-lms-workers`: 35,065
  - `operation_name:celery.apply service:edx-edxapp-lms-workers`: 3,104
- `-@_top_level:*` (child)
  - `operation_name:celery.apply service:edx-edxapp-lms-workers`: 31,711
  - `operation_name:celery.apply service:edx-edxapp-lms`: 30

So we have:
- `celery.apply` on the webapp is rare; the total on the workers is about 3 orders of magnitude higher. Probably mostly kicked off as a result of a different celery task.
- No `celery.run` on the webapp, as expected since it is called with `apply_async`.
- On the workers, `celery.apply` appears both as a child of `celery.run` and as a local root.
- `celery.apply` is about 10x more common as a child span than as a root span. (Spot check: the parent is `lms.djangoapps.grades.tasks.recalculate_subsection_grade_v3`.)
- The combined `celery.apply` count on the workers (root + child) matches the `celery.run` count to within 1%.
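The arithmetic behind those bullets, as a quick sanity check (counts copied from the query results above):

```python
# Counts from the no-spikes window above.
run_root = 35_065      # celery.run, local root, workers
apply_root = 3_104     # celery.apply, local root, workers
apply_child = 31_711   # celery.apply, child, workers
apply_webapp = 30      # celery.apply, webapp

apply_workers = apply_root + apply_child            # 34,815
print(apply_workers / apply_webapp)                 # ~1160x: ~3 orders of magnitude
print(apply_child / apply_root)                     # ~10x more common as a child
print(abs(run_root - apply_workers) / run_root)     # ~0.007: within 1% of celery.run
```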
However, over a different time period that contained some spikes:
`env:prod resource_name:integrated_channels.integrated_channel.tasks.transmit_single_subsection_learner_data`

- `@_top_level:*` (local root)
  - `operation_name:celery.run service:edx-edxapp-lms-workers`: 1,179
  - `operation_name:celery.apply service:edx-edxapp-lms-workers`: 937
- `-@_top_level:*` (child)
  - `operation_name:celery.apply service:edx-edxapp-lms-workers`: 1,470
  - `operation_name:celery.apply service:edx-edxapp-lms`: 1
Here, the `celery.apply` count is over twice the `celery.run` count, and `celery.apply` is about as common as a child span as it is as a root span.
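The same arithmetic for the spike window, for comparison:

```python
# Counts from the window that contained spikes.
run_root = 1_179
apply_root = 937
apply_child = 1_470

apply_workers = apply_root + apply_child   # 2,407
print(apply_workers / run_root)            # ~2.04: over twice the celery.run count
print(apply_child / apply_root)            # ~1.6: child and root much closer than the 10x above
```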
The transmit task has a recalculate span as parent. Analysis of this relationship over a recent time period:
- a = `env:prod service:edx-edxapp-lms-workers resource_name:lms.djangoapps.grades.tasks.recalculate_subsection_grade_v3`
- b = `env:prod service:edx-edxapp-lms-workers resource_name:integrated_channels.integrated_channel.tasks.transmit_single_subsection_learner_data`
Where both appear in the same trace with a direct relationship (`a => b`), the parent is always `celery.run` and the child is `celery.apply`.

I tried to count traces where `a` appears without `b` as a descendant (`a && NOT(a => b)`), but Datadog has a bug that is making it silently return zero results. So instead, we'll have to look for traces with `a` and `b` missing/present anywhere and assume that when present, it's always parent/child.

- `a && b`: 1.58k
- `a && NOT(b)`: 6.62k
- `NOT(a) && b`: 4.58k
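A rough read of those overlap figures (treating the 1.58k/6.62k/4.58k numbers as trace counts):

```python
a_and_b = 1_580   # traces containing both a and b
a_not_b = 6_620   # traces containing a but not b
b_not_a = 4_580   # traces containing b but not a

print(a_not_b / (a_and_b + a_not_b))   # ~0.81 of a-traces have no b
print(b_not_a / (a_and_b + b_not_a))   # ~0.74 of b-traces have no a
```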
This could mean that `a` doesn't always call `b`, or that `b` is being orphaned some of the time. Unclear.

Filed https://help.datadoghq.com/hc/en-us/requests/1877349 ("Orphaned spans on celery worker") with Datadog Support.
Answering the question "do other celery workers have this problem...": First, find the services that have `celery.run` service entry spans, which are therefore probably celery workers (service entry spans, by service, matching `env:prod operation_name:celery.run`). Then look at the other service entry spans on those services:

`env:prod -operation_name:celery.run service:(a OR b OR ...)`
Results, filtered down to edX services:
| Service | Operation names |
|---|---|
| edx-edxapp-lms-workers | celery.apply, has_access, get_course.split_modulestore, enqueue_send_task, mako.template.render_unicode |
| edx-edxapp-cms-workers | celery.apply, get_course.split_modulestore, django.template.render |
| edx-ai-translations-worker | celery.apply |
| license-manager-worker | celery.apply |
| edx-video-encode-manager-worker | celery.apply |
So edxapp has multiple kinds of top-level spans, but the other workers have at most `celery.apply`. (That doesn't necessarily mean that edxapp is different, though, other than having a wide variety of spans that can be orphaned.)
We're seeing orphaned spans on several of our Celery workers, identifiable as service entry spans that are not `operation_name:celery.run`.

We noticed this because of missing code owner on root spans. We've restricted our monitors to just `celery.run` root spans, but we still want to fix this because about 10% of traces are broken.

Filed https://help.datadoghq.com/hc/en-us/requests/1877349 ("Orphaned spans on celery worker") with Datadog Support. Update: Their proposed fix: https://github.com/DataDog/dd-trace-py/pull/11272