edx / edx-arch-experiments

A plugin to include applications under development by the architecture team at edx
GNU Affero General Public License v3.0
0 stars 3 forks source link

Fix orphaned spans on Celery workers #822

Open robrap opened 1 month ago

robrap commented 1 month ago

We're seeing orphaned spans on several of our Celery workers, identifiable as service entry spans that are not operation_name:celery.run:

Service Operation names
edx-edxapp-lms-workers celery.apply, has_access, get_course.split_modulestore, enqueue_send_task, mako.template.render_unicode
edx-edxapp-cms-workers celery.apply, get_course.split_modulestore, django.template.render
edx-ai-translations-worker celery.apply
license-manager-worker celery.apply
edx-video-encode-manager-worker celery.apply

We noticed this because of missing code owner on root spans. We've restricted our monitors to just celery.run root spans, but we still want to fix this because about 10% of traces are broken.

Filed https://help.datadoghq.com/hc/en-us/requests/1877349 ("Orphaned spans on celery worker") with Datadog Support. Update: Their proposed fix: https://github.com/DataDog/dd-trace-py/pull/11272

robrap commented 1 month ago

@timmc-edx: I wanted to bring your attention to Alex's note from Slack:

Edit: ah right, it's in the edx-edxapp-lms-workers APM. But why is the apply_async operation associated with the worker? Isn't the webapp the thing that calls apply_async() and does the redis operation to put the task and its args/kwargs on the queue?

This might be additional info about things that are off on the spans you will be looking into. If not, you can ignore as far as this ticket is concerned, other than to report back your findings eventually. Thanks.

timmc-edx commented 1 month ago

Numbers across resources

On prod LMS for the past 2 days, all celery.apply resource names that at least sometimes are top-level [query] in descending order by frequency:

[A] integrated_channels.integrated_channel.tasks.transmit_single_subsection_learner_data
[S] openedx.core.djangoapps.credentials.tasks.v1.tasks.send_grade_to_credentials
[S] integrated_channels.integrated_channel.tasks.transmit_single_learner_data
[A] openedx.core.djangoapps.programs.tasks.award_program_certificates
[S] openassessment.workflow.tasks.update_workflow_for_submission_task
[S] lms.djangoapps.gating.tasks.task_evaluate_subsection_completion_milestones
[S] openedx.core.djangoapps.programs.tasks.award_course_certificate
[S] openassessment.workflow.tasks.update_workflows_for_ora_block_task
[S] lms.djangoapps.certificates.tasks.generate_certificate
[A] openassessment.workflow.tasks.update_workflows_for_course_task
[S] openedx.core.djangoapps.notifications.tasks.send_notifications
[A] lms.djangoapps.grades.tasks.compute_grades_for_course_v2
[S] lms.djangoapps.grades.tasks.recalculate_subsection_grade_v3
[S] lms.djangoapps.bulk_email.tasks.send_course_email
[A] openedx.core.djangoapps.programs.tasks.revoke_program_certificates

Key: [A] = always top-level; [S] = sometimes top-level, sometimes child

Numbers drilldown

Filtering on the most common top-level celery.apply over what I hope is a representative smaller time window in prod (no spikes):

So we have:

However, over a different time period that contained some spikes:

Here, the celery.apply count is over twice the celery.run count, and the celery.apply is about as common as child as root.

Trace-level analysis

The transmit task has a recalculate span as parent. Analysis of this relationship over a recent time period:

a = env:prod service:edx-edxapp-lms-workers resource_name:lms.djangoapps.grades.tasks.recalculate_subsection_grade_v3
b = env:prod service:edx-edxapp-lms-workers resource_name:integrated_channels.integrated_channel.tasks.transmit_single_subsection_learner_data
timmc-edx commented 1 month ago

Filed https://help.datadoghq.com/hc/en-us/requests/1877349 ("Orphaned spans on celery worker") with Datadog Support.

timmc-edx commented 1 month ago

Answering the question "do other celery workers have this problem...

Results, filtered down to edX services:

Service Operation names
edx-edxapp-lms-workers celery.apply, has_access, get_course.split_modulestore, enqueue_send_task, mako.template.render_unicode
edx-edxapp-cms-workers celery.apply, get_course.split_modulestore, django.template.render
edx-ai-translations-worker celery.apply
license-manager-worker celery.apply
edx-video-encode-manager-worker celery.apply

So edxapp has multiple kinds of top-level spans, but the other workers have at most celery.apply. (That doesn't necessarily mean that edxapp is different, though, other than having a wide variety of spans that can be orphaned.)