celery / django-celery-results

Celery result back end with django
Other
684 stars 203 forks source link

Only update the ChordCounter.count field when saving #347

Closed orf closed 1 year ago

orf commented 1 year ago

With the current implementation the following SQL query will be continually executed on every task:

UPDATE "django_celery_results_chordcounter" 
SET "group_id" = 'f81ac317-eb61-42a6-a750-7cc1feb78b8c', 
"sub_tasks" = '[[["6a2f07d8-6b6f-4286-9d71-dbead46e9b1c", null], null], ..', 
"count" = 122 
WHERE "django_celery_results_chordcounter"."id" = 118

If you are using a large number of sub-tasks, ranging in the thousands or tens of thousands, then continually sending the sub_tasks in every update can be very expensive.

If we use update_fields then we can skip this.

orf commented 1 year ago

FYI we deployed this change and saw an immediate drop in our write throughput,

Screenshot 2022-10-20 at 15 15 55

network traffic:

Screenshot 2022-10-20 at 15 16 13

and write IOPs.

auvipy commented 1 year ago

can you confirm this won't create any regression?

orf commented 1 year ago

Is it ever possible to confirm that it won’t cause any regression?

All I can say is that the current code doesn’t need to update anything other than the count column, and that I’ve validated this on complex real world workflows and it works as expected.

One of our larger chords results in a ~7mb JSON string being updated every time “save” is called, which may be multiple times per second. Our internal monitoring showed Postgres triggered about ~4,700 “internal” row updates and deletes to complete this, which causes significant stress on the various IO subsystems within PG.

Throughout the execution of all of these queries, the JSON string remained the same.

auvipy commented 1 year ago

yup legit, make sense. thanks a lot