GrafeasGroup / blossom

The website. The app. The everything.
6 stars 3 forks source link

Simplify transcribot queue query #138

Closed TimJentzsch closed 3 years ago

TimJentzsch commented 3 years ago

The query to get the queue for u/transcribot is currently very inefficient and causes big latency spikes: Latency spike of 423 ms

The query is defined here. It currently compiles to the following:

SELECT 
  "api_submission"."id", 
  "api_submission"."original_id", 
  "api_submission"."create_time", 
  "api_submission"."last_update_time", 
  "api_submission"."redis_id", 
  "api_submission"."claimed_by_id", 
  "api_submission"."completed_by_id", 
  "api_submission"."claim_time", 
  "api_submission"."complete_time", 
  "api_submission"."source_id", 
  "api_submission"."url", 
  "api_submission"."tor_url", 
  "api_submission"."archived", 
  "api_submission"."content_url", 
  "api_submission"."removed_from_queue", 
  "api_submission"."cannot_ocr" 
FROM 
  "api_submission" 
  LEFT OUTER JOIN "api_transcription" ON (
    "api_submission"."id" = "api_transcription"."submission_id"
  ) 
WHERE 
  (
    NOT "api_submission"."cannot_ocr" 
    AND "api_submission"."id" IN (
      SELECT 
        U0."id" 
      FROM 
        "api_submission" U0 
        INNER JOIN "api_transcription" U1 ON (U0."id" = U1."submission_id") 
      WHERE 
        U1."author_id" = % s
    ) 
    AND NOT "api_submission"."removed_from_queue" 
    AND "api_submission"."source_id" = % s 
    AND "api_transcription"."original_id" IS NULL
  ) 
LIMIT 
  10

This could be simplified to something like this:

SELECT 
  "api_submission"."id", 
  "api_submission"."original_id", 
  "api_submission"."create_time", 
  "api_submission"."last_update_time", 
  "api_submission"."redis_id", 
  "api_submission"."claimed_by_id", 
  "api_submission"."completed_by_id", 
  "api_submission"."claim_time", 
  "api_submission"."complete_time", 
  "api_submission"."source_id", 
  "api_submission"."url", 
  "api_submission"."tor_url", 
  "api_submission"."archived", 
  "api_submission"."content_url", 
  "api_submission"."removed_from_queue", 
  "api_submission"."cannot_ocr" 
FROM 
  "api_submission"
  INNER JOIN "api_transcription" ON (
    "api_submission"."id" = "api_transcription"."submission_id"
  ) 
WHERE 
  (
    NOT "api_submission"."cannot_ocr"
    AND "api_transcription"."author_id" = % s
    AND NOT "api_submission"."removed_from_queue" 
    AND "api_submission"."source_id" = % s 
    AND "api_transcription"."original_id" IS NULL
  ) 
LIMIT 
  10

I think we can just replace

id__in=Submission.objects.filter(transcription__author=transcribot),

by

transcription__author=transcribot

if I understand the syntax correctly.