Shopify / job-iteration

Makes your background jobs interruptible and resumable by design.
https://www.rubydoc.info/gems/job-iteration
MIT License
1.15k stars 45 forks source link

ActiveRecordCursors using multiple columns should use composite values rather than OR + AND #488

Open yjukaku opened 4 months ago

yjukaku commented 4 months ago

Hello, thanks for your work on this gem.

Summary

I noticed a performance issue when using maintenance_tasks and multiple columns for a cursor and fetching pages of records. The issue is with how the 'next page' query is written in ActiveRecordCursor, which causes Postgres to do inefficient data fetches and is over 100x slower than it could be.

Background

Our table requests has an index on created_at, uuid like this:

CREATE INDEX idx ON requests USING btree (created_at, uuid)

and our maintenance task, when defined like this:

class OurTask < MaintenanceTasks::Task
  def cursor_columns
    [:created_at, :uuid]
  end

  def collection
    Request.all
  end
end

will trigger queries for pages like this:

SELECT *
FROM requests 
WHERE ( created_at > '2023-01-01 00:00:00' OR ( created_at = '2023-01-01 00:00:00' AND ( uuid > 'aaaaaaaa-aaaa-aaaa-b46e-38cd3572ec76' ) ) ) 
ORDER BY created_at, uuid 
LIMIT 1000

This query does not perform as expected in Postgres, because of the OR. See related SO question here - the gist of it is that Postgres doesn't use an index condition but rather does a filter, which results in reading records from disk rather than using the index values.

In our table of about 30 million records, fetching a page of 1000 records was taking around 140s.

Solution

My proposed solution is to change how we construct the query here to use what Postgres calls "Composite Type Comparison", which I believe has wide support across DBMS. The query would look like:

SELECT *
FROM requests 
WHERE ( created_at, uuid) > ('2023-01-01 00:00:00', 'aaaaaaaa-aaaa-aaaa-b46e-38cd3572ec76') 
ORDER BY created_at, uuid 
LIMIT 1000

This should be functionally equivalent and also causes Postgres to use only the index to determine the specific set of records to fetch from disk.

This query returns the page of 1000 records in under 1s.

Would you accept a PR to change this (whether for Postgres only, or for all DBs)?

Mangara commented 2 months ago

Hi @yjukaku Thanks for the detailed report!

Given the massive performance difference for Postgres, it makes sense to me to add this as a configuration option. Maybe we can even enable it by default on Postgres? For MySQL I'm seeing slightly better performance with the current queries, so I'm hesitant to change them for everyone.