fatkodima / sidekiq-iteration

Make your long-running sidekiq jobs interruptible and resumable.
https://rubydoc.info/gems/sidekiq-iteration
MIT License

Using (id = x) or (id < x) is slow compared to (id <= x) #9

Closed sobrinho closed 5 months ago

sobrinho commented 5 months ago

Hi there!

We have a very large table (over 2,000,000,000 rows) and check that:

# explain analyze SELECT "paper_trail"."versions"."id" FROM "paper_trail"."versions" WHERE ((id = 278548164) OR (id < 278548164)) ORDER BY "paper_trail"."versions"."id" DESC LIMIT 10000;
                                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..398.87 rows=10000 width=8) (actual time=39187.250..39503.535 rows=10000 loops=1)
   ->  Index Only Scan Backward using versions_pkey on versions  (cost=0.57..9798836.89 rows=246018089 width=8) (actual time=39187.249..39502.581 rows=10000 loops=1)
         Filter: ((id = 278548164) OR (id < 278548164))
         Rows Removed by Filter: 2130806
         Heap Fetches: 425110
 Planning Time: 0.108 ms
 Execution Time: 39504.187 ms
(7 rows)

# explain analyze SELECT "paper_trail"."versions"."id" FROM "paper_trail"."versions" WHERE ((id = 278548164) OR (id < 278548164)) ORDER BY "paper_trail"."versions"."id" DESC LIMIT 10000;
                                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..405.30 rows=10000 width=8) (actual time=46282.322..47514.757 rows=10000 loops=1)
   ->  Index Only Scan Backward using versions_pkey on versions  (cost=0.57..9902842.46 rows=244675452 width=8) (actual time=46282.321..47513.382 rows=10000 loops=1)
         Filter: ((id = 278548164) OR (id < 278548164))
         Rows Removed by Filter: 2132216
         Heap Fetches: 492759
 Planning Time: 0.707 ms
 Execution Time: 47515.678 ms
(7 rows)

However, if I use <= instead, it runs quite fast. (Note that in the plans above the OR predicate shows up only as a Filter, so the backward index scan reads and discards over two million rows, while the <= predicate below is pushed down as an Index Cond:)

# explain analyze SELECT "paper_trail"."versions"."id" FROM "paper_trail"."versions" WHERE (id <= 278548164) ORDER BY "paper_trail"."versions"."id" DESC LIMIT 10000;
                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..385.22 rows=10000 width=8) (actual time=0.011..5.137 rows=10000 loops=1)
   ->  Index Only Scan Backward using versions_pkey on versions  (cost=0.57..9417311.25 rows=244829902 width=8) (actual time=0.010..4.455 rows=10000 loops=1)
         Index Cond: (id <= 278548164)
         Heap Fetches: 3530
 Planning Time: 0.111 ms
 Execution Time: 5.579 ms
(6 rows)

# explain analyze SELECT "paper_trail"."versions"."id" FROM "paper_trail"."versions" WHERE (id <= 278548164) ORDER BY "paper_trail"."versions"."id" DESC LIMIT 10000;
                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..385.22 rows=10000 width=8) (actual time=0.010..5.532 rows=10000 loops=1)
   ->  Index Only Scan Backward using versions_pkey on versions  (cost=0.57..9417311.25 rows=244829902 width=8) (actual time=0.009..4.848 rows=10000 loops=1)
         Index Cond: (id <= 278548164)
         Heap Fetches: 3530
 Planning Time: 0.083 ms
 Execution Time: 5.959 ms
(6 rows)

The code calling iteration is quite simple:

    def build_enumerator(cursor:)
      active_record_relations_enumerator(
        MyModel,
        cursor:,
        batch_size: 10_000,
        order: :desc,
        columns: [:id]
      )
    end

Why are we doing (id = x) OR (id < x) instead of (id <= x)?
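
For illustration, the kind of predicate we would expect is a single inclusive comparison. A minimal Arel sketch (the helper name is hypothetical, not the gem's actual code):

    # Hypothetical helper: build the cursor condition as one inclusive
    # comparison so Postgres can push it into the Index Cond.
    def cursor_condition(column, value, order)
      node = MyModel.arel_table[column]
      order == :desc ? node.lteq(value) : node.gteq(value)
    end

    # MyModel.where(cursor_condition(:id, 278_548_164, :desc))
    #   => ... WHERE "versions"."id" <= 278548164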

sobrinho commented 5 months ago

Forgot to mention, Postgres 15.4.

sobrinho commented 5 months ago

Current workaround:

    def build_enumerator(cursor:)
      Enumerator.new do |yielder|
        MyModel.in_batches(of: 10_000, order: :desc, start: cursor) do |relation|
          yielder.yield(relation, relation.minimum(:id))
        end
      end
    end
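
For completeness, here is how that enumerator fits into a full job (a sketch; the class name and the each_iteration body are placeholders, and it assumes sidekiq and sidekiq-iteration are loaded):

    class BackfillJob
      include Sidekiq::Job
      include SidekiqIteration::Iteration

      def build_enumerator(cursor:)
        Enumerator.new do |yielder|
          MyModel.in_batches(of: 10_000, order: :desc, start: cursor) do |relation|
            # Yield [batch, cursor]; after an interruption the job resumes
            # with the last yielded cursor (the smallest id in the batch).
            yielder.yield(relation, relation.minimum(:id))
          end
        end
      end

      def each_iteration(relation)
        relation.delete_all # placeholder work
      end
    end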
fatkodima commented 5 months ago

Thank you for such good reports! This and the other issue are already fixed on master. Can you verify that it works, so I can release a new version soon?

sobrinho commented 5 months ago

@fatkodima is there a reason to have a custom implementation like this instead of using Rails' find_each/in_batches/find_in_batches?

fatkodima commented 5 months ago

The main reason was that Rails batches do not support iterating over non-primary-key columns.

It might be a good idea to add support for iterating over custom columns to Rails itself.
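
For example, building on the enumerator options shown earlier, a cursor over a composite, non-primary-key column set might look like this (a sketch; the columns are illustrative):

    def build_enumerator(cursor:)
      active_record_relations_enumerator(
        MyModel,
        cursor: cursor,
        batch_size: 10_000,
        columns: [:updated_at, :id] # composite cursor, not just the PK
      )
    end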

sobrinho commented 5 months ago
[Screenshot 2024-05-10 at 11:12:49]

Definitely working better than Rails' find_each. You can see the drop in iowait; it falls exactly after we deployed the master version.

fatkodima commented 5 months ago

Nice! Please confirm that the other bug is also fixed, and I will release a new version then.

sobrinho commented 5 months ago
[Screenshot 2024-05-10 at 12:31:01]

CPU usage is high again; not sure why. Maybe it's expected?

sobrinho commented 5 months ago

By the other bug, do you mean the explicit primary key column? If so, we removed it and it works as expected now.

sobrinho commented 5 months ago

At this point I think there's something wrong with our database. I'm checking with our provider; the gem itself seems okay now.

fatkodima commented 5 months ago

Released a new version. Thanks again for the reports!

sobrinho commented 5 months ago

We have a huge table, and neither find_each nor the sidekiq-iteration implementation was iterating well enough (the investigation with our Postgres provider is still ongoing), but for reference, we ended up doing this:

    def build_enumerator(cursor:)
      # Start from the sequence's current value when no cursor is given.
      cursor ||= MyModel.connection.select_value(Arel::Table.new(MyModel.sequence_name).project("last_value"))

      Enumerator.new do |yielder|
        # Walk the id space downwards in fixed-size windows.
        cursor.step(1, -BATCH_SIZE) do |max|
          min = [max - BATCH_SIZE + 1, 1].max

          # id BETWEEN min AND max, which Postgres can use as an Index Cond.
          yielder.yield MyModel.where(id: min..max), min
        end
      end
    end

We are iterating backwards, but you could go forwards if needed. We are using id BETWEEN [min] AND [max] to achieve better performance.

The gotcha is that each_iteration might get called with no records, or fewer records than BATCH_SIZE, but we don't care about that in our use case.
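
If it did matter, a minimal guard at the top of each_iteration would be enough (sketch):

    def each_iteration(relation)
      # The id window may contain no rows (deleted ids, gaps in the sequence).
      return if relation.none?

      # ... actual per-batch work ...
    end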

fatkodima commented 5 months ago

You can probably use use_ranges: true from Rails' batch iteration; it should be the fastest and requires less custom code.
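
Something like this (a sketch; use_ranges was added to in_batches in Rails 7.1):

    # With use_ranges: true, in_batches constrains each yielded relation
    # by a primary-key range instead of an IN list of ids.
    MyModel.in_batches(of: 10_000, use_ranges: true) do |relation|
      relation.update_all(processed: true) # placeholder work
    end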

fatkodima commented 5 months ago

I will probably add that option to the gem too.

sobrinho commented 5 months ago

When we use it like this, use_ranges defaults to true and the performance bottleneck still happens:

MyModel.in_batches(of: 10_000, order: :desc, start: cursor) do |relation|
  yielder.yield(relation, relation.minimum(:id))
end

Our provider is investigating why that is, but using BETWEEN has worked the best so far.

fatkodima commented 5 months ago

Yeah, BETWEEN is the simplest and fastest. The bottleneck may come from the relation.minimum(:id) call per iteration, and https://github.com/rails/rails/pull/51243 may also be relevant.
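
One way to avoid that extra per-batch query is to pluck the ids once and derive the cursor in memory (a sketch, dropped into the Enumerator from the earlier workaround; it trades an extra query for holding the batch's ids in memory):

    MyModel.in_batches(of: 10_000, order: :desc, start: cursor) do |relation|
      ids = relation.pluck(:id) # one round trip for this batch
      yielder.yield(MyModel.where(id: ids), ids.min)
    end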

sobrinho commented 5 months ago

@fatkodima just to drop a note, we discussed this in other places as well, pluck_in_batches among them.

For Rails, maybe we could have an object like this:

MyModel.in_batches(of: 10_000, order: :desc, start: cursor) do |relation, batch_meta_object|
  batch_meta_object.ids    # ids from memory
  batch_meta_object.min_id # min id from memory
  batch_meta_object.max_id # max id from memory
end

But that's out of scope for this issue. Let's consider this done at this point.