sobrinho closed this issue 5 months ago
Forgot to mention, Postgres 15.4.
Current workaround:

```ruby
def build_enumerator(cursor:)
  Enumerator.new do |yielder|
    MyModel.in_batches(of: 10_000, order: :desc, start: cursor) do |relation|
      yielder.yield(relation, relation.minimum(:id))
    end
  end
end
```
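For reference, the contract this enumerator follows (yield a batch together with a cursor) can be sketched in plain Ruby, with arrays of ids standing in for relations; `batched_ids` is a hypothetical name for illustration, not part of the gem:

```ruby
# Plain-Ruby sketch of the [batch, cursor] pairs the workaround yields;
# arrays of ids stand in for ActiveRecord relations.
def batched_ids(ids, batch_size)
  Enumerator.new do |yielder|
    ids.sort.reverse.each_slice(batch_size) do |batch|
      # The cursor is the smallest id of the batch, mirroring
      # relation.minimum(:id) in the real code.
      yielder.yield(batch, batch.min)
    end
  end
end

p batched_ids((1..7).to_a, 3).to_a
# → [[[7, 6, 5], 5], [[4, 3, 2], 2], [[1], 1]]
```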
Thank you for such good reports! This and the other issue are already fixed on master. Can you verify that it works, so I can release a new version soon?
@fatkodima is there a reason to have a custom implementation like this instead of using Rails' find_each/in_batches/find_in_batches?
The main reason was that Rails batches do not support iterating over non-primary-key columns.
It might be a good idea to add support for iterating over custom columns to Rails itself.
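To make the limitation concrete, here is a minimal in-memory sketch of keyset pagination over a non-primary-key column (the rows, the `:created_day` column, and `each_batch_by` are all made up for illustration); a `(column, id)` tuple cursor keeps rows with duplicate column values from being skipped:

```ruby
# In-memory sketch of keyset pagination over a custom column.
# Rows are hashes; :created_day stands in for a non-PK column.
ROWS = [
  { id: 1, created_day: 3 },
  { id: 2, created_day: 1 },
  { id: 3, created_day: 2 },
  { id: 4, created_day: 5 },
].freeze

def each_batch_by(rows, column, batch_size)
  sorted = rows.sort_by { |r| [r[column], r[:id]] }
  cursor = nil
  loop do
    batch = sorted.select { |r|
      cursor.nil? || ([r[column], r[:id]] <=> cursor).positive?
    }.first(batch_size)
    break if batch.empty?
    yield batch
    last = batch.last
    cursor = [last[column], last[:id]] # tuple cursor avoids skipping ties
  end
end

batches = []
each_batch_by(ROWS, :created_day, 2) { |b| batches << b.map { |r| r[:id] } }
p batches  # → [[2, 3], [1, 4]]
```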
It's definitely working better than Rails' find_each: you can see the drop in iowait, exactly after we deployed the master version.
Nice! Please confirm that the other bug is also fixed and I will release a new version then.
CPU usage is high again, not sure why, maybe it's expected?
By the other bug, do you mean the explicit primary key column? If so, we removed it and it works as expected now.
At this point I think there's something wrong with our database; I'm checking with our provider. The gem itself seems okay now.
Released a new version. Thanks again for the reports!
We have a huge table and neither find_each nor the sidekiq-iteration implementation was iterating well enough (the investigation with the Postgres provider is still ongoing), but for reference, we ended up doing this:
```ruby
def build_enumerator(cursor:)
  cursor ||= MyModel.connection.select_value(Arel::Table.new(MyModel.sequence_name).project("last_value"))

  Enumerator.new do |yielder|
    cursor.step(1, -BATCH_SIZE) do |max|
      min = [max - BATCH_SIZE + 1, 1].max
      yielder.yield MyModel.where(id: min..max), min
    end
  end
end
```
We are iterating backwards, but you could go forwards if needed, and we are using `id BETWEEN [min] AND [max]` to achieve better performance. The gotcha is that `each_iteration` might get called with no records, or with fewer records than `BATCH_SIZE`, but we don't care about that in our use case.
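The range stepping above is pure arithmetic, so it can be exercised without a database; a small sketch with made-up numbers:

```ruby
# Step backwards from a cursor, yielding inclusive min..max ranges,
# exactly like the enumerator above (plain Ruby, no ActiveRecord).
def ranges_desc(cursor, batch_size)
  ranges = []
  cursor.step(1, -batch_size) do |max|
    min = [max - batch_size + 1, 1].max
    ranges << (min..max)
  end
  ranges
end

p ranges_desc(10, 3)
# → [8..10, 5..7, 2..4, 1..1]
```

Note the last range covering a single id illustrates the gotcha: a window may contain fewer than `BATCH_SIZE` rows (or none at all, if the ids have gaps).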
You can probably use `use_ranges: true` from Rails' batch iteration; it should be the fastest option and require less custom code. I will probably add that option to the gem too.
When we use it like this, `use_ranges` defaults to true and the performance bottleneck still happens:

```ruby
MyModel.in_batches(of: 10_000, order: :desc, start: cursor) do |relation|
  yielder.yield(relation, relation.minimum(:id))
end
```
Our provider is investigating why that is, but using `BETWEEN` has worked best so far.
Yeah, `BETWEEN` is the simplest and fastest. The bottleneck may be caused by the `relation.minimum(:id)` call per iteration, and https://github.com/rails/rails/pull/51243 may also be relevant.
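For comparison, with the `BETWEEN` approach the next lower bound falls out of arithmetic, so no extra `SELECT MIN(id)` round trip per batch is needed (a sketch with an assumed batch size):

```ruby
BATCH_SIZE = 10_000 # assumed value for illustration

# Lower bound of the current window, clamped at 1; this is the same
# expression the BETWEEN-based enumerator uses, with no extra query.
def window_min(max, batch_size = BATCH_SIZE)
  [max - batch_size + 1, 1].max
end

p window_min(25_000) # → 15001
p window_min(5_000)  # → 1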
@fatkodima just to drop a note: among other things, we discussed `pluck_in_batches` and the like. For Rails, maybe we could have an object like this:
```ruby
MyModel.in_batches(of: 10_000, order: :desc, start: cursor) do |relation, batch_meta_object|
  batch_meta_object.ids    # ids from memory
  batch_meta_object.min_id # min id from memory
  batch_meta_object.max_id # max id from memory
end
```
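A minimal sketch of what such a batch-meta object could look like (`BatchMeta` is hypothetical, assuming the ids were materialized once per batch):

```ruby
# Hypothetical value object: ids are captured once per batch, so
# min/max come from memory instead of extra MIN/MAX queries.
BatchMeta = Struct.new(:ids) do
  def min_id
    ids.min
  end

  def max_id
    ids.max
  end
end

meta = BatchMeta.new([4, 7, 5])
p meta.min_id # → 4
p meta.max_id # → 7
```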
But it's out of scope for this issue here. Let's consider this done at this point.
Hi there!

We have a very large table (over 2,000,000,000 rows) and check that:

Although, if I use `<=` instead, it works quite fast:

The code calling the iteration is quite simple:

Why are we doing `= OR <` instead of `<=`?