composite-primary-keys / composite_primary_keys

Composite Primary Keys support for Active Record
1.03k stars 351 forks source link

Improve query generation for cpk_in_predicate #591

Closed akiellor closed 1 year ago

akiellor commented 1 year ago

Query generation for cpk_in_predicate now constructs queries in the form:

SELECT *
FROM table_name
WHERE low_cardinality_key_part = 1 AND high_cardinality_part IN (1, 2)

It used to generate queries in the form:

SELECT *
FROM table_name
WHERE
  (low_cardinality_key_part = 1 AND high_cardinality_part = 1)
  OR (low_cardinality_key_part = 1 AND high_cardinality_part = 2)

This change improves the queries by reducing the overall length of the query, especially when loading many keys. But more importantly the new query will often result in Postgres performing an Index Scan instead of a Bitmap Heap Scan (assuming the right indices have been added).

codeodor commented 1 year ago

Does it handle it in the case of having low_cardinality_part with a count more than 1?

i.e.,

low_cardinality_part = 1 and high_cardinality_part = 1 or low_cardinality_part = 1 and high_cardinality_part = 2 or low_cardinality_part = 2 and high_cardinality_part = 3

is not the same as

low_cardinality_part in (1, 2) and high_cardinality_part in (1, 2, 3) if we have a table like:

low   high
1      1
1      2 
2      3
2      1

It would pull (2,1) in the latter case but not the former.

akiellor commented 1 year ago

@codeodor it does, but there is no spec. I'll add one.

akiellor commented 1 year ago

@codeodor Actually there is a spec.

  def test_in_with_multiple_primary_key_parts
    dep = Department.arel_table

    primary_keys = [[1, 1], [1, 2], [2, 3], [2, 4]]

    connection = ActiveRecord::Base.connection
    quoted_id_column = "#{connection.quote_table_name('departments')}.#{connection.quote_column_name('id')}"
    quoted_location_id_column = "#{connection.quote_table_name('departments')}.#{connection.quote_column_name('location_id')}"
    expected = "(#{quoted_id_column} = 1 AND #{quoted_location_id_column} IN (1, 2) OR #{quoted_id_column} = 2 AND #{quoted_location_id_column} IN (3, 4))"

    pred = cpk_in_predicate(dep, [:id, :location_id], primary_keys)
    assert_equal(with_quoted_identifiers(expected), pred.to_sql)
  end

Note how the query ends up being structured in a way where the high cardinality parts are grouped by the low cardinality parts and ORed together.

akiellor commented 1 year ago

Thanks for the feedback y'all.

@cfis,

I've refactored the code a little, let me know what you think.

It was definitely possible to do all the work with a single pass of the list. I honestly hadn't given it much thought as the actual query in Postgres was dominating the end to end execution in our app, but this is better.

I went with some different method names, different to your suggestions, but hopefully reflects the intent.

cfis commented 1 year ago

I like this better - thanks for the changes although I'm sure you a right that the query building and execution is where all the time is spent.

Added a second round of comments.

akiellor commented 1 year ago

@cfis, I think this is ready for another round. Thanks for your feedback.

codeodor commented 1 year ago

@akiellor nice! Thanks for pointing that out to me. I missed it on my first pass.

cfis commented 1 year ago

Great - thanks for making this happen!

akiellor commented 1 year ago

@cfis, is this change likely to be backported to all the different ActiveRecord version releases? We are still using ActiveRecord 6.1, so it would be nice to get it there if possible.

cfis commented 1 year ago

Happy to merge a MR if you want to make one.

Also, I'll cut a new master branch release in the next few days.

akiellor commented 1 year ago

I'm not sure of the exact process, I did just open a new PR into the ar_6.1.x branch with a the commits squashed from the previous PR.