Casecommons / pg_search

pg_search builds ActiveRecord named scopes that take advantage of PostgreSQL’s full text search
http://www.casebook.net
MIT License
1.35k stars 370 forks source link

`SELECT DISTINCT` triggers ActiveRecord::StatementInvalid in v1.0.0 #238

Open monfresh opened 9 years ago

monfresh commented 9 years ago

I just upgraded to 1.0.0 from 0.7.9 and I'm now getting a bunch of failures for tests that perform a pg_search query. I looked at the Changelog and the Readme, but I don't see any instructions for upgrading from 0.7.9 to 1.0.0 or any mention of breaking changes, so I assumed everything should still work.

Here is the specific error I am getting:

ActiveRecord::StatementInvalid:
       PG::InvalidColumnReference: ERROR:  for SELECT DISTINCT, ORDER BY expressions must appear in select list
       LINE 1: ...ocations"."id" = pg_search.pg_search_id  ORDER BY pg_search....
                                                                    ^
       : SELECT  DISTINCT "locations".* FROM "locations" INNER JOIN (SELECT "locations"."id" AS pg_search_id, (ts_rank(("locations"."tsv_body"), (to_tsquery('english', ''' ' || 'changeme' || ' ''' || ':*')), 0)) AS rank FROM "locations" WHERE ((("locations"."tsv_body") @@ (to_tsquery('english', ''' ' || 'changeme' || ' ''' || ':*'))))) pg_search ON "locations"."id" = pg_search.pg_search_id  ORDER BY pg_search.rank DESC, "locations"."id" ASC LIMIT 30 OFFSET 0

And here is the query: https://github.com/codeforamerica/ohana-api/blob/master/app/models/concerns/search.rb#L26-35

Thoughts?

nertzy commented 9 years ago

In 1.0.0, I moved the search relevance calculation into a subquery, which I think is what triggered your bug. From this point forward, I plan on following Semantic Versioning 2.0.0, so a change to the MAJOR version represents a breaking change, as in the change from 0.x to 1.x.

It looks like we should add test coverage for the SELECT DISTINCT case to pg_search. I think that is what isn't currently working with the subquery. For now, I think a quick workaround might be to chain .select("pg_search.rank", "locations.id") onto the end of that particular call, or to add those two fields to any existing .select() calls you may have.

Please let me know if that works for you. In the meantime I will try to think of a more generic solution.

I'm going to change the title of this issue so to reflect that it's about SELECT DISTINCT.

Thanks for the bug report!

aaronchi commented 9 years ago

I'm also having an issue with the new subquery. We are using activerecord_any_of to join pg_search scopes with other conditions via OR instead of AND and the subquery breaks the OR logic.

Is there any way to retain the old non-subquery method add an option to use it when desired?

nertzy commented 9 years ago

@aaronchi currently, no. Actually if you could open a separate issue for what you're seeing that would be helpful too so that we can try to add it to the test coverage.

I still haven't found the holy grail of a way to express the complexity of the pg_search query in a way that doesn't clobber other chained scopes. The subquery solution fixes a lot of the longstanding bugs, but adds some new ones as well, apparently.

I'd prefer not to add back the old query style if possible, but I'm still open to the idea.

monfresh commented 9 years ago

@nertzy Thanks for the details and the suggestion. Adding .select("pg_search.rank", "locations.id") to the end of my scope that uses pg_search only fixed a few search queries, but introduced a new error in other queries.

The search function in my API allows various parameters to be passed in, and the errors occur when the keyword parameter (which uses pg_search) is used alone or combined with other parameters.

Searches that only included the keyword parameter were the ones that resulted in the PG::InvalidColumnReference error. When I add your suggested fix to the keyword scope, like this:

scope :keyword, ->(keyword) { keyword_search(keyword).select("pg_search.rank", "locations.id") }

those same tests now fail with a different error:

Failure/Error: get api_search_index_url(keyword: 'food', subdomain: ENV['API_SUBDOMAIN'])
     ActiveModel::MissingAttributeError:
       missing attribute: organization_id

I also discovered another error caused by pg_search 1.0 that isn't resolved by your suggested fix. For a search that includes keyword and category parameters, I get the following error with or without your suggested fix:

Failure/Error: get api_search_index_url(category: 'Jobs', keyword: 'jobs', subdomain: ENV['API_SUBDOMAIN'])
     ActiveRecord::StatementInvalid:
       PG::UndefinedTable: ERROR:  invalid reference to FROM-clause entry for table "categories"
       LINE 1: ...N "locations"."id" = pg_search.pg_search_id WHERE "categorie...
                                                                    ^
       HINT:  Perhaps you meant to reference the table alias "categories_services_2".
       : SELECT  DISTINCT MAX("locations"."updated_at") FROM "locations" INNER JOIN "services" 
"services_locations" ON "services_locations"."location_id" = "locations"."id" INNER JOIN 
"categories_services" "services_categories_services_join" ON 
"services_categories_services_join"."service_id" = "services_locations"."id" INNER JOIN "categories" 
"categories_services_2" ON "categories_services_2"."id" = 
"services_categories_services_join"."category_id" INNER JOIN (SELECT "locations"."id" AS 
pg_search_id, (ts_rank(("locations"."tsv_body"), (to_tsquery('english', ''' ' || 'jobs' || ' ''' || ':*')), 0)) AS rank 
FROM "locations" INNER JOIN "services" ON "services"."location_id" = "locations"."id" INNER JOIN 
"categories_services" ON "categories_services"."service_id" = "services"."id" INNER JOIN "categories" 
ON "categories"."id" = "categories_services"."category_id" WHERE "categories"."name" = 'Jobs' AND 
((("locations"."tsv_body") @@ (to_tsquery('english', ''' ' || 'jobs' || ' ''' || ':*'))))) pg_search ON 
"locations"."id" = pg_search.pg_search_id WHERE "categories"."name" = 'Jobs' LIMIT 30 OFFSET 0

Note that I don't have a table called categories_services_2. That is definitely something new pg_search is doing because in version 0.7.9, the query does not try to use categories_services_2:

SELECT  DISTINCT MAX("locations"."updated_at") FROM "locations" INNER JOIN "services" ON 
"services"."location_id" = "locations"."id" INNER JOIN "categories_services" ON 
"categories_services"."service_id" = "services"."id" INNER JOIN "categories" ON "categories"."id" = 
"categories_services"."category_id" WHERE "categories"."name" = 'Jobs' AND ((("locations"."tsv_body") 
@@ (to_tsquery('english', ''' ' || 'jobs' || ' ''' || ':*')))) LIMIT 30 OFFSET 0

Note how much longer the query is in 1.0 versus 0.7.9.

nertzy commented 9 years ago

@monfresh could you try out the latest master? The fix for #235 might affect this use case as well.

LauraAntuna commented 9 years ago

I am getting the exact same mistake as @monfresh when performing a chained SELECT DISTINCT query, even after updating pg_search to version 1.0.2. The workaround suggested by @nertzy worked, but it would still be good to investigate the error and find a generic solution.

The chaining query that results in the error is Product.find_by_store(store).search_by_name(name), and the workaround was chaining .select("pg_search.rank", "products.*") to it. Here are the details of the model and error message:

class Product < ActiveRecord::Base
  include PgSearch

  has_many :area_category_products
  has_many :area_categories, through: :area_category_products

  pg_search_scope :search_by_name, against: :name,
                                   using: { tsearch: { prefix: true,
                                                       any_word: true } },
                                   ignoring: :accents

  def self.find_by_store(store)
    Product.joins(area_categories: :area).where(areas: { store_id: store.id })
      .uniq
  end
ActiveRecord::StatementInvalid:
       PG::InvalidColumnReference: ERROR:  for SELECT DISTINCT, ORDER BY expressions must appear in select list
       LINE 1: ..._search_id WHERE "areas"."store_id" = 1  ORDER BY pg_search....
                                                                    ^
       : SELECT DISTINCT "products".* FROM "products" INNER JOIN "area_category_products" ON "area_category_products"."product_id" = "products"."id" INNER JOIN "area_categories" ON "area_categories"."id" = "area_category_products"."area_category_id" INNER JOIN "areas" ON "areas"."id" = "area_categories"."area_id" INNER JOIN (SELECT "products"."id" AS pg_search_id, (ts_rank((to_tsvector('simple', unaccent(coalesce("products"."name"::text, '')))), (to_tsquery('simple', ''' ' || unaccent('prod') || ' ''' || ':*')), 0)) AS rank FROM "products" WHERE (((to_tsvector('simple', unaccent(coalesce("products"."name"::text, '')))) @@ (to_tsquery('simple', ''' ' || unaccent('prod') || ' ''' || ':*'))))) pg_search ON "products"."id" = pg_search.pg_search_id WHERE "areas"."store_id" = 1  ORDER BY pg_search.rank DESC, "products"."id" ASC
monfresh commented 9 years ago

@nertzy I wanted to give you an update on this issue from the current master branch. If I don't change my current code, I get a similar error as the one I originally reported, but it's now referencing pg_search_locations.pg_search_id whereas before it was pg_search.pg_search_id:

Failure/Error: get api_search_index_url(keyword: 'jobs', per_page: 1, subdomain: ENV['API_SUBDOMAIN'])
     ActiveRecord::StatementInvalid:
       PG::InvalidColumnReference: ERROR:  for SELECT DISTINCT, ORDER BY expressions must appear in select list
       LINE 1: ..."id" = pg_search_locations.pg_search_id  ORDER BY pg_search_...
                                                                    ^
       : SELECT  DISTINCT "locations".* FROM "locations" INNER JOIN (SELECT "locations"."id" AS pg_search_id, (ts_rank(("locations"."tsv_body"), (to_tsquery('english', ''' ' || 'jobs' || ' ''' || ':*')), 0)) AS rank FROM "locations" WHERE ((("locations"."tsv_body") @@ (to_tsquery('english', ''' ' || 'jobs' || ' ''' || ':*'))))) pg_search_locations ON "locations"."id" = pg_search_locations.pg_search_id  ORDER BY pg_search_locations.rank DESC, "locations"."id" ASC LIMIT 1 OFFSET 0

This means that I need to change your original workaround to .select("pg_search_locations.rank", "locations.id")

It seems like the workaround is now specific to the model where the pg_search_scope is defined.

When I do that, I get the same error I reported in my previous update:

Failure/Error: get api_search_index_url(keyword: 'jobs', per_page: 1, subdomain: ENV['API_SUBDOMAIN'])
     ActiveModel::MissingAttributeError:
       missing attribute: organization_id

If I select all of the columns in the Locations table, as suggested by @LauraAntuna, then all my tests pass:

.select("pg_search_locations.rank", "locations.*")

Another thing I noticed is that the SQL is now longer. In version 0.7.9 of the gem, the SQL generated was:

SELECT DISTINCT "locations".*, ((ts_rank(("locations"."tsv_body"), (to_tsquery('english', ''' ' || 'job' || ' ''' || ':*')), 0))) AS pg_search_rank FROM "locations" WHERE ((("locations"."tsv_body") @@ (to_tsquery('english', ''' ' || 'job' || ' ''' || ':*'))))  ORDER BY pg_search_rank DESC, "locations"."id" ASC

Now, the SQL adds an INNER JOIN when there wasn't one before:

SELECT DISTINCT pg_search_locations.rank, locations.* FROM "locations" INNER JOIN (SELECT "locations"."id" AS pg_search_id, (ts_rank(("locations"."tsv_body"), (to_tsquery('english', ''' ' || 'job' || ' ''' || ':*')), 0)) AS rank FROM "locations" WHERE ((("locations"."tsv_body") @@ (to_tsquery('english', ''' ' || 'job' || ' ''' || ':*'))))) pg_search_locations ON "locations"."id" = pg_search_locations.pg_search_id  ORDER BY pg_search_locations.rank DESC, "locations"."id" ASC

I haven't tested it yet on a large dataset to see if there's a performance impact.

j-dexx commented 9 years ago

I have the same issue using 1.0.4, and @LauraAntuna's solution works for me.

travisdahlke commented 9 years ago

For me, the solution was to add the with_pg_search_rank scope. I wonder if it's possible to detect this automatically when distinct is present.

mateuszpalyz commented 8 years ago

With 1.0.4 I had to add pg_search_series.rank to SELECTand switching to 1.0.5 needs pg_search_9cda38ea3a2bd78cbaecd4.rank, what is this stamp 9cda38ea3a2bd78cbaecd4, does it change between queries?

ikido commented 8 years ago

@mateuszpalyz, just use the with_pg_search_rank scope, like this: Post.search_by_title('some title').with_pg_search_rank.uniq. See scope source here (v.1.0.5) if interested, it just adds proper selects as suggested above. Adding it automatically is definately a good idea!

victorhazbun commented 8 years ago

I'm using version 1.0.6 of this gem and I get: "undefined method `with_pg_search_rank' for #\u003cCompanyFriendshipLocation::ActiveRecord_Relation:0x007f906e728160\u003e"

jchatel commented 8 years ago

+1 .uniq kills the query

mzahidriaz-tr commented 7 years ago

@ikido's solution works perfectly, it should be documented or something.

LilyReile commented 5 years ago

This is still an issue in 2.1.4. Using .uniq is a solid workaround, but it's not a true solution because there is a performance loss when moving the dedupe workload from Postgres to Ruby. Chaining .with_pg_search_rank.distinct fixed everything else, but broke when adding .count to the end of the query:

   ActiveRecord::StatementInvalid:
       PG::SyntaxError: ERROR:  syntax error at or near "AS"
       LINE 1: ...tegories.*, pg_search_a6216ea03e578f212dd604.rank AS pg_sear...
                                                                    ^
       : SELECT COUNT(DISTINCT categories.*, pg_search_a6216ea03e578f212dd604.rank AS pg_search_rank) FROM "categories" INNER JOIN "expert_categories" ON "expert_categories"."category_id" = "categories"."id" INNER JOIN "experts" ON "experts"."id" = "expert_categories"."expert_id" INNER JOIN "cities" ON "cities"."id" = "experts"."city_id" INNER JOIN (SELECT "categories"."id" AS pg_search_id, (ts_rank((setweight(to_tsvector('english', coalesce("categories"."name"::text, '')), 'A') || setweight(to_tsvector('english', coalesce("categories"."search_synonyms"::text, '')), 'B')), (to_tsquery('english', ''' ' || 'painting' || ' ''' || ':*')), 0)) AS rank FROM "categories" WHERE (((setweight(to_tsvector('english', coalesce("categories"."name"::text, '')), 'A') || setweight(to_tsvector('english', coalesce("categories"."search_synonyms"::text, '')), 'B')) @@ (to_tsquery('english', ''' ' || 'painting' || ' ''' || ':*'))))) AS pg_search_a6216ea03e578f212dd604 ON "categories"."id" = pg_search_a6216ea03e578f212dd604.pg_search_id WHERE "experts"."city_id" = $1 AND "experts"."status" = $2

We could use Ruby to count, but we'll continue to live with the .uniq workaround in fear that this would be a footgun to developers using scope for other purposes.

chrislabarge commented 5 years ago

@ikido's solution worked fine!

HusseinElMotayam commented 5 years ago

Another workaround that works well with distinct, count and pg_search is to reorder(''). It looks like the 'order by rank' part is problematic, so if ordering by pg_search's rank isn't required, then this workaround would be a great fit.

Example, Post.search_by_title('some title').reorder('').distinct.count

IgrekYY commented 4 years ago

thx @HusseinElMotayam Post.search_by_title('some title').reorder('').distinct works fine and doesn't kill query like .uniq

GMolini commented 1 year ago

@HusseinElMotayam solution fixed it for me. many thanks

adam-onboardpro commented 1 year ago

Still seeing the same issue if trying to chain multiple search scopes together. Example Post.search_by_title('some title').search_by_body('some text').with_pg_search_rank.distinct will produce #<ActiveRecord::StatementInvalid: PG::InvalidColumnReference: ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list

Looking into a fix, but haven't came up with anything yet. Thanks for any help!