Casecommons / pg_search

pg_search builds ActiveRecord named scopes that take advantage of PostgreSQL’s full text search
http://www.casebook.net
MIT License
1.33k stars 372 forks source link

Strange behaviour when searching string with hyphen / dash #349

Open alpracka opened 7 years ago

alpracka commented 7 years ago

Hi, I've found very similar issue #19 and possibly related to #117

I have a simple setup:

class Product < ActiveRecord::Base
  include PgSearch
  pg_search_scope :search_by_name, against: :name , using: {
    tsearch: {
      prefix: true
    }
  }
end

And these troubles:

Product.create(name: 'Hikvision DS-7604NI-E1/4P/A')
Product.search_by_name("Hikvision DS-7604NI-E1/4P/A").any?
# => true
Product.search_by_name("Hik DS-7604NI").any?
# => true
Product.search_by_name("Hik DS-7604").any?
# => false # true expected
Product.search_by_name("DS-7604").any?
# => false # true expected

I've tried several options but nothing works as expected. I also tried replacing hyphen/dash with space character, then Product.search_by_name("DS 7604").any? # => true but Product.search_by_name("Hikvision DS 7604NI E1/4P/A").any? # => false

Tested with pg 9.6, sorry if it's not related with the gem but I don't know how pg tsearch work yet so trying first shot here. Thanks for help.

kluzny commented 7 years ago

Few possible things, just an amateur myself here:

The hypen behaviour is based on how PG is splitting up the words. PG uses dictionaries to define how to break up the works into tokens or lexemes: https://www.postgresql.org/docs/9.1/static/textsearch-dictionaries.html

By default pg ships with simple, and possibly based on your locale english. Slashes and hypens aren't tokenized, and I think in the simple dictionary they are used as word boundaries. Using a custom dictionary looks like kind of a hassle, but will let you tokenize differently.

You might want to switch to trigram, might give you better results on similarity instead of prefix matching.

vpereira commented 1 year ago

ping, any idea how to solve it?

I'm using pg-search to search through paths like /var/foo and /var/foo-bar and pg_search doesn't return anything

my code looks like:

class Repository < ApplicationRecord
  include PgSearch::Model
  belongs_to :category
  pg_search_scope :search_name, against: %i[name path], using: :trigram
end

calling Repository.search_name "bci" generates the following query

  Repository Load (2.4ms)  SELECT "repositories".* FROM "repositories" INNER JOIN (SELECT "repositories"."id" AS pg_search_id, (ts_rank((to_tsvector('simple', coalesce("repositories"."name"::text, '')) || to_tsvector('simple', coalesce("repositories"."path"::text, ''))), (to_tsquery('simple', ''' ' || 'bci' || ' ''')), 0)) AS rank FROM "repositories" WHERE ('bci' % (coalesce("repositories"."name"::text, '') || ' ' || coalesce("repositories"."path"::text, '')))) AS pg_search_a8038884f8ec5d3389ecea ON "repositories"."id" = pg_search_a8038884f8ec5d3389ecea.pg_search_id ORDER BY pg_search_a8038884f8ec5d3389ecea.rank DESC, "repositories"."id" ASC
(Object doesn't support #inspect)                                                                     
=>                                              

I for sure have objects to be found in my db:

irb(main):018:0> Repository.all.map { |c| c.name.match?(/^bci\//) }.count
  Repository Load (3.9ms)  SELECT "repositories".* FROM "repositories"
=> 48  

On my db, I have theoretically the necessary extensions:

root@4fb9d81eb9c7:/# apt-get install postgresql-common
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
postgresql-common is already the newest version (246.pgdg110+1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
root@4fb9d81eb9c7:/# exit
9mm commented 1 year ago

Also find this highly annoying, I didnt expect to get this far and be stopped by a hyphen... my database is mostly model numbers

guar47 commented 1 year ago

Has anyone managed to solve this? I haven't tried trigram yet but using a different type of dictionaries don't help.

brusacco commented 3 months ago

Any update on this issue? I see this: https://github.com/Casecommons/pg_search/commit/07161f169ddd2508b2116e821faab2610e5dc125

- DISALLOWED_TSQUERY_CHARACTERS = /['?\-\\:]/
+DISALLOWED_TSQUERY_CHARACTERS = /['?\\:]/

So basically they removed the '-' from the DISALLOWED_TSQUERY_CHARACTERS, is there a way to set this as a setting in the project, the DISALLOWED_TSQUERY_CHARACTERS?