etianen / django-watson

Full-text multi-table search application for Django. Easy to install and use, with good performance.
BSD 3-Clause "New" or "Revised" License
1.2k stars 130 forks source link

watson don't find result when searchstring contains a dash #223

Closed soerenbe closed 6 years ago

soerenbe commented 6 years ago

Hi there, I discovered a bug or at least an unexpected behaviour when searching for strings in postgres databases. It seems like watson not correctly find strings with a dash ('-'). If you have for example following entries:

germany-dev-east england-dev-east

watson will not find a search for 'dev-east'. Searching for 'germany-dev' on the other hand works.

I tracked it down to the postgres layer and discovered that there seems to be an issue, when passing the search string to the postgres function 'to_tsquery'. It looks like watson simply passes the search string to that function which works in most cases, but not in all.

See https://www.postgresql.org/docs/9.5/static/textsearch-controls.html 12.3.2:

to_tsquery creates a tsquery value from querytext, which must consist of single tokens separated by the Boolean operators & (AND), | (OR) and ! (NOT). These operators can be grouped using parentheses. In other words, the input to to_tsquery must already follow the general rules for tsquery input, as described in Section 8.11.

To reproduce the error I have created a small Testcase. I used python3, but the error is the same at python2.

Unfortunately this digs quite deep into watson and postgres and I cant see a quick fix. So I cant see a ways for me to fix this. Maybe there is a failure in my analysis and I would be happy to find a solution to this issue.

Steps to reproduce

$ lsb_release -a
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.3 LTS
Release:    16.04
Codename:   xenial

$ apt-cache policy postgresql
postgresql:
  Installed: (none)
  Candidate: 9.5+173
  Version table:
     9.5+173 500
        500 http://de.archive.ubuntu.com/ubuntu xenial/main amd64 Packages
        500 http://de.archive.ubuntu.com/ubuntu xenial/main i386 Packages

$ python3 --version
Python 3.5.2

$ git clone https://github.com/etianen/django-watson
$ cd django-watson/
$ python3 -m venv venv
$ source venv/bin/activate
(venv) $ pip install django psycopg2
(venv) $ pip list
Django (1.11.7)
pip (8.1.1)
pkg-resources (0.0.0)
psycopg2 (2.7.3.2)
pytz (2017.3)
setuptools (20.7.0)

(venv) $ cd tests/
(venv) /tests$ ln -s ../watson/ .

# Add the test class from bellow

(venv) $ python3 runtests.py -d psql

Testclass

class DashTest(TestCase):

    def setUp(self):
        self.registered_models = watson.get_registered_models()
        for model in self.registered_models:
            watson.unregister(model)
        watson.register(WatsonTestModel2, exclude=("id",))

    def tearDown(self):
        for model in self.registered_models:
            watson.register(model)
        watson.unregister(WatsonTestModel2)
        WatsonTestModel2.objects.all().delete()
        del self.test_obj1
        del self.test_obj2
        SearchEntry.objects.all().delete()

    def test_demo(self):

        self.test_obj1 = WatsonTestModel2.objects.create(
            title="title-barbar",
            content="search content",
            description="search description",
            )
        self.test_obj2 = WatsonTestModel2.objects.create(
            title="title-foobar",
            content="search content",
            description="search description",
            )
        # Following tests work
        self.assertEqual(watson.search('title-barbar').count(), 1)
        self.assertEqual(watson.search('title-foobar').count(), 1)
        self.assertEqual(watson.search('title-foo').count(), 1)
        self.assertEqual(watson.search('title').count(), 2)
        self.assertEqual(watson.search('title-').count(), 2)
        self.assertEqual(watson.search('foobar').count(), 1)
        self.assertEqual(watson.search('-foobar').count(), 1)
        # This fails
        self.assertEqual(watson.search('e-foobar').count(), 1)
etianen commented 6 years ago

Thanks for the detailed report!

I think that this is actually working as expected. Postgres does prefix matching, meaning that it'll match from the start of the word, but not the mid of the word. This means that, for the word, "foobar", the following matches will work: "f", "fo", "foo", "foob", "fooba", "foobar".

It looks like postgres is considering "germany-dev-east" to be a single word, in this case. So "germany-dev" would work as a prefix match, but "dev-east" would not.

On 6 November 2017 at 19:53, Sören Berger notifications@github.com wrote:

Hi there, I discovered a bug or at least an unexpected behaviour when searching for strings in postgres databases. It seems like watson not correctly find strings with a dash ('-'). If you have for example following entries:

germany-dev-east england-dev-east

watson will not find a search for 'dev-east'. Searching for 'germany-dev' on the other hand works.

I tracked it down to the postgres layer and discovered that there seems to be an issue, when passing the search string to the postgres function 'to_tsquery'. It looks like watson simply passes the search string to that function which works in most cases, but not in all.

See https://www.postgresql.org/docs/9.5/static/textsearch-controls.html 12.3.2:

to_tsquery creates a tsquery value from querytext, which must consist of single tokens separated by the Boolean operators & (AND), | (OR) and ! (NOT). These operators can be grouped using parentheses. In other words, the input to to_tsquery must already follow the general rules for tsquery input, as described in Section 8.11.

To reproduce the error I have created a small Testcase. I used python3, but the error is the same at python2.

Unfortunately this digs quite deep into watson and postgres and I cant see a quick fix. So I cant see a ways for me to fix this. Maybe there is a failure in my analysis and I would be happy to find a solution to this issue. Steps to reproduce

$ lsb_release -a Distributor ID: Ubuntu Description: Ubuntu 16.04.3 LTS Release: 16.04 Codename: xenial

$ apt-cache policy postgresql postgresql: Installed: (none) Candidate: 9.5+173 Version table: 9.5+173 500 500 http://de.archive.ubuntu.com/ubuntu xenial/main amd64 Packages 500 http://de.archive.ubuntu.com/ubuntu xenial/main i386 Packages

$ python3 --version Python 3.5.2

$ git clone https://github.com/etianen/django-watson $ cd django-watson/ $ python3 -m venv venv $ source venv/bin/activate (venv) $ pip install django psycopg2 (venv) $ pip list Django (1.11.7) pip (8.1.1) pkg-resources (0.0.0) psycopg2 (2.7.3.2) pytz (2017.3) setuptools (20.7.0)

(venv) $ cd tests/ (venv) /tests$ ln -s ../watson/ .

Add the test class from bellow

(venv) $ python3 runtests.py -d psql

Testclass

class DashTest(TestCase):

def setUp(self):
    self.registered_models = watson.get_registered_models()
    for model in self.registered_models:
        watson.unregister(model)
    watson.register(WatsonTestModel2, exclude=("id",))

def tearDown(self):
    for model in self.registered_models:
        watson.register(model)
    watson.unregister(WatsonTestModel2)
    WatsonTestModel2.objects.all().delete()
    del self.test_obj1
    del self.test_obj2
    SearchEntry.objects.all().delete()

def test_demo(self):

    self.test_obj1 = WatsonTestModel2.objects.create(
        title="title-barbar",
        content="search content",
        description="search description",
        )
    self.test_obj2 = WatsonTestModel2.objects.create(
        title="title-foobar",
        content="search content",
        description="search description",
        )
    # Following tests work
    self.assertEqual(watson.search('title-barbar').count(), 1)
    self.assertEqual(watson.search('title-foobar').count(), 1)
    self.assertEqual(watson.search('title-foo').count(), 1)
    self.assertEqual(watson.search('title').count(), 2)
    self.assertEqual(watson.search('title-').count(), 2)
    self.assertEqual(watson.search('foobar').count(), 1)
    self.assertEqual(watson.search('-foobar').count(), 1)
    # This fails
    self.assertEqual(watson.search('e-foobar').count(), 1)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/etianen/django-watson/issues/223, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJFCHvHoG5wfMkId5IItN1aZgVJ5Gpbks5sz2O5gaJpZM4QTyI3 .

soerenbe commented 6 years ago

Thanks for your feedback. I think you are right. It seems that you can not find a string in a word. Normally this is fine, but in my scenario this seems unintuitive. I thinks I have to ship around that somehow.