etianen / django-watson

Full-text multi-table search application for Django. Easy to install and use, with good performance.
BSD 3-Clause "New" or "Revised" License
1.21k stars 129 forks source link

MySQL: Queries containing short words or stop-words do not return any result #125

Closed jvamvas closed 7 years ago

jvamvas commented 9 years ago

Currently, when MySQL is used, each word is enclosed in +{word}*:

def escape_mysql_boolean_query(search_text):
    return " ".join(
        '+{word}*'.format(
            word = word,
        )
        for word in escape_query(search_text).split()
    )

According to the MySQL 5.7 Reference for Boolean Full-Text Searches, a leading + indicates that the word must be present in all results. On the other hand, the * operator tells MySQL that the word should not be stripped from the query, even if it is too short or a stop-word.

This means that search queries containing a word smaller than ft_min_word_len or containing a stop-word will never return a result. For example, a search query containing the MySQL stop-word "new" fails, even though there is an item named "New York City".

>>> watson.search('york city')
[<SearchEntry: New York City>]
>>> watson.search('new york city')
[]
etianen commented 9 years ago

I think that it's intentional that short words and stop words are stripped from the query.

Otherwise, a search for "is" will return every item in your DB.

On Mon, 14 Sep 2015 at 11:45 Jannis Vamvas notifications@github.com wrote:

Currently, when MySQL is used, each word is enclosed in +{word}*:

def escape_mysql_boolean_query(search_text): return " ".join( '+{word}*'.format( word = word, ) for word in escape_query(search_text).split() )

According to the MySQL 5.7 Reference for Boolean Full-Text Searches https://dev.mysql.com/doc/refman/5.7/en/fulltext-boolean.html, a leading + indicates that the word must be present in all results. On the other hand, the * operator tells MySQL that the word should not be stripped from the query, even if it is too short or a stop-word.

This means that search queries containing a word smaller than ft_min_word_len or containing a stop-word will never return a result. For example, a search query containing the MySQL stop-word https://dev.mysql.com/doc/refman/5.7/en/fulltext-stopwords.html "new" fails, even though there is an item named "New York City".

watson.search('york city') [] watson.search('new york city') []

— Reply to this email directly or view it on GitHub https://github.com/etianen/django-watson/issues/125.

jvamvas commented 9 years ago

Just to be clear, is the behaviour of watson in my search example intentional as well?

etianen commented 9 years ago

No, that's not intentional. That would seem to be an annoying way MySQL strips stop words for the corpus, but not for the query.

I thought that the * suffix implied a prefix match on the results?

I wonder if there is a way to keep the prefix matching without creating this behaviour?

(Incidentally, the postgres backend doesn't have this issue)

On Mon, 14 Sep 2015 at 12:07 Jannis Vamvas notifications@github.com wrote:

Just to be clear, is the behaviour of watson in my search example intentional as well?

— Reply to this email directly or view it on GitHub https://github.com/etianen/django-watson/issues/125#issuecomment-140040939 .

jvamvas commented 9 years ago

An option is to remove the leading +, which would make the words optional. However, this would change the behaviour beyond fixing this bug, and some of the tests would have to be rewritten.

etianen commented 9 years ago

It would make the query into an OR query, rather than an AND query, which would make the MySQL backend behave differently to all the other backends. That's not really an option.

On Mon, 14 Sep 2015 at 12:43 Jannis Vamvas notifications@github.com wrote:

An option is to remove the leading +, which would make the words optional. However, this would change the behaviour beyond fixing this bug, and some of the tests would have to be rewritten.

— Reply to this email directly or view it on GitHub https://github.com/etianen/django-watson/issues/125#issuecomment-140046721 .

cpcbell commented 7 years ago

I'm seeing something similar ( i think ). When I search with certain terms I get no results, and I cannot figure out why. I thought it had something to do with the '@' character, but I also found other random terms that should result in a match but are returning no objects.

The most perplexing is the term "One", it should easily match the index but doesn't produce results.

The other really interesting one is the term "a@b.com" which is an exact match for a term in the index, again no objects returned.

django-watson (1.3.1) mysql 5.7 python 3.5.3

etianen commented 7 years ago

Consider adjusting your MySQL config to a shorter minimum word length?