etianen / django-watson

Full-text multi-table search application for Django. Easy to install and use, with good performance.
BSD 3-Clause "New" or "Revised" License

Problem with text case using SQLite3 backend #220

Closed armpogart closed 6 years ago

armpogart commented 6 years ago

We have a small website with an SQLite3 backend. The website is in Russian. Watson is integrated and works fine in our local environment (Python 3.5). Unfortunately, our hosting provider only supports Python 2.x for Django, so we applied some workarounds and fixes (recommended by Django) to get unicode literals working there. Search works, except that searching for a word in lowercase does not find it when the stored text is in uppercase. This only happens for Russian; English is fine.

We have overridden the search adapter to apply a custom regex:

from __future__ import unicode_literals
from watson import search as watson
import re

class CatalogItemSearchAdapter(watson.SearchAdapter):
    def get_content(self, obj):
        m = " ".join(re.split(r'((?:(?:https?):\/\/)?((?:\.?www.?)?(.*\.[a-z]*)))', obj.website))

        return super(CatalogItemSearchAdapter, self).get_content(obj) + m

Is there any workaround, or a tip on where the problem could be?

etianen commented 6 years ago

The database backend is configured for a particular language.

https://github.com/etianen/django-watson/wiki/Language-support

So you can configure it to support search normalisation for English or Russian. Unfortunately, you can't configure it for both!

If your site is single-language, then simply switch watson over to the new search config. You may need to run:

./manage.py migrate watson zero
./manage.py migrate watson

To rebuild the index in Russian after changing the setting.
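As a rough sketch of the setting change described above: this assumes the setting is named WATSON_POSTGRES_SEARCH_CONFIG, as I recall it from the linked wiki page, so check that page before relying on it. It only applies when watson runs on the PostgreSQL backend.

```python
# settings.py (sketch; the setting name is an assumption taken from the
# Language-support wiki page linked above, verify it there)

# "pg_catalog.russian" is one of PostgreSQL's built-in text search configs.
WATSON_POSTGRES_SEARCH_CONFIG = "pg_catalog.russian"
```

After changing the value, the two migrate commands above would rebuild the index under the new configuration.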

(Email reply to Arman Poghosyan's comment of 10 October 2017, 16:31.)

armpogart commented 6 years ago

Unfortunately, the website is in both languages :) What problems will I run into with English if I reindex for Russian now? Similar ones?

etianen commented 6 years ago

Yes, sadly.

I believe postgres can support two languages in a database, on a per-row basis, but you'll have to drop watson and go with raw postgres full text search querying. Totally do-able, but more effort.

(Email reply to Arman Poghosyan's comment of 14 October 2017, 23:54.)

armpogart commented 6 years ago

Yeah, unfortunately the project is rather old and built on top of sqlite with rather bad practices. That's okay, as it is a very simple website with little traffic. We only needed to add search functionality, so it's not viable to migrate the project to postgres just for that. Anyway, thanks. I will try changing the language and check the behavior, then report back here soon (for others' reference) and close the issue.

etianen commented 6 years ago

Oh, haha, the language feature only works for postgres, even in single language mode. I just assumed you were using it! :)

The sqlite backend uses sqlite's case-insensitive regex to perform the search. I guess it's not able to work on the Russian alphabet.
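To illustrate the kind of limitation meant here, below is a small standalone sketch using Python's built-in sqlite3 module. It is not watson's code: it uses SQLite's LIKE operator as a stand-in for the regex lookup, because LIKE's case folding is documented as ASCII-only in a default build. The helper name like_contains is made up for this example.

```python
import sqlite3


def like_contains(conn, stored, query):
    """Return True if SQLite's LIKE finds `query` inside `stored`."""
    row = conn.execute(
        "SELECT ? LIKE ('%' || ? || '%')", (stored, query)
    ).fetchone()
    return bool(row[0])


conn = sqlite3.connect(":memory:")

# ASCII text: LIKE folds case, so this matches.
ascii_hit = like_contains(conn, "HELLO WORLD", "hello")

# Cyrillic text in mixed case: on a stock build without the ICU extension
# this is typically False, because the case folding only covers ASCII.
raw_hit = like_contains(conn, "ПРИВЕТ МИР", "привет")

# Cyrillic text with both sides lowercased in Python first: the comparison
# no longer depends on SQLite's case folding at all.
normalized_hit = like_contains(conn, "ПРИВЕТ МИР".lower(), "Привет".lower())

print(ascii_hit, raw_hit, normalized_hit)
```

The third call is the important one: once the stored text and the query are both lowercased before they reach the database, the match succeeds regardless of how the database handles case.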

Try lowercasing the content and description in your custom search adapter, using Python. You may have to use a library to support lowercasing Russian, I don't know. But if you feed all lowercase into watson, then the search will be case insensitive.
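A minimal sketch of that suggestion follows. LowercaseSearchMixin, normalize, FakeAdapter and DemoAdapter are hypothetical names introduced here; FakeAdapter only stands in for watson.SearchAdapter so the snippet runs without Django. It assumes the text is a unicode string (str on Python 3, unicode on Python 2), in which case the built-in lower() already handles Cyrillic and no extra library should be needed.

```python
def normalize(text):
    """Lowercase text so that matching no longer depends on case."""
    return text.lower()


class LowercaseSearchMixin(object):
    """Hypothetical mixin: lowercases whatever the parent adapter returns."""

    def get_title(self, obj):
        return normalize(super(LowercaseSearchMixin, self).get_title(obj))

    def get_description(self, obj):
        return normalize(super(LowercaseSearchMixin, self).get_description(obj))

    def get_content(self, obj):
        return normalize(super(LowercaseSearchMixin, self).get_content(obj))


class FakeAdapter(object):
    """Stand-in for watson.SearchAdapter, used only to make the sketch runnable."""

    def get_title(self, obj):
        return obj["title"]

    def get_description(self, obj):
        return obj["description"]

    def get_content(self, obj):
        return obj["content"]


class DemoAdapter(LowercaseSearchMixin, FakeAdapter):
    pass


item = {"title": "КАТАЛОГ", "description": "Описание", "content": "Mixed ТЕКСТ"}
adapter = DemoAdapter()
print(adapter.get_title(item), adapter.get_description(item), adapter.get_content(item))
```

In the real project the mixin would be listed before the adapter base class, for example class CatalogItemSearchAdapter(LowercaseSearchMixin, watson.SearchAdapter). The search query needs the same treatment (normalize(query)) before it is passed to watson, and the index has to be rebuilt afterwards, for instance with ./manage.py buildwatson. Note that lowercasing the title also changes what appears in search results, so results may need to be displayed from the original object rather than from the indexed text.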

(Email reply to Arman Poghosyan's comment of 17 October 2017, 12:41.)