etianen / django-watson

Full-text multi-table search application for Django. Easy to install and use, with good performance.
BSD 3-Clause "New" or "Revised" License

WIP: Support multilingual searches #249

Closed CuriousLearner closed 4 years ago

CuriousLearner commented 6 years ago

An attempt to refactor the library to make multilingual searches work for #248.

I'm not sure, but it seems like the buildwatson command isn't indexing the SearchEntry properly.

I'm using a different configuration for Chinese. I see that the buildwatson command activates the particular language before doing anything, but how does it know which parser to use before indexing the data?

@etianen Can you please help?

CuriousLearner commented 6 years ago

Alright, further research shows that the watson_searchentry table has not filled in search_tsv for Chinese characters, although it did for English.

Note that I already ran buildwatson with zh-cn to ensure Chinese characters are parsed.

etianen commented 6 years ago

The big problem here is that the watson postgres backend creates a database table and an index using a single language catalogue. Adding multiple search backends with different language settings means that they'll both conflict with each other, and fight over which is the "true" language for the index.

To make this work, each search backend would need its own database column added to the watson table, containing the tsvector parsed according to the desired language. This is a major refactoring effort.
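To illustrate the per-column idea, here is a minimal sketch of how a search query could pick the tsvector column and text search configuration for a given language. It is not how watson works today: the `LANGUAGE_COLUMNS` mapping, the `search_tsv_en` / `search_tsv_zh_cn` column names, and the `build_search_sql` helper are all hypothetical, and the `chinese` configuration is assumed to come from an extension, since stock PostgreSQL does not ship one.

```python
# Hypothetical mapping: language code -> (tsvector column, text search config).
# None of these columns exist in watson; "chinese" is assumed to be provided
# by a PostgreSQL extension and is not a built-in configuration.
LANGUAGE_COLUMNS = {
    "en": ("search_tsv_en", "pg_catalog.english"),
    "zh-cn": ("search_tsv_zh_cn", "chinese"),
}


def build_search_sql(language_code, table="watson_searchentry"):
    """Return a parameterised query against the column for one language.

    The column and config come from the trusted mapping above; the user's
    search text is left as a %s placeholder for the database driver.
    """
    column, config = LANGUAGE_COLUMNS[language_code]
    return (
        f"SELECT id FROM {table} "
        f"WHERE {column} @@ plainto_tsquery('{config}', %s)"
    )


if __name__ == "__main__":
    print(build_search_sql("zh-cn"))
```

With separate columns, backends configured for different languages would no longer compete for the single search_tsv column.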

CuriousLearner commented 6 years ago

@etianen Yeah, I guessed that it would need a refactor as well once I started working on this.

I can try to see what I can do here. But do you have any idea why search_tsv isn't populated for Chinese characters if I run buildwatson for zh-cn?

etianen commented 6 years ago

buildwatson will use the search settings for the search backend that was the default when ./manage.py migrate was run.

So if you ran ./manage.py migrate when the default search backend was configured for English, then all content will be indexed in English.

carlos22 commented 5 years ago

You need to iterate over all languages you have and create an index for them (i.e. with one column for each lang or even a whole table).
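As a rough sketch of that loop, assuming one tsvector column and one GIN index per language on the existing watson_searchentry table, the DDL could be generated like this. The `per_language_ddl` helper and the `search_tsv_<lang>` naming scheme are made up for illustration and are not part of watson.

```python
def per_language_ddl(languages, table="watson_searchentry"):
    """Build ALTER TABLE / CREATE INDEX statements, one pair per language.

    Column names follow a hypothetical search_tsv_<lang> scheme, e.g.
    "zh-cn" becomes search_tsv_zh_cn.
    """
    statements = []
    for code in languages:
        suffix = code.lower().replace("-", "_")
        column = f"search_tsv_{suffix}"
        # One tsvector column per language.
        statements.append(f"ALTER TABLE {table} ADD COLUMN {column} tsvector")
        # One GIN index per column so each language can be searched efficiently.
        statements.append(
            f"CREATE INDEX {table}_{column} ON {table} USING gin({column})"
        )
    return statements


if __name__ == "__main__":
    for statement in per_language_ddl(["en", "zh-cn"]):
        print(statement + ";")
```

Each column would still have to be populated with `to_tsvector` using the matching text search configuration, which this sketch leaves out.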

CuriousLearner commented 4 years ago

Hey @etianen

Wouldn't it be okay to keep the issue open, so that anyone who wants to do the refactor can pick it up, and so that those searching for a similar issue can find it in their search results?

etianen commented 4 years ago

It's still going to turn up in search results. But if nobody is working on it or paying attention to it, "closed" sounds about right to me.
