Search: improved character/term matching esp for non-English

gibrown commented 5 years ago

There are a number of cases where our matching for non-English terms has problems. These are usually text analysis problems. Unfortunately we have to completely rebuild the index to fix these, so I am opening this issue to publicly track them and so that when we do rebuild the index we can make sure to address all of them. (Please edit and add to this list)

[ ] Text in Cyrillic-script languages often contains Latin characters that look identical to Cyrillic ones. A good example is Cyrillic С (code point 1057, U+0421) and ASCII C (code point 67). This came up trying to match "СУП" with "CУП" in Russian. Bulgarian and others would also have this problem.
[ ] "icu_folding wrongly modifies japanese characters, leading to a complete change in the meaning, for eg icu_folding of パリ returns ハリ". The first is the proper name for Paris, the second means “needle”
[ ] We don't currently have a custom analyzer for Ukrainian, but one exists.
[ ] Bengali is the most spoken language for which we don't currently have a custom analyzer - https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#bengali-analyzer
[ ] English often includes the apostrophe when stemming possessives. Maybe use https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-apostrophe-tokenfilter.html
[ ] The Korean analyzer would probably work better than our current CJK analyzer. Our current handling of Korean is particularly bad for what we are doing with edge n-grams.
[ ] For search-as-you-type fields we need to handle cases where short words can get filtered out: https://github.com/elastic/elasticsearch/issues/22478
[ ] Finnish would likely benefit from decompounding (as would German). Example: "eräpäivä" doesn't match "Eräpäivämuistutusten".
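
To make the first item in the list concrete, here is a minimal Python sketch of the look-alike problem and one possible normalization. It is an illustration only, not what the index does today: the `LATIN_TO_CYRILLIC` table and the `normalize_homoglyphs` helper are hypothetical names, the table is deliberately incomplete, and the rule "only rewrite tokens that already contain Cyrillic" is an assumption chosen to avoid touching plain English words.

```python
# Hypothetical, incomplete look-alike table: Latin letter -> Cyrillic twin.
# Escapes are used so the two scripts cannot be confused when reading the code.
LATIN_TO_CYRILLIC = str.maketrans({
    "A": "\u0410", "B": "\u0412", "C": "\u0421", "E": "\u0415",
    "H": "\u041d", "K": "\u041a", "M": "\u041c", "O": "\u041e",
    "P": "\u0420", "T": "\u0422", "X": "\u0425",
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
})


def is_cyrillic(ch: str) -> bool:
    """True for characters in the basic Cyrillic block (U+0400..U+04FF)."""
    return "\u0400" <= ch <= "\u04ff"


def normalize_homoglyphs(token: str) -> str:
    """Map Latin look-alikes to Cyrillic, but only in tokens that already
    contain at least one Cyrillic character (assumed heuristic)."""
    if any(is_cyrillic(ch) for ch in token):
        return token.translate(LATIN_TO_CYRILLIC)
    return token


if __name__ == "__main__":
    mixed = "C\u0423\u041f"            # Latin C followed by Cyrillic У, П
    pure = "\u0421\u0423\u041f"        # all-Cyrillic СУП
    print(ord("\u0421"), ord("C"))     # the two code points from the list item
    print(mixed == pure)               # the raw strings differ
    print(normalize_homoglyphs(mixed) == pure)
```

Applying the same normalization at both index time and query time would let the two spellings match; a pure-Latin token such as "CUP" is left alone by this heuristic.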

Robertght commented 5 years ago

One case: Shogun vs Shōgun results are different.

Kazuko-Nishimura commented 5 years ago

Still on the Japanese language theme, we have: Tokyo vs Tōkyō, Sumo vs Sumō, Ryu vs Ryū
... and zillions more. It seems Google has resolved this issue by making Shogun = Shōgun (this is the language of a layman, apologies); what I mean is that they managed to make these two words interchangeable. How far along are you, and how long will it take for this project to be completed? Thx

gibrown commented 5 years ago

@Robertght @Kazuko-Nishimura good examples. Do you know if these are cases where the blog language is set to ja? Or is this in English? I'm surprised if it happens in English (or most languages), but I can imagine that the Japanese tokenization is not doing the right thing for Latin characters. Great example, thanks.
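
For reference, the kind of folding being asked for (Shōgun matching Shogun) and the Japanese pitfall from the list above (パリ turning into ハリ) can both be reproduced with a few lines of Python. This is only a sketch using the standard `unicodedata` module; it is not how `icu_folding` is implemented, and the function names `fold_all_marks` and `fold_latin_marks` are made up for illustration.

```python
import unicodedata


def fold_all_marks(text: str) -> str:
    """Naive folding: decompose (NFD) and drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def fold_latin_marks(text: str) -> str:
    """Drop combining marks only when the base character is a Latin letter,
    so kana voicing marks (and other scripts) are left untouched."""
    out = []
    base_is_latin = False
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if base_is_latin:
                continue  # discard the accent on a Latin base letter
        else:
            base_is_latin = "LATIN" in unicodedata.name(ch, "")
        out.append(ch)
    # Recompose whatever marks were kept (e.g. ハ + semi-voiced mark -> パ).
    return unicodedata.normalize("NFC", "".join(out))


if __name__ == "__main__":
    for word in ["Shōgun", "Tōkyō", "très", "パリ"]:
        print(word, fold_all_marks(word), fold_latin_marks(word))
```

With the naive version, "Shōgun" becomes "Shogun" but "パリ" also becomes "ハリ", because パ decomposes into ハ plus a combining semi-voiced sound mark. Restricting the folding to Latin base letters keeps "パリ" intact while still folding the macrons.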

Kazuko-Nishimura commented 5 years ago

Hi Greg,

The language used in my blogs is supposed to be English. As I mentioned, there are already many Japanese words that have been adopted into the English vocabulary; hence, if I want to force Shōgun, knowing that Shogun is the word listed in the Oxford English Dictionary, it can sound a bit pretentious and patronising, and this is not my intention.

The website I am designing supports the novel I wrote, which is going to be published shortly. The book’s story spans the mythological era to medieval Japan. Therefore, I had a tough decision to make. Whilst words such as Shogun, Tokyo, karate, and Kyushu are present in the modern English vernacular, Ōkami, sonnō-jōi, and Chōshū are not. The dialogues of the characters of the book are entirely between two or more Japanese persons (not a significant number of Europeans/Americans had reached Japan until the late 19th century). Hence, it will not sound right if a character who lives in the time of the samurai utters the words Shogun and Chōshū in the same sentence, which to an attentive eye might even sound a bit anachronistic. Please also consider that these are words that cannot be translated into English. I am still thinking about how I am going to resolve this issue. For the reasons mentioned above and for the sake of consistency, I would like to use the word Shōgun; however, this decision might reduce the number of viewers of my site.

I wonder what happens with languages like French, German or Portuguese, to mention the most obvious ones. What kind of interchangeable word can be used for words such as ‘très’, ‘König’ and ‘São João’? (‘tres’, ‘Koenig’ and ‘San Joao’, perhaps?) Google does seem to have adopted a set of interesting and smart rules to overcome these difficulties. Anyway, I will be grateful if you could let me know if you have any sort of estimate of when this change is coming. Please let me know if I can be of further assistance.

Best regards, Laura (Kazuko is my alias)

madeincosmos commented 5 years ago

In 1947780-zen the site was returning 0 results when site language was set to zh_CN, both for Chinese and English search terms. This didn't occur when site language was English. After disabling Jetpack search the site started returning results no matter the site language. Not 100% sure if that's the same issue so let me know if I should report it somewhere else.

stale[bot] commented 4 years ago

This issue has been marked as stale. This happened because:

It has been inactive in the past 6 months.
It hasn’t been labeled `[Pri] Blocker`, `[Pri] High`.

No further action is needed. But it's worth checking if this ticket has clear reproduction steps and it is still reproducible. Feel free to close this issue if you think it's not valid anymore — if you do, please add a brief explanation.

mdbitz commented 2 years ago

A note here: the current infrastructure has a limitation in its support for utf8 (up to 3 bytes per character) and utf8mb4 (up to 4 bytes per character). Our infrastructure persists data as latin1, which causes conversion or loss of some characters. This latin1 data is what gets indexed into the Elasticsearch index. We need to review whether this is contributing to content not matching.
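
As a rough illustration of why a latin1 storage layer can hurt matching, the Python sketch below simulates two generic failure modes: a lossy write where unsupported characters are replaced, and UTF-8 bytes being read back as latin1. Which of these (if either) applies to the actual infrastructure is not established in this thread; the function names are hypothetical.

```python
def store_as_latin1(text: str) -> bytes:
    """Simulate a lossy write: characters outside latin1 become '?'."""
    return text.encode("latin-1", errors="replace")


def utf8_bytes_read_as_latin1(text: str) -> str:
    """Simulate UTF-8 bytes stored untouched but decoded as latin1."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(garbled: str) -> str:
    """Undo utf8_bytes_read_as_latin1 by reversing the wrong decode."""
    return garbled.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    # Byte lengths in UTF-8: 2 for ō, 3 for パ, 4 for an emoji.
    for ch in ["ō", "パ", "😀"]:
        print(repr(ch), len(ch.encode("utf-8")))

    print(store_as_latin1("Shōgun"))            # the macron is lost for good
    garbled = utf8_bytes_read_as_latin1("très")
    print(garbled, repair_mojibake(garbled))    # reversible misread
```

In the first case the original character cannot be recovered, so an indexed "Sh?gun" will never match a query for "Shōgun". In the second case the bytes are intact and a conversion step can restore the text before indexing.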

gibrown commented 2 years ago

We do run some conversion code when going from the DB to UTF-8 (ES only supports UTF-8), but yes, there could be some errors in these cases. I think every case I have seen has appeared to be caused by our indexing, though.

gibrown commented 2 years ago

Some internal discussion of this in p3QzjZ-Sb-p2

Robertght commented 2 years ago

Another case in #4847028-zen. I recommended turning off the Jetpack Search module for now.

robfelty commented 1 year ago

We made lots of improvements on multi-lingual analysis in spring 2022. See https://jetpack.com/2022/07/01/jetpack-search-improvements-helping-your-visitors-find-more-of-what-they-are-looking-for/ I am marking this as complete. We can open a separate issue

github-actions[bot] commented 1 year ago

Support References

This comment is automatically generated. Please do not edit it.

[ ] 1947780-zen
[ ] 4847028-zen

Automattic / jetpack

Search: improved character/term matching esp for non-English #10146