Closed: gibrown closed this issue 1 year ago
One case: Shogun vs Shōgun results are different.
Still on the Japanese language theme, we have:
Tokyo vs Tōkyō
Sumo vs Sumō
Ryu vs Ryū
... and a zillion more. It seems Google has resolved this issue by making Shogun = Shōgun (this is a layman's way of putting it, apologies); what I mean is that they manage to treat the two words as interchangeable. How far along are you, and how long will it take for this project to be completed? Thx
@Robertght @Kazuko-Nishimura good examples. Do you know if these are cases where the blog language is set to ja? Or is this in English? I'd be surprised if it happens in English (or most languages), but I can imagine that the Japanese tokenization is not doing the right thing for Latin characters. Great examples, thanks.
Hi Greg,
My blogs are supposed to be in English. As I mentioned, many Japanese words have already been adopted into the English vocabulary; hence, if I insist on Shōgun, knowing that Shogun is the form listed in the Oxford English Dictionary, it can sound a bit pretentious and patronising, and this is not my intention. The website I am designing supports the novel I wrote, which is going to be published shortly. The book’s story spans the mythological era to medieval Japan. Therefore, I had a tough decision to make. Whilst words such as Shogun, Tokyo, karate and Kyushu are present in the modern English vernacular, Ōkami, sonnō-jōi and Chōshū are not. The dialogues in the book are entirely between two or more Japanese persons (no significant number of Europeans or Americans reached Japan until the late 19th century). Hence, it would not sound right if a character who lives in the time of the samurai uttered the words Shogun and Chōshū in the same sentence; to an attentive eye it might even look a bit anachronistic. Please also consider that these are words that cannot be translated into English.

I am still thinking about how to resolve this issue. For the reasons mentioned above and for the sake of consistency, I would like to use the word Shōgun; however, this decision might reduce the number of visitors to my site. I wonder what happens with languages like French, German or Portuguese, to mention the most obvious ones. What kind of interchangeable word can be used for words such as ‘très’, ‘König’ and ‘São João’? (‘tres’, ‘Koenig’ and ‘San Joao’, perhaps?) Google does seem to have adopted a set of interesting and smart rules to overcome these difficulties.

Anyway, I would be grateful if you could let me know if you have any sort of estimate of when this change is coming. Please let me know if I can be of further assistance.
Best regards, Laura (Kazuko is my alias)
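For readers following along, here is a minimal sketch of the kind of folding being asked for, where Shogun and Shōgun (or très and tres) end up as the same search term. It uses only Python's standard `unicodedata` module. The function names `fold_diacritics` and `same_after_folding` are made up for this illustration, and nothing here is taken from Jetpack Search itself. Search engines normally do this inside the text analyzer (Elasticsearch ships an `asciifolding` token filter, for example); this thread does not say which filters the Jetpack index uses.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Strip combining marks so that 'Shōgun' becomes 'Shogun'.

    Illustrative helper (hypothetical name), not Jetpack Search code.
    """
    # NFD splits a letter such as 'ō' into 'o' plus a combining macron.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark and keep the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left into the usual NFC form.
    return unicodedata.normalize("NFC", stripped)


def same_after_folding(a: str, b: str) -> bool:
    """Compare two terms while ignoring case and diacritics."""
    return fold_diacritics(a).casefold() == fold_diacritics(b).casefold()


if __name__ == "__main__":
    for term in ["Shōgun", "Tōkyō", "Ōkami", "très", "König", "São João"]:
        print(term, "->", fold_diacritics(term))
    print(same_after_folding("Shogun", "Shōgun"))
```

Note that plain diacritic stripping turns König into Konig, not Koenig. Language conventions such as the German "oe" spelling need extra, language-specific rules on top of this.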
In 1947780-zen the site was returning 0 results when the site language was set to zh_CN, both for Chinese and English search terms. This didn't occur when the site language was English. After disabling Jetpack Search, the site started returning results regardless of the site language. I'm not 100% sure this is the same issue, so let me know if I should report it somewhere else.
This issue has been marked as stale. This happened because:
No further action is needed. But it's worth checking if this ticket has clear reproduction steps and it is still reproducible. Feel free to close this issue if you think it's not valid anymore — if you do, please add a brief explanation.
A note here: the current infrastructure has a limitation in its support of utf8 (up to 3 bytes per character) and utf8mb4 (up to 4 bytes per character). Our infrastructure persists data as latin1, which causes conversion or loss of some characters. This latin1 data is what gets indexed into the Elasticsearch index. We need to review whether this is contributing to non-matches of content.
We do run some conversion code when going from the DB to UTF-8 (ES only supports UTF-8), but yes, there could be some errors in these cases. I think every case I have seen so far has appeared to be caused by our indexing, though.
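To make the latin1 concern concrete, here is a small sketch of two generic ways text can be damaged on the way through a latin1-only store. It is an illustration under assumptions, not a description of the actual storage or conversion code, which is not shown in this thread. The helper names `store_as_latin1` and `utf8_read_as_latin1` are hypothetical.

```python
def store_as_latin1(text: str) -> str:
    """Simulate a lossy round trip through a latin1-only field.

    Hypothetical helper: characters outside latin1 are replaced with '?'.
    """
    return text.encode("latin-1", errors="replace").decode("latin-1")


def utf8_read_as_latin1(text: str) -> str:
    """Simulate UTF-8 bytes being misread as latin1 (classic mojibake)."""
    return text.encode("utf-8").decode("latin-1")


if __name__ == "__main__":
    # 'è' exists in latin1, so the French word survives the round trip.
    print(store_as_latin1("très"))
    # 'ō' does not exist in latin1, so it is lost.
    print(store_as_latin1("Shōgun"))
    # Misread bytes look garbled, but the damage is reversible
    # if the conversion back is done correctly.
    garbled = utf8_read_as_latin1("très")
    print(garbled)
    print(garbled.encode("latin-1").decode("utf-8"))
```

The first failure mode loses information for good, while the second is recoverable only if the conversion code reverses it exactly. Either one would leave the indexed term different from what a visitor types into the search box.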
Some internal discussion of this in p3QzjZ-Sb-p2
Another case in #4847028-zen. I recommended turning off the Jetpack Search module for now.
We made lots of improvements on multi-lingual analysis in spring 2022. See https://jetpack.com/2022/07/01/jetpack-search-improvements-helping-your-visitors-find-more-of-what-they-are-looking-for/ I am marking this as complete. We can open a separate issue
There are a number of cases where our matching for non-English terms has problems. These are usually text analysis problems. Unfortunately we have to completely rebuild the index to fix these, so I am opening this issue to publicly track them and so that when we do rebuild the index we can make sure to address all of them. (Please edit and add to this list)
Cyrillic С (code point 1057) and ASCII C (code point 67). This came up trying to match "СУП" with "CУП" in Russian. Bulgarian and others would also have this problem.
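A short sketch of why these two letters do not match, and of one possible way to fold them. Both the mapping table `LATIN_TO_CYRILLIC` and the function `fold_mixed_script` are hypothetical and deliberately incomplete; they are not the fix that was applied to the index. The sketch only rewrites tokens that already contain a Cyrillic letter, on the assumption that a pure Latin word should be left alone.

```python
import unicodedata

CYRILLIC_ES = "\u0421"  # С, code point 1057
LATIN_C = "C"           # C, code point 67

# Hypothetical, incomplete map from Latin look-alikes to Cyrillic letters.
LATIN_TO_CYRILLIC = str.maketrans({
    "A": "\u0410", "a": "\u0430",
    "C": "\u0421", "c": "\u0441",
    "E": "\u0415", "e": "\u0435",
    "O": "\u041e", "o": "\u043e",
    "P": "\u0420", "p": "\u0440",
    "X": "\u0425", "x": "\u0445",
})


def has_cyrillic(token: str) -> bool:
    """True if any character falls in the basic Cyrillic block."""
    return any("\u0400" <= ch <= "\u04ff" for ch in token)


def fold_mixed_script(token: str) -> str:
    """Replace Latin look-alikes, but only inside tokens that contain Cyrillic."""
    return token.translate(LATIN_TO_CYRILLIC) if has_cyrillic(token) else token


if __name__ == "__main__":
    for ch in (CYRILLIC_ES, LATIN_C):
        print(ord(ch), unicodedata.name(ch))
    # Unicode normalization does not unify the two letters.
    print(unicodedata.normalize("NFKC", CYRILLIC_ES) == LATIN_C)
    typed = "C\u0423\u041f"          # Latin C followed by Cyrillic УП
    indexed = "\u0421\u0423\u041f"   # all-Cyrillic СУП
    print(typed == indexed, fold_mixed_script(typed) == indexed)
```

A table like this would have to be applied identically at index time and at query time, which fits the point above that fixing these cases requires rebuilding the index.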