BookStackApp / BookStack

A platform to create documentation/wiki content built with PHP & Laravel
https://www.bookstackapp.com/
MIT License

Chinese search cannot find words in the middle of a sentence. #778

Open jasoncheng7115 opened 6 years ago

jasoncheng7115 commented 6 years ago

For Bug Reports

When the word I'm looking for is the first word, or has a space in front of it, the search works. (Screenshot: i01)

But if the word is in the middle of a sentence, it cannot be found. (Screenshot: i02)

Is this related to how the full-text retrieval works?

Thanks!

alexwyl commented 5 years ago

The same problem in version: v0.25.1, I have just tried BookStack...

lotustalk commented 5 years ago

The same problem in version v0.24.3; I run it in Docker.

derky1202 commented 5 years ago

Still the same problem in v26.4. I hope it can get solved. Thanks.

sosize commented 4 years ago

As a workaround, you can search for "成功" (success) with the quotes included. Maybe the word segmentation has a bug; I hope it gets fixed.

LeonLiuY commented 4 years ago

Confirmed that this issue is still present in v0.27.5. One of my team members is hesitating to adopt BookStack because of it. I would like to see it fixed.

hlj commented 4 years ago

I hope this issue gets fixed soon.

ssddanbrown commented 4 years ago

Sorry about this issue. It essentially stems from my unfamiliarity with non-English text.

At the moment BookStack splits page content into terms on certain characters, such as spaces and some punctuation. Those terms are stored in the database for indexing, and a normal search runs a "Starts With" match against them.

As @sosize has mentioned, you can wrap a search in quotes, at which point BookStack will perform a "contains" against the content directly instead of the above "Starts With". This is not the default simply due to performance. ("Starts With" searches can use indexes much more effectively than "Contains").

I'm not really sure how we could utilise the "Starts With" system for such characters. Perhaps the search should default to a "Contains" search if such characters are found in a term?
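To make the difference concrete, here is a minimal in-memory sketch of the two behaviours. It is not BookStack's actual code: the function names, the delimiter set, and the sample sentences are assumptions for illustration only.

```php
<?php
// Illustrative sketch only; not BookStack's real indexing or search code.

/** Split text into terms on whitespace and some punctuation (assumed delimiter set). */
function splitIntoTerms(string $text): array
{
    return preg_split('/[\s.,!?;:]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

/** "Starts With" search: keep the indexed terms that begin with the input. */
function startsWithSearch(array $terms, string $input): array
{
    return array_values(array_filter(
        $terms,
        fn (string $term): bool => str_starts_with($term, $input)
    ));
}

/** "Contains" search: look for the input anywhere in the raw content. */
function containsSearch(string $content, string $input): bool
{
    return str_contains($content, $input);
}

$english = 'My cat likes to eat orange.';
$chinese = '我的貓喜歡吃橘子';

// English is split on spaces, so each word becomes its own term.
$englishTerms = splitIntoTerms($english);
// The Chinese sentence has no spaces, so the whole sentence is a single term.
$chineseTerms = splitIntoTerms($chinese);

$englishHit = startsWithSearch($englishTerms, 'ora');  // matches the term "orange"
$chineseMiss = startsWithSearch($chineseTerms, '橘子'); // empty: the term starts with 我
$chineseHit = startsWithSearch($chineseTerms, '我的');  // matches: sentence starts with 我的
$quotedHit = containsSearch($chinese, '橘子');          // true: found mid-sentence
```

The byte-wise `str_starts_with` and `str_contains` calls are safe for valid UTF-8 input, which is why no multibyte extension is needed here.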

sosize commented 4 years ago

@ssddanbrown Could this be made a config option, to select "Starts With" or "Contains" as the search type?

Full-text search would be even better.

Or is there a quick way to modify the code?

lishuai199502 commented 4 years ago

Can I replace every "Starts With" with "Contains"? Or how should I modify the source code? Sorry, I'm a noob.

lishuai199502 commented 4 years ago

Hi all, I fixed this problem in v0.28.3 by adding a '%' in SearchService.php. In detail: in \app\Entities\SearchService.php, around line 196, change

    $query->orWhere('term', 'like', $inputTerm . '%');

to

    $query->orWhere('term', 'like', '%' . $inputTerm . '%');

Give it a try.

0x9394 commented 4 years ago

@ssddanbrown Hi, can the above fix be merged into the source?
After modifying SearchService.php I can now search both Chinese and English in the text body.

chimin-roh commented 2 years ago

(I'm Korean and the same problem occurs.) I know this issue is closed, but I'll post some info in the hope that it helps others in the future. My BookStack version: v22.07.03

In \app\Entities\Tools\SearchRunner.php, around lines 222 and 281:

※ To find terms in the middle, change

    $query->orWhere('term', 'like', $inputTerm . '%');

to

    $query->orWhere('term', 'like', '%' . $inputTerm . '%');

※ To sort correctly, change

    $termQuery->orWhere('term', 'like', $term . '%');

to

    $termQuery->orWhere('term', 'like', '%' . $term . '%');

derky1202 commented 2 years ago

nice job. thanks

charlietag commented 1 year ago

I've made a PR to make this configurable in .env:

ENHANCE_SEARCH_BAR_COMPATIBILITY=false

I hope I'm going about it the right way.

#4393

ssddanbrown commented 1 year ago

For me to properly look at addressing this, it would be useful if people could help me a little in understanding how the languages in question work. Apologies for my naivety on the subject.

charlietag commented 1 year ago

Hi @ssddanbrown, thanks for helping to solve this for non-English languages.

I hope the following helps you understand what I am trying to solve.

Assume a scenario like this:

Pages

My cat likes to eat orange.
But I want him to drink juice

In Chinese, it would be

我的貓喜歡吃橘子
但是我要他喝果汁

Database table (search_terms)

In normal search mode, the query is designed as a "starts with" match, because each value in the term column stores only one word. So it works fine in English.

My          | page
cat         | page
likes       | page
to          | page
eat         | page
orange      | page
But         | page
I           | page
want        | page
him         | page
to          | page
drink       | page
juice       | page

In Chinese, it would be stored in search_terms like this. As you can see, the term column stores multiple words in one value.

我的貓喜歡吃橘子 | page
但是我要他喝果汁 | page

English vs Chinese

My       <---> 我的
cat      <---> 貓
likes to <---> 喜歡
eat      <---> 吃
orange   <---> 橘子
But      <---> 但是
I        <---> 我
want     <---> 要
him      <---> 他
to drink <---> 喝
juice    <---> 果汁

What we actually prefer

But I'm not sure this is a good design at the indexing level.

我  | page
的  | page
貓  | page
喜  | page
歡  | page
吃  | page
橘  | page
子  | page
但  | page
是  | page
我  | page
要  | page
他  | page
喝  | page
果  | page
汁  | page
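A rough sketch of how such a per-character split could look. This is only an assumption for illustration, not BookStack code: the function names are hypothetical, and only Han characters are split individually while other runs of text stay whole.

```php
<?php
// Hypothetical sketch: emit each Han character as its own term and keep
// runs of other non-space, non-punctuation characters together.

function splitTermsPerCharacter(string $text): array
{
    // \p{Han} matches one Chinese character; the second branch matches
    // a whole run of anything that is not whitespace, Han, or punctuation.
    preg_match_all('/\p{Han}|[^\s\p{Han}\p{P}]+/u', $text, $matches);

    return $matches[0];
}

/** True when every term of the query exists in the indexed terms. */
function allTermsIndexed(array $queryTerms, array $indexedTerms): bool
{
    return count(array_diff($queryTerms, $indexedTerms)) === 0;
}

$indexed = splitTermsPerCharacter('我的貓喜歡吃橘子');
$mixed = splitTermsPerCharacter('My cat 喜歡 juice.');

// The query is split the same way, so 橘子 becomes 橘 and 子.
$found = allTermsIndexed(splitTermsPerCharacter('橘子'), $indexed);
// Caveat: character order is lost, so the reversed query also matches.
$reversedFound = allTermsIndexed(splitTermsPerCharacter('子橘'), $indexed);
```

The last line hints at why this may not be a good design on its own: matching single characters ignores adjacency, so scoring would need some extra signal to prefer pages where the characters appear together.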

Re-design

I'm not strong in the indexing area. I have a question: why not just search the pages table using like '%term%', and let the database deal with the indexing?

charlietag commented 1 year ago

Normal search

So if we search for orange cat, in Chinese that would be 橘子 貓.

Since the search_terms table contains no term that starts with 橘子 or 貓, I get nothing.

And if I search for the following, it fails:

What I hope it would be

I hope searches like the failed ones above would work.

Exact search

I can use exact search to achieve the above.

But general users will not remember to add quotes (") when searching.

ssddanbrown commented 1 year ago

Thanks for the info @charlietag.

I have a question: why not just search the pages table using like '%term%', and let the database deal with the indexing?

The database won't use indexes for queries like that. The search index is specifically built so that prefix-based matching can be performed while making use of database indexes. Additionally, "contains" matching, in the context of how things are currently built, would significantly increase accidental matches on terms that merely include the input, and therefore impact the scoring. Databases do often have fulltext indexes for "contains" search (which BookStack used to use), but those have their own complications and there's a reason we moved away from them.
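As a small illustration of the accidental-match concern, the following sketch compares prefix and "contains" matching over a handful of made-up English terms. The term list and function names are assumptions for the example and do not come from BookStack.

```php
<?php
// Illustrative only: a tiny in-memory stand-in for indexed terms (sample data).
$indexedTerms = ['cat', 'category', 'concatenate', 'scatter', 'education', 'dog'];

/** Prefix match, analogous to: term LIKE 'cat%' */
function prefixMatches(array $terms, string $input): array
{
    return array_values(array_filter(
        $terms,
        fn (string $term): bool => str_starts_with($term, $input)
    ));
}

/** Contains match, analogous to: term LIKE '%cat%' */
function containsMatches(array $terms, string $input): array
{
    return array_values(array_filter(
        $terms,
        fn (string $term): bool => str_contains($term, $input)
    ));
}

$prefix = prefixMatches($indexedTerms, 'cat');
$contains = containsMatches($indexedTerms, 'cat');
```

Here the prefix match returns only "cat" and "category", while the contains match also pulls in "concatenate", "scatter" and "education", none of which a user searching for "cat" is likely to want.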

My intention has been to alter how we split the terms for indexing and search, for different character ranges, much like you've suggested, but I just want to better understand how searches and words translate in different languages, hence my last comment.


I would still like to invite others, particularly those using other Asian languages, to answer my previous comment.

10935336 commented 1 year ago

For me to properly look at addressing this, it would be useful if people could help me a little in understanding how the languages in question work. Apologies for my naivety on the subject.

I'm not a language expert. So this answer may not be entirely accurate.

  • In the Chinese language, does a single Chinese character generally map to what is a single word in latin based languages?
In modern Chinese, most words are written with two or more characters.
https://en.wikipedia.org/wiki/Chinese_characters

But there are also some cases where a single character maps to a single Latin word.

I <--> 我
my <--> 我的
myself <--> 我自己 or 我本人 or 本人 or 独自
dog <--> 狗
cloud <--> 云
car <--> 车
  • Is a single Chinese character generally the common unit for what would be searched?

A search for a single Chinese character usually does not return useful results. But sometimes people still search for a single character, such as "猫" (cat).

Here are some searches recorded by google analytics on my website:

美好的每一天  <--> wonderful everyday(a video game title)
官网  <--> official website
宣传片  <--> promo video
巨构  <--> megastructure
指令  <--> command
文化  <--> culture
新用户  <--> new user
服务器  <--> server
添加  <--> add
猫  <--> cat
个人利益  <--> personal benefit
公共事件  <--> public event
雨  <--> rain
  • How would multiple terms be joined in a single query? For example, If I made the search query for orange cat in English, would the equivalent Chinese search query contain a space?

Words are not separated by spaces in Chinese, Japanese and Korean.
Unlike most languages, Chinese does not use spaces to group characters into words.

When searching in Chinese, you would not use spaces to separate terms in a query. Instead, you would enter the characters for each term next to each other without spaces.

So usually search engines use a tokenizer to break a sentence into words:

"人人生而自由,在尊严和权利上一律平等"
"人人", "生而", "自由", ",", "在", "尊严", "和", "权利", "上", "一律", "平等"
("all human beings", "born", "free", ",", "in", "dignity", "and", "rights", "on", "all", "equal")

"All human beings are born free and equal in dignity and rights"
"All human beings", "are born", "free", "and", "equal", "in", "dignity", "and", "rights"
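For illustration, here is a minimal sketch of one simple dictionary-based approach, forward maximum matching. This is a swapped-in textbook technique and not how any particular search engine or tokenizer library works; the dictionary below is a hand-picked assumption that happens to cover the example sentence.

```php
<?php
// Sketch of forward maximum matching: at each position take the longest
// dictionary word that fits, otherwise fall back to a single character.

/** Split a UTF-8 string into an array of characters. */
function utf8Chars(string $text): array
{
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

function segment(string $text, array $dictionary): array
{
    $chars = utf8Chars($text);
    $lookup = array_flip($dictionary);

    // Longest dictionary entry, measured in characters.
    $maxLen = 1;
    foreach ($dictionary as $word) {
        $maxLen = max($maxLen, count(utf8Chars($word)));
    }

    $tokens = [];
    $i = 0;
    $n = count($chars);
    while ($i < $n) {
        $token = $chars[$i]; // fallback: a single character
        $tokenLen = 1;
        // Try the longest candidate first, then shorter ones.
        for ($len = min($maxLen, $n - $i); $len > 1; $len--) {
            $candidate = implode('', array_slice($chars, $i, $len));
            if (isset($lookup[$candidate])) {
                $token = $candidate;
                $tokenLen = $len;
                break;
            }
        }
        $tokens[] = $token;
        $i += $tokenLen;
    }

    return $tokens;
}

// Hand-picked sample dictionary (assumption for this example only).
$dictionary = ['人人', '生而', '自由', '尊严', '权利', '一律', '平等'];
$tokens = segment('人人生而自由,在尊严和权利上一律平等', $dictionary);
```

With this dictionary the result matches the segmentation listed above. Real tokenizers need a far larger dictionary plus statistics to resolve ambiguous splits, which is where most of the difficulty lies.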

In the orange cat example, it could be 橘猫, 橘色猫, or 橘色的猫 (orange color's cat).

orange  <--> 橘子(mandarin orange) or 橙子 or 橙色(orange color)
cat  <-->  猫
methoxymethane   <-->  二甲醚 or 甲氧基甲烷

two <--> 二
methyl ether <--> 甲醚

methoxy <--> 甲氧基
methane <--> 甲烷

oxy <--> 氧基
alkyl <--> 烷
`甲` can mean a shell or armor, which is the external protective layer of an animal or a person. In this case, it can be translated as shell or armor

`甲` can mean the first of the ten heavenly stems, which is the first symbol in the cycle of ten celestial stems. In this case, it can be translated as the first of the ten heavenly stems or simply A.

`甲` can mean the first party in a list or a contract, which is the one that comes first. In this case, it can be translated as first (in a list, as a party in a contract etc).




So there seems to be no easy way to segment words.

To be honest, it is very difficult to search Sino-Tibetan languages well, so many applications I have seen use Elasticsearch as their search engine.

Even with Elasticsearch, many people are not satisfied with the official tokenizer, and many other tokenizers have been created:

Update: This may be the solution you want. Jieba is a popular (32.7K stars) Chinese word segmentation component, and this is its PHP port:

But jieba seems to consume a fair amount of memory; this module is more lightweight

matteotw commented 4 months ago

I also couldn't search for Chinese words successfully. (English keywords are OK.) I have no experience with this, but my guess is that it could be improved with something like an Asian language parser.

https://docs-develop.pleroma.social/backend/configuration/howto_search_cjk/

https://pgroonga.github.io/

kernelry commented 2 months ago

Version: v24.02.2. I think I solved the problem by modifying line 213 of /var/www/BookStack/app/Search/SearchRunner.php. Before modification:

   210          $subQuery->where(function (Builder $query) use ($terms) {
   211              foreach ($terms as $inputTerm) {
   212                  $inputTerm = str_replace('\\', '\\\\', $inputTerm);
   213                  $query->orWhere('term', 'like', $inputTerm . '%');
   214              }
   215          });

Only one result... (screenshot)

After modification:

   210          $subQuery->where(function (Builder $query) use ($terms) {
   211              foreach ($terms as $inputTerm) {
   212                  $inputTerm = str_replace('\\', '\\\\', $inputTerm);
   213                  $query->orWhere('term', 'like', '%' . $inputTerm . '%');
   214              }
   215          });

Now there are seven results! (screenshot)

charlietag commented 1 month ago

Hi @kernelry

Actually, that's what I proposed to the author.
But he has his own considerations, so for now we can only use a workaround.

Let's hope it will be fixed in a future version.

https://github.com/BookStackApp/BookStack/pull/4393