RocketChat / Rocket.Chat

The communications platform that puts data protection first.
https://rocket.chat/

Chinese characters / words can't be searched #713

Open sunnipaul opened 9 years ago

sunnipaul commented 9 years ago

For example, searching for "汉字" gives no results even though I've sent a message containing it in a chat room.

sunnipaul commented 9 years ago

Similarly, I can't create a room with a Chinese name.

marceloschmidt commented 9 years ago

Related to #640

robhawkins commented 9 years ago

TL;DR: Elastic Search or Lucene might be able to help here. This blog post gives a good explanation of the problems with searching Chinese and how Elastic can help.

I see two issues with searching non-latin text in RChat. One is that the current RChat can't search for Chinese characters (or anything much outside of [a-z0-9]). The other is that Chinese has no spaces, so the words require some segmentation for them to become searchable in an indexed full text search engine. For the second issue, I investigated a little bit to see if Mongo has implemented a segmenter for various languages yet or not. It turns out Chinese text search is in the Mongo Enterprise Edition but not yet available in free versions. It might be possible to add your own tokenizer. For instance, googling for "mongo full text custom tokenizer" turned up this blog post from 2010-11-14: Full text search with MongoDB and Lucene analyzers
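To illustrate the segmentation issue, here is a minimal sketch, not taken from Rocket.Chat or Mongo. It compares whitespace tokenization with a simple overlapping-bigram tokenizer for runs of CJK characters. The function names and the sample message are hypothetical, and bigrams are only one common fallback, not a real word segmenter.

```python
import re

def whitespace_tokens(text):
    """Split on whitespace, as a naive full-text indexer might."""
    return text.split()

def cjk_bigrams(text):
    """Emit overlapping two-character tokens for each run of CJK ideographs.

    Only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is covered;
    that range is a simplifying assumption for this sketch.
    """
    tokens = []
    for run in re.findall(r"[\u4e00-\u9fff]+", text):
        if len(run) == 1:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens

message = "我喜欢汉字"  # hypothetical chat message with no spaces

# Whitespace splitting yields the whole sentence as a single token,
# so an index keyed on tokens has no entry for the word "汉字".
print(whitespace_tokens(message))

# Bigram tokens include "汉字", so a token lookup can find it.
print(cjk_bigrams(message))
```

The bigram approach over-generates tokens (for example "欢汉" is not a word), which is the trade-off a dictionary-based segmenter tries to avoid.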

Elastic Search, on the other hand, has a tokenizer that claims to work with several languages. This blog post from 2014-12-18 gives a rundown on performance with Chinese: Efficient Chinese Search with Elastic Search. They also describe how to get a better tokenizer, Paoding, to work with Elastic.

rodrigok commented 9 years ago

Yes, we need to offer a better search engine as an option. We implemented search using MongoDB's internal search engine to keep RC easy to install while still letting users search their messages, so we need a way to let users add other search engines that solve their problems better.

We need help with this issue because this isn't our main focus now and we don't know much about other search engines.

steedos commented 9 years ago

Check this: Real-Time Search With MongoDB and Solr

http://geniuscarrier.com/real-time-search-with-mongodb-and-solr/

robhawkins commented 8 years ago

@rodrigok - I'd love to help. Implementing searchable text for other languages would be really fun and interesting. Do you all have other full-time jobs? Or are you just focused on this?

@steedos - Interesting. I wonder what solution would be the simplest to implement and maintain. Would the Mongo+Solr option require rewriting the application to work with Solr? If so then I don't know if using Mongo+Solr or just Elastic would be simpler.

rodrigok commented 8 years ago

Hi @robhawkins, we are moving our main focus to Rocket.Chat

steedos commented 8 years ago

I just tried Elastic Search. It's good at indexing and searching Chinese documents, so we just need an admin setting to enable Elastic Search and set the server URL.

With Elastic Search, we can also index office documents and PDF files. I think it's important to attach office documents in chat rooms.

sunnipaul commented 8 years ago

@steedos That would be great. Does anyone want to do it?

FerminYang commented 8 years ago

@steedos That would be great. Does anyone want to do it?

+1

TwizzyDizzy commented 6 years ago

@rocket-cat close

Closing, since this can no longer be reproduced on 0.61.1 (trying what was described in the first post).

Cheers Thomas

robhawkins commented 6 years ago

@TwizzyDizzy are Chinese characters searchable now? It's been a while since I looked and I would be interested to know. An example test would be to type in a sentence like 我的爸爸是最餓的 and then search for any of those characters individually, or for a word such as 爸爸.

Thanks!

TwizzyDizzy commented 6 years ago

The thing you described is not possible, at least not by simply typing it into the search box without a regular expression. But this goes for ASCII words as well: for example, send a message "autocarauto" and then search for "car". That doesn't work without regex either.

On the other hand: typing 我的 爸爸 是最餓的 (including spaces) and searching for 爸爸 works.

Cheers Thomas
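The behavior described above can be modeled with a small sketch. This is not Rocket.Chat's actual search code. It assumes, for illustration only, that the plain search box matches whole whitespace-separated tokens while a regex search matches substrings. Both function names are hypothetical.

```python
import re

def token_match(message, term):
    """Match only when `term` equals a whole whitespace-separated token."""
    return term in message.split()

def substring_match(message, term):
    """Match when `term` occurs anywhere in the message (regex-style search)."""
    return re.search(re.escape(term), message) is not None

# ASCII case from the thread: "car" is buried inside a longer token.
print(token_match("autocarauto", "car"))      # no whole-token hit
print(substring_match("autocarauto", "car"))  # substring hit

# Chinese case: the word is only a separate token when spaces are added.
print(token_match("我的爸爸是最餓的", "爸爸"))
print(token_match("我的 爸爸 是最餓的", "爸爸"))
print(substring_match("我的爸爸是最餓的", "爸爸"))
```

Under this model the Chinese sentence behaves exactly like "autocarauto": without spaces it is one long token, so only the substring search finds 爸爸.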

robhawkins commented 6 years ago

Chinese speakers do not use spaces in their writing (see zh wikipedia)

To make Rocket Chat friendly for Chinese speakers, I see two options:

  1. If regex is fast enough across a database full of messages, maybe just wrap Chinese character searches with * on either end
  2. If regex does not scale, perhaps this issue should remain open to track extending Rocket Chat for use by Chinese speakers. Without a fast (presumably indexed) history search, Rocket Chat isn't as useful.
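As a rough sketch of the first option, the snippet below builds a MongoDB-style `$regex` filter from a raw search term. The field name `msg` and both helper names are assumptions for illustration, not Rocket.Chat's schema or code. With a regex, no wildcard is needed on either end, because an unanchored pattern already matches anywhere in the string. The term is escaped so characters such as `.` are treated literally.

```python
import re

def build_regex_filter(term, field="msg"):
    """Return a MongoDB-style filter matching `term` anywhere in `field`.

    `field` defaults to "msg", a hypothetical name for the message text.
    """
    return {field: {"$regex": re.escape(term)}}

def matches(document, query_filter):
    """In-memory stand-in for the database, used only to try the filter."""
    return all(
        re.search(cond["$regex"], document.get(field, "")) is not None
        for field, cond in query_filter.items()
    )

query = build_regex_filter("爸爸")
print(matches({"msg": "我的爸爸是最餓的"}, query))  # sentence without spaces
print(matches({"msg": "hello"}, query))
```

Whether this is fast enough is the open question from the list above: an unanchored pattern generally has to examine every candidate message rather than use an index, which is why the second option may still be needed.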

TwizzyDizzy commented 6 years ago

@rocket-cat open

Chinese speakers do not use spaces in their writing (see zh wikipedia)

I see! That makes things difficult indeed. I'll reopen then. Thanks for your feedback!

Cheers Thomas