elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Analyzers for Vietnamese? #6647

Closed nguyenchiencong closed 10 years ago

nguyenchiencong commented 10 years ago

Any analyzers for Vietnamese on the roadmap?

Thx and cheers

jpountz commented 10 years ago

Indeed, we don't have a good tokenizer for Vietnamese today. Although we would like to have one, Vietnamese segmentation is quite hard, so I'm afraid this won't be fixed anytime soon.

nguyenchiencong commented 10 years ago

Do you by any chance have a list of Vietnamese stopwords? Thanks.

jpountz commented 10 years ago

No we don't.
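
Without a bundled list, one possible workaround is to pass your own words to Elasticsearch's `stop` token filter. The sketch below only builds the index-settings body as a Python dict and does not contact a cluster. The filter and analyzer names (`vi_stop`, `vi_basic`) are made up for illustration, and the three words are common Vietnamese function words used as placeholders, not a vetted stopword list.

```python
import json

# Placeholder words only ("và" = and, "là" = is, "của" = of); not a vetted list.
SAMPLE_STOPWORDS = ["và", "là", "của"]


def build_stopword_settings(stopwords):
    """Return an index-settings body that defines a custom `stop` token filter.

    `vi_stop` and `vi_basic` are hypothetical names chosen for this sketch.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "vi_stop": {"type": "stop", "stopwords": list(stopwords)}
                },
                "analyzer": {
                    "vi_basic": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "vi_stop"],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    body = build_stopword_settings(SAMPLE_STOPWORDS)
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

Note that the `standard` tokenizer splits Vietnamese text at whitespace, so multi-syllable words are not kept together. This only removes stopwords; it does not address the segmentation problem described above.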

anhtran commented 10 years ago

@jpountz How about this? https://github.com/CaoManhDat/VNAnalyzer It is based on the research described at http://mim.hus.vnu.edu.vn/phuonglh/tools/userguide-vnTokenizer.pdf I believe it can cover about 80-90% of cases in Vietnamese, which is good enough for searching.

jpountz commented 10 years ago

vnTokenizer is under the GPL, which would be an issue for inclusion in Lucene or Elasticsearch. However, Elasticsearch supports custom analyzers delivered as plugins, so you could write a plugin that exposes this analyzer; see for instance https://github.com/elasticsearch/elasticsearch-analysis-kuromoji

nguyenchiencong commented 10 years ago

Thanks. For those wanting a Vietnamese plugin, you can check out this one: https://github.com/duydo/elasticsearch-analysis-vietnamese
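
Once an analysis plugin is installed, a field is usually pointed at the analyzer it registers. The sketch below builds a create-index body as a Python dict, without contacting a cluster. The analyzer name `vi_analyzer` is an assumption about what the plugin registers, so check the plugin's README for the actual name. The mapping layout follows recent Elasticsearch versions (`text` fields, no mapping types); older releases used a different layout.

```python
import json

# Assumption: the installed plugin registers an analyzer under this name.
PLUGIN_ANALYZER = "vi_analyzer"


def build_index_body(field_name, analyzer=PLUGIN_ANALYZER):
    """Return a create-index body that maps one text field to `analyzer`."""
    return {
        "mappings": {
            "properties": {
                field_name: {"type": "text", "analyzer": analyzer}
            }
        }
    }


if __name__ == "__main__":
    # "content" is a hypothetical field name used only for illustration.
    print(json.dumps(build_index_body("content"), indent=2))
```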

clintongormley commented 10 years ago

@nguyenchiencong Do you want to submit a PR adding this to the plugins page here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html#analysis-plugins ?

nguyenchiencong commented 10 years ago

@duydo is the author. I think we should ask him first. @duydo it would be great if you can do it.

duydo commented 10 years ago

Thanks @nguyenchiencong for mentioning the plugin.

@clintongormley It would be great if you can add the plugin to the plugins page. Thank you.

dripp1 commented 8 years ago

That plugin has issues with highlighting offsets, at least in version 2.2.0. Has anybody been able to use it or can recommend another Vietnamese plugin?
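
For anyone trying to narrow down offset problems like this, a rough sanity check on the token offsets can help. The sketch below assumes a list of token dicts shaped like the `tokens` array returned by the `_analyze` API (`token`, `start_offset`, `end_offset`) and flags offsets that fall outside the text or move backwards. The function name is hypothetical, and the check compares offsets against Python string length, which should match Elasticsearch's offsets only for text within the Basic Multilingual Plane.

```python
def find_offset_problems(text, tokens):
    """Return (token, reason) pairs for tokens whose offsets look suspicious.

    `tokens` is a list of dicts with "token", "start_offset" and "end_offset"
    keys, in the order the analyzer emitted them.
    """
    problems = []
    previous_start = 0
    for tok in tokens:
        start, end = tok["start_offset"], tok["end_offset"]
        if start < 0 or end > len(text) or start > end:
            problems.append((tok["token"], "offsets out of range"))
        elif start < previous_start:
            problems.append((tok["token"], "start offset goes backwards"))
        else:
            previous_start = start
    return problems


if __name__ == "__main__":
    sample_text = "xin chào"
    sample_tokens = [
        {"token": "xin", "start_offset": 0, "end_offset": 3},
        {"token": "chào", "start_offset": 4, "end_offset": 20},
    ]
    for token, reason in find_offset_problems(sample_text, sample_tokens):
        print(token, "->", reason)
```

An empty result does not prove highlighting will work; it only rules out the most obvious offset errors.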

jayeshgoyal1995 commented 6 years ago

@duydo @jpountz: We are trying to support Vietnamese with Solr. Is there any plugin available to integrate https://github.com/CaoManhDat/VNAnalyzer? Also, is there any other way we can add Vietnamese support in Solr?