nguyenchiencong closed this issue 10 years ago
Indeed, we don't have a good tokenizer for Vietnamese today. Although we would like to have one, Vietnamese segmentation is quite hard, so I'm afraid this won't be fixed anytime soon.
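To illustrate why segmentation is hard: written Vietnamese puts spaces between syllables, not words, and many words span more than one syllable. The sketch below is a toy illustration only; the three-entry lexicon and the greedy longest-match strategy are assumptions made for the example, not how vnTokenizer or any plugin mentioned in this thread works. It shows that splitting on whitespace yields syllables, and that a naive dictionary lookup picks the wrong words in an ambiguous sentence.

```python
# Toy lexicon: an assumption for illustration, not a real Vietnamese dictionary.
LEXICON = {"học sinh", "sinh học", "học"}  # "student", "biology", "to study"
MAX_WORD_SYLLABLES = 2


def whitespace_tokens(text):
    """Split on whitespace; in Vietnamese this yields syllables, not words."""
    return text.split()


def longest_match_segment(text, lexicon=LEXICON, max_len=MAX_WORD_SYLLABLES):
    """Greedy left-to-right longest-match segmentation over syllables."""
    syllables = text.split()
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first; a single syllable is always accepted.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


sentence = "học sinh học sinh học"  # intended: "học sinh" / "học" / "sinh học"
print(whitespace_tokens(sentence))
print(longest_match_segment(sentence))
```

The intended reading is "the student studies biology", but the greedy pass groups the first four syllables as "học sinh" twice and leaves a stray "học". Resolving this kind of ambiguity needs context, which is why a simple rule-based tokenizer is not enough.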
Do you have by any chance a list of Vietnamese stopwords? thx
No we don't.
@jpountz How about this? https://github.com/CaoManhDat/VNAnalyzer It is based on the research at http://mim.hus.vnu.edu.vn/phuonglh/tools/userguide-vnTokenizer.pdf I believe it can cover about 80-90% of cases in Vietnamese. That's good enough for searching.
vnTokenizer is under the GPL, which would be an issue for inclusion in Lucene or Elasticsearch. However, Elasticsearch supports custom analyzers as plugins, so you could write a plugin that exposes this analyzer; see for instance https://github.com/elasticsearch/elasticsearch-analysis-kuromoji
Thanks. For those wanting a Vietnamese plugin, you can check out this one: https://github.com/duydo/elasticsearch-analysis-vietnamese
@nguyenchiencong do you want to submit a PR adding this to the plugins page here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html#analysis-plugins ?
@duydo is the author. I think we should ask him first. @duydo it would be great if you can do it.
Thanks @nguyenchiencong for mentioning the plugin.
@clintongormley It would be great if you can add the plugin to the plugins page. Thank you.
That plugin has issues with highlighting offsets, at least in version 2.2.0. Has anybody been able to use it, or can anybody recommend another Vietnamese plugin?
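For anyone debugging this kind of problem: highlighting generally relies on each token carrying start and end character offsets into the original text. The cause of the plugin's issue is not stated in this thread, so the following is only a generic sketch under one assumed failure mode, namely a tokenizer that computes offsets against a whitespace-collapsed copy of the input. The function names are hypothetical and are not part of any plugin's API.

```python
import re


def tokens_with_offsets(text):
    """Return (token, start, end) tuples with offsets into the given text."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def bad_offsets(original, tokens):
    """Return tokens whose offsets do not slice back to the token in the original."""
    return [(tok, s, e) for tok, s, e in tokens if original[s:e] != tok]


original = "Học  sinh học"  # note the double space after the first syllable

# Offsets computed directly on the original text are consistent.
good = tokens_with_offsets(original)

# Assumed bug: offsets computed on a normalized (whitespace-collapsed) copy.
collapsed = " ".join(original.split())
drifted = tokens_with_offsets(collapsed)

print(bad_offsets(original, good))
print(bad_offsets(original, drifted))
```

Running the same check against the tokens returned by a real analyzer for a given input is one way to confirm whether the offsets, and not the highlighter, are at fault.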
@duydo @jpountz: We are trying to support Vietnamese with Solr. Is there any plugin available to integrate https://github.com/CaoManhDat/VNAnalyzer ? Also, is there any other way we can add Vietnamese support in Solr?
Any analyzers for Vietnamese on the roadmap?
Thx and cheers