blevesearch / bleve

A modern text/numeric/geo-spatial/vector indexing library for go
Apache License 2.0

implement Dictionary based Compound Word TokenFilter #115

Open mschoch opened 10 years ago

mschoch commented 10 years ago

See https://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilterFactory.html

This would be useful for languages such as German and Swedish, where compound words are common and users should be able to search for the constituent words.

mschoch commented 10 years ago

We should also consider adding support for the hyphenation-based approaches.

See http://lucene.apache.org/core/4_10_2/analyzers-common/org/apache/lucene/analysis/compound/package-summary.html

dgryski commented 9 years ago

The current state of the art for German decompounding appears to be https://dl.acm.org/citation.cfm?id=1787593 , with a brief description in http://www.aclweb.org/anthology/P08-2064