Open robert-zaremba opened 9 years ago
I started taking a look at this. The Polish Analyzer has the following components:
All of these are shared/existing components, with the exception of the StempelFilter and StempelStemmer. StempelFilter seems to just wrap StempelStemmer and ignore words not longer than 3 characters.
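To make the filter behaviour described above concrete, here is a minimal sketch in Go. It is an illustration only, not the Lucene code or any existing bleve API: the `Stemmer` type, `StemFilter`, and `toyStemmer` are hypothetical names, the toy stemmer merely strips one suffix and is not the Stempel algorithm, and counting length in runes is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// Stemmer is a hypothetical stand-in for a ported StempelStemmer:
// it takes a word and returns its stem.
type Stemmer func(word string) string

// minLen mirrors the behaviour described above: words not longer
// than 3 characters are passed through without stemming.
const minLen = 3

// StemFilter applies stem to every token longer than minLen and
// leaves shorter tokens untouched. Length is counted in runes
// (an assumption), so Polish letters such as "ł" count as one character.
func StemFilter(tokens []string, stem Stemmer) []string {
	out := make([]string, len(tokens))
	for i, tok := range tokens {
		if utf8.RuneCountInString(tok) <= minLen {
			out[i] = tok // too short: skip the stemmer
			continue
		}
		out[i] = stem(tok)
	}
	return out
}

// toyStemmer is a placeholder, NOT the Stempel algorithm:
// it only strips a trailing "ami" so the filter can be exercised.
func toyStemmer(word string) string {
	return strings.TrimSuffix(word, "ami")
}

func main() {
	tokens := []string{"psami", "ami", "dom"}
	fmt.Println(StemFilter(tokens, toyStemmer))
}
```

In this sketch only "psami" reaches the stemmer; "ami" and "dom" are exactly 3 characters long and are returned unchanged.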
So, the majority of work is in porting the StempelStemmer, which is built around a separate package: org.egothor.stemmer. A copy of that package is part of Lucene here: https://github.com/apache/lucene-solr/tree/031964a148340e03564792cc8e3852a4bad577f1/lucene/analysis/stempel/src/java/org/egothor/stemmer
We have 2 options: we could do a direct port of this package, or we could understand it a bit better and build something that works the same way but reuses an existing Go implementation of the Trie.
Also, we need to look at potential licensing issues depending on how we do the port.
Moving to 1.x since adding support for new languages won't break the API.
Hey guys!
I am very interested in getting PL support into bleve and ultimately dgraph.
Is anyone else keen on resurrecting the topic? How can I help?
I previously used a Python stemmer, which I believe shares its roots with the Lucene one cited above.
I have ported the stempel stemmer here: https://github.com/blevesearch/stempel
It should be straightforward to wire this up into a bleve analyzer, I just never followed through with that part...
I have just created PR #1825
Hi, at dotGo2015 we talked about how to add a new language.
I have worked on text processing for the Polish language and can contribute here.