blevesearch / bleve

A modern text/numeric/geo-spatial/vector indexing library for go
Apache License 2.0

Polish language support #275

Open robert-zaremba opened 9 years ago

robert-zaremba commented 9 years ago

Hi, at dotGo2015 we've been talking about how to add a new language.

I have worked on text processing for the Polish language and can contribute here.

mschoch commented 9 years ago

I started taking a look at this. The Polish Analyzer has the following components:

https://github.com/apache/lucene-solr/blob/031964a148340e03564792cc8e3852a4bad577f1/lucene/analysis/stempel/src/java/org/apache/lucene/analysis/pl/PolishAnalyzer.java#L139-L148

All of these are shared/existing components, with the exception of the StempelFilter and StempelStemmer. StempelFilter seems to just wrap StempelStemmer and leave words of 3 characters or fewer unstemmed.
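The filter behaviour described above can be sketched in Go roughly as follows. This is only an illustration under assumptions: the `Stemmer` interface, `suffixStemmer`, and `StemFilter` are hypothetical names, the toy stemmer merely strips a fixed suffix and is not the Stempel algorithm, and counting length in runes is my own choice for the sketch.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// Stemmer is a hypothetical interface standing in for a ported StempelStemmer.
type Stemmer interface {
	Stem(word string) string
}

// suffixStemmer is a toy placeholder, NOT the Stempel algorithm:
// it strips one hard-coded suffix so the filter can be exercised.
type suffixStemmer struct{ suffix string }

func (s suffixStemmer) Stem(word string) string {
	return strings.TrimSuffix(word, s.suffix)
}

// minLength mirrors the "3 characters" threshold mentioned above.
const minLength = 3

// StemFilter stems every token longer than minLength characters and
// passes shorter tokens through unchanged.
func StemFilter(st Stemmer, tokens []string) []string {
	out := make([]string, 0, len(tokens))
	for _, tok := range tokens {
		// Count runes, not bytes, so Polish diacritics count as one character.
		if utf8.RuneCountInString(tok) > minLength {
			tok = st.Stem(tok)
		}
		out = append(out, tok)
	}
	return out
}

func main() {
	st := suffixStemmer{suffix: "ami"}
	// "ami" is exactly 3 characters, so it is left alone.
	fmt.Println(StemFilter(st, []string{"ami", "domami", "łąk"}))
}
```

With the toy stemmer, only "domami" is changed (to "dom"); the two 3-character tokens pass through as they are.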

So, the majority of work is in porting the StempelStemmer, which is built around a separate package: org.egothor.stemmer. A copy of that package is part of Lucene here: https://github.com/apache/lucene-solr/tree/031964a148340e03564792cc8e3852a4bad577f1/lucene/analysis/stempel/src/java/org/egothor/stemmer

We have two options: we could do a direct port of this, or we could understand it a bit better and build something that works the same way but reuses an existing Go implementation of the Trie.

Also, we need to look at potential licensing issues depending on how we do the port.

mschoch commented 8 years ago

Moving to 1.x since adding support for new languages won't break the API.

wkhere commented 3 years ago

Hey guys!

I am very interested in getting PL support into bleve and ultimately dgraph.

Anyone else keen on resurrecting the topic? How can I help?

I previously used a Python stemmer, which I believe shares its roots with the Lucene one cited above.

mschoch commented 3 years ago

I have ported the stempel stemmer here: https://github.com/blevesearch/stempel

It should be straightforward to wire this up into a bleve analyzer; I just never followed through with that part...

nickspring commented 1 year ago

I have just created PR #1825