apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Add dynamic stemmer for Ukrainian [LUCENE-7348] #8402

Open asfimport opened 8 years ago

asfimport commented 8 years ago

We're adding a dictionary-based lemmatizing analyzer for Ukrainian in #8342. It would be nice to also have a dynamic stemmer that can handle words that are not in the dictionary.


Migrated from LUCENE-7348 by Andriy Rysin.

asfimport commented 8 years ago

Andriy Rysin (migrated from JIRA)

@mikemccand Hey Michael, I've analyzed the inflection rules in the dict_uk project (https://github.com/arysin/dict_uk): it has ~4500 of them (most are simple matches, but some are regexps). Those rules cover almost all possible affixes. I can probably drop the rare and homonymous ones to get below 4k, but then the question is where to go next:

1) Having all the rules would be nice, as it would provide high accuracy and a high level of compatibility with the dictionary-based lemmatizer created in #8342 (we could probably even make a hybrid solution).
2) Having a smaller/simpler rule set would benefit performance (but to simplify it properly we would have to analyze the frequency/importance of each rule).
3) Is lemmatizing analysis good enough, or is stemming preferred? For real stemming we would have to work more on the rules to find the (pseudo)roots for each inflection rule.
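
To make the hybrid idea from option 1 concrete, here is a minimal sketch, assuming a dictionary lookup that falls back to ordered inflection rules for out-of-dictionary words. The rules, the dictionary entry, and every name (`Rule`, `suffix_rule`, `regex_rule`, `lemmatize`) are hypothetical toy examples. They are not taken from dict_uk or from the analyzer in #8342.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    pattern: re.Pattern   # matches the inflected ending of a word
    replacement: str      # ending that restores the (assumed) lemma


def suffix_rule(suffix: str, replacement: str) -> Rule:
    """Simple-match rule: a literal suffix anchored at the end of the word."""
    return Rule(re.compile(re.escape(suffix) + r"$"), replacement)


def regex_rule(pattern: str, replacement: str) -> Rule:
    """Regexp rule: the ending is constrained by its context."""
    return Rule(re.compile(pattern), replacement)


# Hypothetical toy rules; the first matching rule wins, so order matters.
RULES = [
    suffix_rule("ами", "а"),            # "книгами" -> "книга"
    regex_rule(r"(?<=[гкх])и$", "а"),   # "книги"   -> "книга"
    suffix_rule("ом", ""),              # "лісом"   -> "ліс"
]

# Hypothetical stand-in for the dictionary-based lemmatizer.
DICTIONARY = {"місті": "місто"}


def lemmatize(word: str) -> str:
    word = word.lower()
    # 1. Prefer the dictionary when the word form is known.
    if word in DICTIONARY:
        return DICTIONARY[word]
    # 2. Otherwise apply the first inflection rule that matches.
    for rule in RULES:
        result, count = rule.pattern.subn(rule.replacement, word, count=1)
        if count:
            return result
    # 3. No rule applies: leave the word unchanged.
    return word
```

Because the first matching rule wins, a real rule set would need the frequency/importance analysis mentioned in option 2 to decide on ordering and on which homonymous rules to drop.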

I looked at the existing light stemmers, and many are very basic. It looks like we'd be going in reverse: we already have a complex solution, and I'm trying to understand whether we want to make it simpler (it seems the only benefit would be performance). I also tried to google how to do stemming "right", but nothing serious jumped out at me, especially nothing applicable to Slavic languages.

Thanks.