asfimport opened this issue 9 years ago
Dawid Weiss (@dweiss) (migrated from JIRA)
There actually is a dictionary-driven lemmatization engine in Lucene (for Polish). You could simply compile a dictionary for morfologik-stemming and reuse the same code.
In fact, this is how folks at the https://www.languagetool.org/ are using it (and they have support for multiple languages).
Erlend Garåsen (migrated from JIRA)
My patch including the lemmatizer and tests.
Erlend Garåsen (migrated from JIRA)
Thanks, I will take a look at it. The lemmatizer I have written will be used at University of Oslo, so this was my contribution back to Apache.
Dawid Weiss (@dweiss) (migrated from JIRA)
Lemmatisation is a tricky thing, especially for highly inflectional languages. There are technical issues (the dictionaries can get quite big; that's why morfologik-stemming uses an automaton to encode them efficiently) and non-technical issues (lemmatisation is typically combined with morphological analysis to resolve ambiguities; otherwise it's not clear which lemma to pick for ambiguous surface forms).
Erlend Garåsen (migrated from JIRA)
This lemmatizer can do POS-tagging if it is enabled (and the dictionary contains word-class information). Ambiguous forms can either be indexed as-is or reduced to one lemma, depending on how it is configured.
We have tested this lemmatizer by indexing 200,000 fairly large texts with a dictionary containing 700,000 entries. It takes no longer than other available stemmers such as Hunspell.
I guess morphological analysis will be more time-consuming and require more memory at index time?
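The two ambiguity strategies described above (index every candidate lemma, or collapse to a single one) can be sketched as a plain dictionary lookup. This is a toy illustration, not the patch's actual API; the class name, the flag, and the sample entries are all hypothetical:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two configurations mentioned above:
// either index all candidate lemmas, or keep only the first one.
public class LemmaLookup {
    // Toy dictionary; a real one holds hundreds of thousands of entries.
    static final Map<String, List<String>> DICT = Map.of(
        "saw", List.of("see", "saw"),   // ambiguous: past tense of "see" vs the noun "saw"
        "running", List.of("run"));

    static List<String> lemmas(String form, boolean indexAllLemmas) {
        // Unknown surface forms pass through unchanged.
        List<String> candidates = DICT.getOrDefault(form, List.of(form));
        return indexAllLemmas ? candidates : List.of(candidates.get(0));
    }
}
```

With `indexAllLemmas` enabled, "saw" would be indexed under both "see" and "saw"; disabled, only the first lemma survives.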
Dawid Weiss (@dweiss) (migrated from JIRA)
> Ambiguous forms can either be indexed or reduced to one lemma.
Sure, there's some sort of workaround for everything :) I'm not saying your contribution is bad, I just said that in general it's a tricky problem. The Polish dictionary in morfologik-stemming has 4,800,433 entries. That's 300 MB of raw UTF-8 in which the PoS tags are highly ambiguous; most of it looks like this:
wracałyby wracać verb:pot:pl:m2.m3.f.n1.n2.p2.p3:ter:imperf:nonrefl+verb:pot:pl:m2.m3.f.n1.n2.p2.p3:ter:imperf:refl.nonrefl
The PoS tag is a Cartesian product of all the alternatives separated by dots...
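To make the dotted notation concrete, the following sketch (not morfologik-stemming's actual API) expands one such tag: `:` separates fields and `.` separates the alternatives within a field, so the full tag denotes the Cartesian product of its fields:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative expansion of a dotted morfologik-style PoS tag into
// its individual readings; class and method names are hypothetical.
public class TagExpander {
    static List<String> expand(String tag) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (String field : tag.split(":")) {
            List<String> next = new ArrayList<>();
            for (String prefix : results) {
                // Each dot-separated value multiplies the result set.
                for (String alt : field.split("\\.")) {
                    next.add(prefix.isEmpty() ? alt : prefix + ":" + alt);
                }
            }
            results = next;
        }
        return results;
    }
}
```

For the entry above, the single field `m2.m3.f.n1.n2.p2.p3` alone expands one analysis into seven distinct readings.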
The only way to achieve lemmatization today is to use the SynonymFilterFactory. The available stemmers are also inaccurate, since they only follow simplistic rules.
A dictionary-based lemmatizer will be more precise because it can take the part of speech into account. It therefore stems words more accurately than other dictionary-based stemmers such as Hunspell.
This is my effort to develop such a lemmatizer for Apache Lucene. The documentation is temporarily placed here: http://folk.uio.no/erlendfg/solr/lemmatizer.html
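For reference, the SynonymFilterFactory workaround mentioned above amounts to a Solr field type along these lines (a sketch only; the field type name and the `lemmas.txt` file, which would map inflected forms to lemmas one rule per line, are placeholders):

```xml
<fieldType name="text_lemma" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- lemmas.txt holds entries such as: wracałyby => wracać -->
    <filter class="solr.SynonymFilterFactory" synonyms="lemmas.txt"
            ignoreCase="true" expand="false"/>
  </analyzer>
</fieldType>
```

This works, but maintaining a full dictionary as a synonym file is exactly the kind of workaround a proper lemmatizer filter would replace.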
Migrated from LUCENE-6254 by Erlend Garåsen Attachments: LUCENE-6254.patch