Dictionary-based lemmatizer [LUCENE-6254]

apache / lucene

Apache Lucene open-source search software

https://lucene.apache.org/

Apache License 2.0

Dictionary-based lemmatizer [LUCENE-6254] #7316

Open asfimport opened 9 years ago

asfimport commented 9 years ago

The only way to achieve lemmatization today is to use the SynonymFilterFactory. The available stemmers are also inaccurate, since they only follow simplistic rules.

A dictionary-based lemmatizer can be more precise because it can take the part of speech into account. It therefore reduces words to their base forms more accurately than other dictionary-based stemmers such as Hunspell.

This is my effort to develop such a lemmatizer for Apache Lucene. The documentation is temporarily placed here: http://folk.uio.no/erlendfg/solr/lemmatizer.html

Migrated from LUCENE-6254 by Erlend Garåsen. Attachments: LUCENE-6254.patch

asfimport commented 9 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

There actually is a dictionary-driven lemmatization engine in Lucene (for Polish). You could simply compile a dictionary for morfologik-stemming and reuse the same code.

In fact, this is how the folks at LanguageTool (https://www.languagetool.org/) are using it (and they have support for multiple languages).

asfimport commented 9 years ago

Erlend Garåsen (migrated from JIRA)

My patch including the lemmatizer and tests.

asfimport commented 9 years ago

Erlend Garåsen (migrated from JIRA)

Thanks, I will take a look at it. The lemmatizer I have written will be used at the University of Oslo, so this was my contribution back to Apache.

asfimport commented 9 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Lemmatisation is a tricky thing, especially for highly inflectional languages. There are technical issues (the dictionaries can get quite big; that's why morfologik-stemming uses an automaton to encode them efficiently) and non-technical issues (lemmatisation is typically combined with morphological analysis to resolve ambiguities, otherwise it's not clear which lemma to pick for ambiguous surface forms).
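To illustrate why an automaton helps with large dictionaries, here is a minimal sketch of a plain trie, which stores each shared prefix only once. This is not morfologik-stemming's actual data structure or file format (a minimized automaton also shares suffixes, which a trie does not); the names `build_trie`, `count_nodes` and `contains` are hypothetical and exist only for this example.

```python
# Illustrative only: a plain prefix tree, NOT the morfologik-stemming format.
END = "$end"  # sentinel key marking the end of a word; never a single character

# A few inflected forms of the Polish verb "wracać" (hypothetical word list).
FORMS = ["wracać", "wracał", "wracała", "wracały", "wracałyby"]


def build_trie(words):
    """Insert every word into a nested-dict trie, one dict per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def count_nodes(node):
    """Count character nodes below `node`, ignoring end-of-word markers."""
    return sum(1 + count_nodes(child)
               for key, child in node.items() if key != END)


def contains(root, word):
    """Return True if `word` was inserted as a complete word."""
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node


trie = build_trie(FORMS)
raw_characters = sum(len(word) for word in FORMS)
```

For these five forms the raw list holds 35 characters, while the trie needs only 11 character nodes because the common prefix is stored once. The saving grows with dictionaries in which many forms share long prefixes.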

asfimport commented 9 years ago

Erlend Garåsen (migrated from JIRA)

This lemmatizer can do POS-tagging if it's enabled (and if the dictionary has information about word classes). Ambiguous forms can either be indexed or reduced to one lemma, depending on how it is configured.

We have tested this lemmatizer by indexing 200,000 larger texts with a dictionary containing 700,000 entries. Indexing does not take any longer than with one of the other available stemmers, such as Hunspell.
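The behaviour described above can be sketched roughly as follows. This is not the code from the attached patch: the class `DictionaryLemmatizer`, the `keep_ambiguous` switch, the toy English entries and the POS labels are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy dictionary: (surface form, lemma, part of speech).
ENTRIES = [
    ("saw", "see", "VERB"),
    ("saw", "saw", "NOUN"),
    ("seen", "see", "VERB"),
    ("saws", "saw", "NOUN"),
]


class DictionaryLemmatizer:
    """Look up lemmas in a dictionary, optionally filtered by a POS tag."""

    def __init__(self, entries, keep_ambiguous=True):
        # keep_ambiguous=True returns every candidate lemma (index them all);
        # keep_ambiguous=False reduces an ambiguous form to a single lemma.
        self.keep_ambiguous = keep_ambiguous
        self.table = defaultdict(list)
        for surface, lemma, pos in entries:
            self.table[surface.lower()].append((lemma, pos))

    def lemmas(self, token, pos=None):
        candidates = self.table.get(token.lower(), [])
        if pos is not None:
            tagged = [c for c in candidates if c[1] == pos]
            # Fall back to all candidates if the tag matches nothing.
            candidates = tagged or candidates
        if not candidates:
            return [token]  # unknown word: pass it through unchanged
        # De-duplicate while keeping dictionary order.
        found = list(dict.fromkeys(lemma for lemma, _ in candidates))
        return found if self.keep_ambiguous else found[:1]


index_all = DictionaryLemmatizer(ENTRIES)
pick_one = DictionaryLemmatizer(ENTRIES, keep_ambiguous=False)
```

With these entries, `index_all.lemmas("saw")` yields both `see` and `saw`, `pick_one.lemmas("saw")` keeps only the first dictionary entry, and `index_all.lemmas("saw", pos="NOUN")` resolves the ambiguity to `saw`. Picking the first entry is an arbitrary choice in this sketch; a real implementation would need a better rule for selecting the single lemma.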

I guess morphological analysis will be more time-consuming and require more memory at index time?

asfimport commented 9 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

> Ambiguous forms can either be indexed or reduced to one lemma.

Sure, there's some sort of workaround for everything :) I'm not saying your contribution is bad or anything, I just said that in general it's a tricky problem. The Polish dictionary in morfologik-stemming has 4,800,433 entries. That's 300 MB of raw UTF-8 where PoS tags are highly ambiguous; most of it looks like this:

wracałyby wracać verb:pot:pl:m2.m3.f.n1.n2.p2.p3:ter:imperf:nonrefl+verb:pot:pl:m2.m3.f.n1.n2.p2.p3:ter:imperf:refl.nonrefl

The PoS tag is a Cartesian product of all the alternatives separated by dots...
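To make the Cartesian product concrete, here is a small sketch that splits the sample line and expands its tag into individual readings. The separators are read off the sample above and are an assumption: whitespace between the three columns, `+` between alternative tags, `:` between fields, and `.` between the values of one field. The function names are hypothetical.

```python
from itertools import product

# The sample entry quoted above: surface form, lemma, compact PoS tag.
SAMPLE = ("wracałyby wracać "
          "verb:pot:pl:m2.m3.f.n1.n2.p2.p3:ter:imperf:nonrefl"
          "+verb:pot:pl:m2.m3.f.n1.n2.p2.p3:ter:imperf:refl.nonrefl")


def parse_entry(line):
    """Split a line into (surface form, lemma, compact tag string)."""
    surface, lemma, tags = line.split(None, 2)
    return surface, lemma, tags


def expand_tags(tags):
    """Expand a compact tag into one fully specified reading per combination."""
    readings = []
    for alternative in tags.split("+"):
        # Each field becomes a list of its dot-separated values.
        fields = [field.split(".") for field in alternative.split(":")]
        # The Cartesian product picks one value from every field.
        readings.extend(":".join(combo) for combo in product(*fields))
    return readings


surface, lemma, tags = parse_entry(SAMPLE)
readings = expand_tags(tags)
```

Under these assumptions the single sample line expands to 21 readings (7 from the first alternative, 14 from the second), of which 14 are distinct, since the `nonrefl` readings appear in both alternatives.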