dzieciou / pystempel

Python port of Stempel, an algorithmic stemmer for Polish language.

Weird stemming with many words #7

Closed abhishekkrthakur closed 4 years ago

abhishekkrthakur commented 5 years ago

The stemmer has a weird behavior with quite a lot of words:

joyce -> ąć
wielce -> ąć
piwko -> ąć
royce -> ąć
pip -> ąć

To reproduce:

In [1]: from stempel import StempelStemmer

In [2]: stemmer = StempelStemmer.default()

In [3]: for word in ['joyce', 'wielce', 'piwko', 'royce', 'pip']:
   ...:     print(stemmer.stem(word))
   ...:     
ąć
ąć
ąć
ąć
ąć
dzieciou commented 5 years ago

That's the limitation of this stemmer. The accuracy of the stemmer depends on the rules it has learned and the technique itself.

The original stemmer has similar limitations:

import os

import tests.base as base

# Resolve paths relative to this test module.
pwd = os.path.dirname(os.path.abspath(__file__))
stemmer_table_fpath = os.path.join(pwd, '..', 'stempel', 'stemmer_20000.tbl')
jar_fpath = os.path.join(os.getcwd(), 'stempel-8.1.1.jar')
dict_fpath = os.path.join(pwd, 'sjp_dict.txt')

# Load the original Java stemmer from the JAR for comparison.
java_stemmer = base.get_java_stemmer(stemmer_table_fpath, jar_fpath)

for word in ['joyce', 'wielce', 'piwko', 'royce', 'pip']:
    print(java_stemmer.stem(word).toString())
ąć
ąć
ąć
ąć
ąć

I am not surprised it does not stem non-Polish words like 'joyce' or 'royce', although I would expect it to return the original word if it does not match anything known. I am particularly surprised it cannot handle words like 'wielce', 'piwko', and 'pip', as those seem to be forms of Polish words. I will try to talk to the original author of the stemmer to see if anything can be done about that and come back to you with a response.

If you're looking for a solution with higher accuracy, I would try looking for a lemmatizer that actually finds the base form of a word using a dictionary and morphosyntactic analysis. I am actually trying to refactor one of the academic solutions for that purpose, as I could not find anything working for Polish.

abhishekkrthakur commented 5 years ago

@dzieciou Thank you for the reply. In cases like this the stemmer should return the original word instead, as you mentioned. That's what nltk stemmers do too. A simple way of doing this would be to check if the first character (or first couple of characters) of the stemmed word are the same as those of the original word. What do you think?
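
The check suggested above could be sketched roughly as follows. This is only an illustration, not pystempel's API and not necessarily what was eventually merged: `stem_with_fallback`, its `prefix_len` parameter, and the `fake_stem` stand-in are all hypothetical names, and the stand-in's mapping only mimics the outputs reported in this issue plus one made-up entry.

```python
from typing import Callable, Optional


def stem_with_fallback(
    word: str,
    stem: Callable[[str], Optional[str]],
    prefix_len: int = 1,
) -> str:
    """Return stem(word), or the original word if the stem looks unrelated.

    Hypothetical helper: the stem is accepted only when its first
    `prefix_len` characters match those of the input (case-insensitive).
    """
    candidate = stem(word)
    # Fall back when the stemmer returns nothing at all.
    if not candidate:
        return word
    # Fall back when the stem starts differently from the original word.
    if candidate[:prefix_len].lower() != word[:prefix_len].lower():
        return word
    return candidate


# Stand-in for a real stemmer (assumption: the real one would be passed as
# `stemmer.stem`). The 'ąć' entries mirror the outputs reported in this issue;
# the 'kotami' entry is made up purely for illustration.
_FAKE_TABLE = {"joyce": "ąć", "piwko": "ąć", "pip": "ąć", "kotami": "kot"}


def fake_stem(word: str) -> Optional[str]:
    return _FAKE_TABLE.get(word)


for w in ["joyce", "piwko", "pip", "kotami"]:
    print(w, "->", stem_with_fallback(w, fake_stem))
```

With the real library, one would presumably pass `stemmer.stem` in place of `fake_stem`. One trade-off of this heuristic is that any correct stem that happens to begin with a different letter than the input would also be discarded.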

dzieciou commented 5 years ago

@abhishekkrthakur I do agree although I am not an expert in stemmers.

dzieciou commented 4 years ago

@abhishekkrthakur Here's the answer to your concern from the Lucene mailing list: https://lucene.472066.n3.nabble.com/Limitations-of-StempelStemmer-td4449581.html. Lucene is where I took the original version of the stemmer from.

As Dawid Weiss suggests, pretraining on a large dataset might be an option. You may read the links provided by Dawid or contact the original author, Andrzej Białecki, instead.

Alternatively, you may want to use a lemmatizer. I'm working on one here: https://github.com/dzieciou/lemmatizer-pl. Note that it is slower than the stemmer and thus may be a better fit for offline jobs than for real-time/online applications.

dzieciou commented 4 years ago

This has been addressed in PR: https://github.com/dzieciou/pystempel/pull/10. Closing.