Stemmer and lemmatizer for Indonesian (Bahasa Indonesia)

mpstemmer

mpstemmer (multi-phase stemmer) is a stemmer for Indonesian.

The core stemming algorithm follows Adriani et al. (2007), modified to handle both standard (baku) and nonstandard (tak baku) words.

Installation

pip install --upgrade git+https://github.com/ariaghora/mpstemmer.git

Usage

from mpstemmer import MPStemmer

stemmer = MPStemmer()

print(stemmer.stem('mengemudi')) # => kemudi
print(stemmer.stem('belajar')) # => ajar
print(stemmer.stem('ngelepas')) # => lepas
print(stemmer.stem('kebayang')) # => bayang

print(stemmer.stem_kalimat('ngelupain mantan tuh ngga susah kok bro'))
# => lupa mantan itu tidak susah kok bro
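
When the same words recur across a token list, it can help to call stem() only once per distinct word. The following is a minimal sketch of that idea, not part of mpstemmer itself: stem_tokens and LookupStemmer are hypothetical names introduced for illustration, and the only thing assumed about the stemmer is a stem(word) method like the one shown above. LookupStemmer is a stand-in that repeats the outputs listed above, so the sketch runs without the package installed.

```python
from typing import Dict, Iterable, List, Protocol


class Stemmer(Protocol):
    """Any object exposing stem(word) -> str, as MPStemmer does in the usage above."""

    def stem(self, word: str) -> str: ...


def stem_tokens(stemmer: Stemmer, tokens: Iterable[str]) -> List[str]:
    """Stem each token, calling stemmer.stem() only once per distinct word."""
    cache: Dict[str, str] = {}
    result: List[str] = []
    for token in tokens:
        if token not in cache:
            cache[token] = stemmer.stem(token)
        result.append(cache[token])
    return result


class LookupStemmer:
    """Hypothetical stand-in (not part of mpstemmer) used to keep the sketch runnable.

    Its table repeats the outputs listed in the usage example above.
    """

    TABLE = {
        'mengemudi': 'kemudi',
        'belajar': 'ajar',
        'ngelepas': 'lepas',
        'kebayang': 'bayang',
    }

    def __init__(self) -> None:
        self.calls = 0  # counts how often stem() is actually invoked

    def stem(self, word: str) -> str:
        self.calls += 1
        return self.TABLE.get(word, word)


stemmer = LookupStemmer()  # swap in MPStemmer() when the package is installed
tokens = ['belajar', 'mengemudi', 'belajar', 'kebayang']
stems = stem_tokens(stemmer, tokens)
print(stems)          # => ['ajar', 'kemudi', 'ajar', 'bayang']
print(stemmer.calls)  # => 3
```

The cache lives only for one call to stem_tokens, so nothing is shared between calls.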

Performance comparison

Please refer to this page for an in-depth comparison against PySastrawi and other existing works.

Citation

@Misc{PrabonoMpstemmer2020,
title = {Mpstemmer: a multi-phase stemmer for standard and nonstandard Indonesian words},
author = {Prabono, Aria Ghora},
year = {2020},
url = {https://github.com/ariaghora/mpstemmer}
}

References

Known issues and limitations