clips / pattern

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
https://github.com/clips/pattern/wiki
BSD 3-Clause "New" or "Revised" License

Is Porter stemmer working correctly? #225

Open ni9elf opened 6 years ago

ni9elf commented 6 years ago

I wrote a Python script to check the output of pattern's implementation of the Porter2 stemmer (in the vector module) against the output of the original implementation by Martin Porter.

Martin Porter provides a test vocabulary of 29417 words together with the stemmed output his implementation produces for each word. My script runs pattern's stemmer over the same vocabulary and compares the results word by word. It found 215 mismatches (about 0.73% of the vocabulary), which it writes to the file errors.txt; the script is available here. Sample preview:

word_input   original_output   pattern_output
aimlessly    aimless           aimlessli
gazelle      gazell            gazel
narratives   narrat            narr

Pattern implements the Porter stemmer in the vector module. To use it, first import it with from pattern.vector import stem, PORTER, and then call stem(input, stemmer=PORTER). My code is available here: https://github.com/ni9elf/PatternClipsExperiments
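
For readers who want to reproduce the check without opening the repository, the comparison can be sketched roughly as below. This is not the script from the repository linked above, only a minimal illustration of the idea. The helper names find_mismatches and read_lines and the file names voc.txt and output.txt are assumptions made for this sketch. The demo at the bottom uses a lookup table built from the three sample rows above as a stand-in for pattern's stemmer, so it runs without pattern installed.

```python
def find_mismatches(words, expected, stem_fn):
    """Return (word, expected_stem, actual_stem) for every word whose
    stem from stem_fn differs from the reference stem."""
    mismatches = []
    for word, want in zip(words, expected):
        got = stem_fn(word)
        if got != want:
            mismatches.append((word, want, got))
    return mismatches


def read_lines(path):
    """Read one token per line, skipping blank lines (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# With pattern installed and the reference files downloaded, the check
# would look something like this (file names are assumptions):
#
#   from pattern.vector import stem, PORTER
#   words = read_lines("voc.txt")
#   expected = read_lines("output.txt")
#   errors = find_mismatches(words, expected,
#                            lambda w: stem(w, stemmer=PORTER))

# Stand-in demo: replay the pattern outputs reported in the sample preview.
sample_words = ["aimlessly", "gazelle", "narratives"]
sample_expected = ["aimless", "gazell", "narrat"]
reported_pattern_output = {
    "aimlessly": "aimlessli",
    "gazelle": "gazel",
    "narratives": "narr",
}

mismatches = find_mismatches(
    sample_words, sample_expected, reported_pattern_output.get
)
for word, want, got in mismatches:
    print(word, want, got)
```

Passing the stemmer in as a function keeps the comparison independent of any one library, so the same loop could be pointed at another stemmer implementation to see whether it agrees with the reference output.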