angeloskath / php-nlp-tools

Natural Language Processing Tools in PHP
Do What The F*ck You Want To Public License
743 stars 152 forks

Different PorterStemmer result with other stemmers #43

Closed iranianpep closed 7 years ago

iranianpep commented 7 years ago

Hi,

Currently, I use Python nltk for tokenisation and stemming in my PHP project. However, since calling shell scripts from PHP is slow, I'm trying to use a PHP library instead. Based on my tests, php-nlp-tools returns different results compared to the Python stemmer. For example, if you stem says, stem() in PorterStemmer returns sai, whereas http://text-processing.com/demo/stem/ returns say.

Do you have any idea what causes this difference? Other than this, it's pretty fast.

Cheers, Ehsan

angeloskath commented 7 years ago

Hi,

Sorry for the rather late reply. I hope you are having a great holiday.

I wrote the stemmer myself (trying to eke out a bit more performance by removing regexes), modelling it directly on the ANSI C implementation by the algorithm's author, Martin Porter. If you compile the C programs in the above link, or even if you try out the Python implementation, they all output sai for say.

In fact the implementation of the Porter stemmer in NlpTools is tested against approximately 23,000 words stemmed using the ANSI C implementation from Martin Porter.
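To illustrate where sai comes from, here is a minimal sketch of just the two rules of the original algorithm that touch says: step 1a strips the plural s, and step 1c turns a final y into i when the rest of the word contains a vowel. The function names are made up for this illustration and are not taken from NlpTools or the C code; step 1b and steps 2 to 5 are omitted, so this only matches the full stemmer for words those steps leave unchanged.

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    """Porter's definition: a, e, i, o, u are vowels; y is a vowel
    only when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y at the start of a word, or after a vowel, counts as a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True


def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def step1a(word):
    # Plural rules, longest suffix first
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word       # caress -> caress
    if word.endswith("s"):
        return word[:-1]  # says -> say
    return word


def step1c(word):
    # Original rule: (*v*) Y -> I
    if word.endswith("y") and contains_vowel(word[:-1]):
        return word[:-1] + "i"
    return word


def partial_stem(word):
    # Hypothetical helper: only steps 1a and 1c, nothing else
    return step1c(step1a(word))


print(partial_stem("says"))  # says -> say -> sai
```

Both says and say end up as sai, while a word such as sky is left alone because sk contains no vowel.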

If you read porter.py from the nltk project, you can see that they have made several extensions that improve the algorithm, and in the future those extensions will be optional.
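As a rough illustration of the kind of extension involved, the sketch below contrasts the original y to i rule with a restricted variant that only fires when the y follows a consonant and the remaining stem is longer than one letter. That restricted condition is my reading of nltk's porter.py and should be treated as an assumption; the function names are hypothetical, and nltk's other extensions are not shown.

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    """Porter's definition: y is a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def step1c_original(word):
    # Original rule: (*v*) Y -> I, i.e. the stem only needs to contain a vowel
    stem = word[:-1]
    has_vowel = any(not is_consonant(stem, i) for i in range(len(stem)))
    if word.endswith("y") and has_vowel:
        return stem + "i"
    return word


def step1c_restricted(word):
    # Assumed nltk-style rule: the letter before y must be a consonant,
    # and the stem must be more than a single letter
    stem = word[:-1]
    if word.endswith("y") and len(stem) > 1 and is_consonant(stem, len(stem) - 1):
        return stem + "i"
    return word


for w in ("say", "enjoy", "happy"):
    print(w, step1c_original(w), step1c_restricted(w))
```

Under the restricted rule say and enjoy keep their y, which would account for the say result reported above, while happy still becomes happi under both rules.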

iranianpep commented 7 years ago

Hi,

No worries. I've made my decision to use it in my project: https://github.com/iranianpep/slackbot/

Cheers, Ehsan