angeloskath / php-nlp-tools

Natural Language Processing Tools in PHP
Do What The F*ck You Want To Public License
743 stars 152 forks source link

The PorterStemmer breaks on the word "ion" #39

Closed apmatthews closed 8 years ago

apmatthews commented 8 years ago

On line 375 of the PorterStemmer class, there is a check to see if the word you are stemming ends in "ion" using the ends() function. Ion, of course, is an english word. When you pass the word "ion" to the stemmer, it will throw a PHP error:

Uninitialized string offset: -1 in PorterStemmer.php on line 375

That is because the ends() function will determine that "ion" does in fact end in "ion" and thus sets $this->j = -1, which is the value of $this->k-$length where $this->k = 2 and $length = 3.

After the ends() check on line 375, there is an additional check to see if the character preceding the "ion" ending is either an "s" or a "t" by reading $this->b[$this->j]. However, because $this->j = -1, the character check will throw the aforementioned php error.

There are two apparent fixes. One would be to alter the ends() function to return false if the ending string ($s) is equal to the word ($this->k). I think that makes the ends() function more semantically accurate. Another solution would be to split line 375 into the following:

if ($this->ends("tion",3))
    break;
if ($this->ends("sion",3))
    break;`

I'm guessing this would slightly reduce efficiency, but would still solve the problem.

Any suggestions on which route to take? I'm happy to submit a pull request.

angeloskath commented 8 years ago

Hi,

Thanks for this find!

I believe it is best to change the ends() function so that it returns false if the word has the same length as the suffix we are checking for.

I think the best first step is to add the word 'ion' in the data/Stemmers/PorterStemmerTest/words.txt and ..../stems.txt.

It is a small fix and I would be very happy to merge a pull request if you submit one.

P.S.: Kudos for digging into one of the worst parts of the library and actually finding the solution as well.

angeloskath commented 8 years ago

Fixed by merging #40 in 677c3d760c7c738de6d436e710d27eba15d00c40