Closed apmatthews closed 8 years ago
Hi,
Thanks for this find!
I believe it is best to change the ends()
function so that it returns false if the word has the same length as the suffix we are checking for.
I think the best first step is to add the word 'ion' in the data/Stemmers/PorterStemmerTest/words.txt and ..../stems.txt.
It is a small fix and I would be very happy to merge a pull request if you submit one.
P.S.: Kudos for digging into one of the worst parts of the library and actually finding the solution as well.
Fixed by merging #40 in 677c3d760c7c738de6d436e710d27eba15d00c40
On line 375 of the PorterStemmer class, there is a check to see if the word you are stemming ends in "ion" using the
ends()
function. Ion, of course, is an english word. When you pass the word "ion" to the stemmer, it will throw a PHP error:That is because the
ends()
function will determine that "ion" does in fact end in "ion" and thus sets$this->j = -1
, which is the value of$this->k-$length
where$this->k = 2
and$length = 3
.After the
ends()
check on line 375, there is an additional check to see if the character preceding the "ion" ending is either an "s" or a "t" by reading$this->b[$this->j]
. However, because$this->j = -1
, the character check will throw the aforementioned php error.There are two apparent fixes. One would be to alter the
ends()
function to return false if the ending string ($s
) is equal to the word ($this->k
). I think that makes theends()
function more semantically accurate. Another solution would be to split line 375 into the following:I'm guessing this would slightly reduce efficiency, but would still solve the problem.
Any suggestions on which route to take? I'm happy to submit a pull request.