MihaiValentin / lunr-languages

A collection of languages stemmers and stopwords for Lunr Javascript library
Other
431 stars 163 forks source link

Stop word lists should be sorted. #8

Closed olivernn closed 9 years ago

olivernn commented 9 years ago

Fixes #7

Stop word filters use a lunr.SortedSet which requires its elements to be maintained in a sorted array. This change ensures that the stop words used to load the sorted set are already sorted.

If the stop words are not presorted then the stop word filter will not filter all the required stop words.

I have only changed the build script, not included in this commit is the results of running the build, though it did work for me locally. Let me know if you want me to add the results of running the build script to this pull request.

MihaiValentin commented 9 years ago

Hi @olivernn.

Thanks for putting in the effort to investigate this issue.

I've patched my local build.js file with your change (.sort()). Though it fixes the initial issue, it seems to mess up the german stopwords, when the sort outputs something like this (zu zum zur zwar zwischen über). It seems über is after words starting with z, which is wrong.

Some stopwords from which I build the languages files are sorted, and some are not. I can think of two solutions:

Personally I would prefer the first solution, since then stop-words can be added anywhere without any issues.

@olivernn Any feedback regarding this?

olivernn commented 9 years ago

Regarding the sorting, I think that as long as the sorting is consistent, even if we think it is wrong, it shouldn't matter. As long as sort() understands the same ordering as lunr.SortedSet (and I think it does) then the binary search implementation of lunr.SortedSet will still work correctly and all added stop words will be picked up. You could add a test, that with the current sorting, all the stop words are reported as present in a lunr.SortedSet to verify this.

Alternatively perhaps lunr.SortedSet is the wrong data structure to store the stop words. A hash would probably be a better data structure, the stop word filters are always only asking whether a certain token is a stop word and they never care about the whole set of stop words, let alone the ordering of them. I don't know why I implemented the english stop word filter using the sorted set, I'll likely change this in an upcoming release.

MihaiValentin commented 9 years ago

Hi @olivernn ,

Indeed, the sort function behaves the same as the < and > operators from the lunr.SortedSet. So it is safe to proceed with sorting. I've taken in your change, rebuilt the full and minified languages.

Please let me know when you will change the stopWords from sorted set to another data structure so I can update accordingly.

Thanks!