CloudCannon / pagefind

Static low-bandwidth search at scale
https://pagefind.app
MIT License

Feature request: don't index stop words #48

Closed Yoda-Soda closed 2 years ago

Yoda-Soda commented 2 years ago

Stop words are frequently occurring, insignificant words. By not indexing stop words, we would be able to return a more precise result list. This is a common approach in search tools. It would be good to also have this be configurable. As an extension of this feature, it would be great if stop word handling also supported multilingual sites. I think this feature would also improve search performance as a side effect?

bglw commented 2 years ago

Any stop word handling is actually something I removed before release — I can talk through why.

Reason №1 that people implement stop words is to make the index smaller, which isn't too much of a concern for Pagefind due to the chunking strategy. Since chunks are roughly fixed in size, we can afford to keep stop words in the index.

Reason №2 is to improve search ranking, so that searching for "the editor" cares more about "editor" than "the". Removing stop words altogether is a heavy-handed way of solving this, and the planned implementation of word ranking with BM25 will de-rank "the" automatically without needing to put it on a list, while still keeping it in the index.
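To illustrate why a BM25-style ranking makes a stop word list unnecessary, here is a minimal sketch of the inverse document frequency (IDF) term that BM25 uses. It assumes the common non-negative IDF variant and a toy four-page corpus; the function names and sample pages are made up for illustration, and this is not Pagefind's actual implementation.

```python
import math


def bm25_idf(doc_count: int, docs_with_term: int) -> float:
    """Non-negative BM25 IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))


# Hypothetical toy corpus: "the" appears on every page, "editor" on only two.
pages = [
    "the editor opens the file",
    "the index is split into chunks",
    "the search runs in the browser",
    "configure the editor theme",
]
tokenized = [page.split() for page in pages]


def idf(term: str) -> float:
    """IDF of a term over the toy corpus above."""
    docs_with_term = sum(1 for words in tokenized if term in words)
    return bm25_idf(len(tokenized), docs_with_term)


idf_the = idf("the")
idf_editor = idf("editor")
print(f"the: {idf_the:.3f}, editor: {idf_editor:.3f}")
```

Because "the" occurs on every page, its IDF is close to zero, so it contributes very little to a page's score, while "editor" carries most of the weight. The word stays in the index and is simply down-weighted.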

The biggest reason to continue indexing all words is that it improves our exact phrase matching: without stop words in the index, searching for "around the world" might return results with any other word in the middle. Also, it's hard to find stop word lists that are reliably safe to ignore, especially for documentation sites where a lot of common stop words are important to search for. The first list I used had "date" as a stop word, for example. Or maybe you want to search MDN for "let".
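The phrase-matching problem can be sketched with a tiny positional index. This is only an illustration under assumed behavior (stop words are skipped when indexing but still occupy a word position); the helper names are hypothetical and do not reflect how Pagefind stores its index.

```python
def index_positions(text: str, stop_words: set) -> dict:
    """Map each indexed word to its positions; stop words are skipped but still use up a position."""
    positions = {}
    for pos, word in enumerate(text.lower().split()):
        if word not in stop_words:
            positions.setdefault(word, set()).add(pos)
    return positions


def phrase_match(query: str, text: str, stop_words: set) -> bool:
    """Return True if every indexed query word appears at the right offset from the first one."""
    index = index_positions(text, stop_words)
    terms = [
        (offset, word)
        for offset, word in enumerate(query.lower().split())
        if word not in stop_words
    ]
    if not terms:
        return False
    first_offset, first_word = terms[0]
    for start in index.get(first_word, ()):
        base = start - first_offset
        if all(base + offset in index.get(word, ()) for offset, word in terms):
            return True
    return False


page = "a trip around my world"

# With "the" treated as a stop word, the middle slot can be filled by any word.
loose = phrase_match("around the world", page, {"the"})

# With every word indexed, the phrase has to match exactly.
strict = phrase_match("around the world", page, set())
```

Here `loose` is `True` because only "around" and "world" are checked, two positions apart, while `strict` is `False` because "the" must actually appear between them.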

In any case the word weighting work will provide everything we need around improving the search rankings, without harming search precision 🙂