CloudCannon / pagefind

Static low-bandwidth search at scale
https://pagefind.app
MIT License

Difference in the results by including `a` in the middle. #627

Open shubham-padia opened 1 month ago

shubham-padia commented 1 month ago

Thanks for this lightweight library for search :)

I'm currently prototyping an Astro/Starlight website which uses Pagefind for its search. I've deployed a very rough prototype, which migrates some existing documentation, to https://zulip-help-center-search-starlight.netlify.app/.

When searching for `manage user`, it gives the following results, which are accurate. [image]

But when searching for `manage a user`, the page is somehow not in the top 20 results. [image]

Looking at Starlight's codebase, it looks like they are using the library as-is. Pagefind version in Starlight: "pagefind": "^1.0.3".

Any help in explaining this behaviour will be appreciated, thanks!

shubham-padia commented 1 month ago

I've tried upgrading to pagefind v1.1.0 locally and the problem still persists.

bglw commented 1 month ago

Ah interesting. Yeah this does highlight an issue with the current ranking system, I'll look to improve this.

To remedy this, you should be able to crank the `termSimilarity` setting way up, e.g.:

```javascript
new PagefindUI({
    element: "#search",
    ranking: {
        termSimilarity: 10 // or 20 or 100 or 1000
    }
});
```

shubham-padia commented 1 month ago

Thanks for the reply! Increasing the setting made the results worse for me. In addition to that, increasing/decreasing the other settings also didn't help a lot.

But I removed the stop words in `processTerm` of the Pagefind UI and it worked fine.
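
For anyone landing here with the same problem, a minimal sketch of that kind of workaround might look like the following. It assumes the stop words are stripped from the query inside the `processTerm` option of `PagefindUI`. The `STOP_WORDS` list and the `stripStopWords` helper are hypothetical names made up for illustration, not part of Pagefind.

```javascript
// Hypothetical stop-word list; tune it for your own content.
const STOP_WORDS = new Set(["a", "an", "the", "of", "to"]);

// Drop stop words from the query. If every word is a stop word,
// return the original term so the search is never emptied.
function stripStopWords(term) {
    const kept = term
        .split(/\s+/)
        .filter((word) => word && !STOP_WORDS.has(word.toLowerCase()));
    return kept.length > 0 ? kept.join(" ") : term;
}

// PagefindUI only exists in the browser once pagefind-ui.js has loaded,
// so guard the call to keep this snippet runnable elsewhere too.
if (typeof PagefindUI !== "undefined") {
    new PagefindUI({
        element: "#search",
        processTerm: stripStopWords
    });
}
```

With this in place, `manage a user` is rewritten to `manage user` before the search runs, so both queries should hit the same results. The trade-off is that stop words can no longer influence matching at all.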

shubham-padia commented 1 month ago

Reopening after noticing this excerpt from #48:

Reason №2 is to improve search ranking, so that searching for `the editor` cares more about `editor` than `the`. Removing stop words altogether is a heavy handed way of solving this, and the planned implementation of word ranking with BM25 will de-rank the `the` automatically without needing to put it on a list, while still keeping it in the index.
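
To illustrate why BM25 should de-rank a word like `the` on its own, here is a rough sketch of the textbook BM25 inverse document frequency term on a toy corpus. This is not Pagefind's actual implementation; the `docs` array and the `idf` helper are made up for the example.

```javascript
// Toy corpus: each document is a list of words (made up for illustration).
const docs = [
    ["the", "editor", "opens", "the", "file"],
    ["the", "user", "saves", "the", "page"],
    ["manage", "the", "user", "settings"],
    ["the", "search", "index"]
];

// Textbook BM25 inverse document frequency:
//   idf = ln(1 + (N - n + 0.5) / (n + 0.5))
// where N is the number of documents and n is how many contain the term.
function idf(term, corpus) {
    const n = corpus.filter((doc) => doc.includes(term)).length;
    return Math.log(1 + (corpus.length - n + 0.5) / (n + 0.5));
}

// A word found in every document gets a weight close to zero,
// while a word found in a single document gets a much larger one.
console.log("the:", idf("the", docs).toFixed(3));
console.log("editor:", idf("editor", docs).toFixed(3));
```

Because `the` appears in every document, its weight ends up far below that of `editor`, which is the de-ranking effect the excerpt describes. The word stays in the index without needing a stop-word list.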