CloudCannon / pagefind

Static low-bandwidth search at scale
https://pagefind.app
MIT License
3.24k stars 99 forks

Add an option to turn off density-weighting #534

Closed · dscho closed this 3 months ago

dscho commented 6 months ago

When searching, Pagefind applies a heuristic that often works quite well: it boosts pages with a higher density, i.e. a higher number of hits divided by the number of words on the page. This is called "density weighting".

In some instances (as pointed out here), though, it is desirable to use the number of hits directly, without dividing by the number of words on the page.

Let's support this via the search option use_weighting, which defaults to true to maintain the current behavior.
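As a rough sketch of the difference (this is an illustrative simplification, not Pagefind's actual scoring code), density weighting and raw hit counting can rank the same two pages in opposite orders:

```javascript
// Hypothetical simplification of the two scoring modes discussed in
// this PR; `pageScore` is not a real Pagefind function.
function pageScore(hits, wordCount, useWeighting = true) {
  // Density weighting divides hits by page length, favouring short,
  // focused pages; turning it off ranks purely by hit count.
  return useWeighting ? hits / wordCount : hits;
}

// A short stub page (3 hits in 20 words) vs a long reference page
// (30 hits in 5000 words):
const stub = { hits: 3, words: 20 };
const ref = { hits: 30, words: 5000 };

console.log(pageScore(stub.hits, stub.words) > pageScore(ref.hits, ref.words));
// true — density weighting favours the short stub
console.log(pageScore(stub.hits, stub.words, false) > pageScore(ref.hits, ref.words, false));
// false — raw hit counting favours the long reference page
```

With use_weighting: false, the second behavior would apply.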

dscho commented 6 months ago

I added documentation and a test.

bglw commented 6 months ago

👋 @dscho welcome! Thanks for jumping in ☺️

I like the direction here, though before releasing this as a feature I'd actually like to go further with the controls. For example, the function that calculates the balanced_score here imposes a certain style of ranking, and I'd love to expose numerical controls for how strong the word_length_bonus is, how the word_frequency modulates the score, etc. This would also be a great foundation for addressing some existing issues such as #437.

I'm imagining something like:

```javascript
await pagefind.options({
  ranking: {
    word_distance: 0.5,
    site_frequency: 1.0,
    page_frequency: 1.0
  }
});
```

Turning page_frequency to 0.0 would achieve the same effect as use_weighting: false in this PR, but would also allow it to be tuned more finely up and down.
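As a hypothetical illustration of that finer tuning (none of these names or formulas are Pagefind's real internals), exposing the density term as a numeric exponent makes use_weighting: false just the zero end of a dial:

```javascript
// Illustrative only: one way a numeric page_frequency knob could
// generalise the boolean. Not Pagefind's actual ranking code.
function rank(hits, wordCount, pageFrequency = 1.0) {
  // pageFrequency = 1.0 reproduces density weighting (hits / wordCount);
  // pageFrequency = 0.0 ignores page length entirely (raw hit count);
  // values in between soften the length penalty.
  return hits * Math.pow(1 / wordCount, pageFrequency);
}

console.log(rank(3, 20, 1.0)); // ≈ 0.15 (full density weighting)
console.log(rank(3, 20, 0.0)); // 3 (raw hits, like use_weighting: false)
console.log(rank(3, 20, 0.5)); // somewhere in between
```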

This isn't something I'm expecting you to do, I'm just brainstorming here for lack of a better place 😄 I'm on a soft break for a few more weeks, but wiring this up can be one of my initial jobs once I'm back on the tools.

In the meantime, I'm interested to hear whether this has made a meaningful impact on your rankings, if you've given it a run-through?

If not having this released is a blocker for you, I could merge this PR and release a tagged alpha version now, and then replace it with the implementation above once I have time. Let me know if that's required 🙂

dscho commented 6 months ago

@bglw I tried my hand at what you suggested, but for some reason I cannot seem to make even the siteFrequency part work. I must surely have modified the wrong code, assuming that word_frequency would be the indicator (it is 0 all the time for me).

Could you help me by guiding me to the code location that needs to be updated?

bglw commented 5 months ago

Hey @dscho, I'm back on the wires this week, so I will look into this promptly and get back to you!

dscho commented 5 months ago

@bglw thank you so much!

bglw commented 5 months ago

Hello! Feedback 🙂

Setting rankings

I'd like this to be an option set once, rather than passed with every search. So, configured once with:

```javascript
await pagefind.options({
  ranking: { /* ranks */ }
});
```

Every subsequent .search() call then uses those rankings. Ideally this also avoids the double call into WASM for every search (once to add the weights and once to run the search); the rankings can just be stored in the SearchIndex struct we carry around.

Making siteFrequency work

The code looks correct here! It's just the test that has an issue. Site frequency can be interpreted as "the number of pages that contain this word at least once", regardless of how many times that word appears on each page. The test sets up two pages that both contain cat and dog, so site frequency isn't contributing anything in this test. For the test to show something, you'll want the target word to exist on fewer than 10% of the pages on the site.
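A minimal sketch of site frequency as described above (illustrative names, not Pagefind's implementation): a page counts once no matter how many times the word appears on it, which is why rare-across-the-site words are the interesting case.

```javascript
// Illustrative helper: count pages containing a word at least once.
// Each page is modelled as an array of its words.
function siteFrequency(pages, word) {
  return pages.filter((words) => words.includes(word)).length;
}

const pages = [
  ["cat", "dog"],
  ["cat", "dog", "dog"], // repeats on one page don't matter
  ["fish"],
];

console.log(siteFrequency(pages, "cat"));  // 2 — on two pages, counted once each
console.log(siteFrequency(pages, "fish")); // 1 — rare word, the case worth boosting
```

In the two-page cat/dog test from the PR, every word has the same site frequency, so the knob has nothing to differentiate.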

Generally

Numbers look good! Napkin-ed the calculations and it definitely looks like exposing these three toggles will work well.

I might change from the wasm_bindgen new this.backend.RankingWeights flow to something more akin to the other function signatures, mainly to insulate the JavaScript calling conventions from the Rust types. But no need to worry about that; I'm happy to merge this as-is and change it at some point in the future.

I'll likely tweak the documentation wording after merge, so also no stress there.

Overall looking good! Let me know any Qs, and as always let me know if/when you want to tag me in to finish things up. I'll also be more responsive now that the holiday period has passed 😁

bglw commented 5 months ago

Oh, and one extra note on naming:

hu0p commented 5 months ago

Hey @bglw, do you think it would also make sense to consider minimum document length?

We have a site currently under development where we've employed weighting for different content types. However, some pages are very short. They're so short, in fact, that they're essentially stubs of reference info. This has presented a unique challenge for term ranking, because it really invalidates our ranking for certain terms. Let me provide an example:

There are roughly ten types of indexable content on the site, but "Case Types" AKA "Practice Areas" are by far the most important to us. Consequently, we've added the highest data-weight properties to those documents and included a progressively descending scale of data-weight values to other content types based on priority.

However, if you search for "car accidents", which is a case type with a lengthy associated document featuring "car accidents" as the h1, the top search result is a document from a different, lower-weighted content type. This is because that document contains a total of 14 indexable words (seen in the top header), including "car accidents". Our desired top result is present in the results, but much, much lower down the list. We'd like the current top result and similar results to continue to be indexed, but at a lower priority reflected by the attached weight.

I'm not sure if any of the proposed attributes could really help here. It feels like what we really need is the ability to penalize or promote rankings based on a quality metric which includes document length in the calculus.

I'm happy to use this site as a test case. It has close to 1900 indexable pages and a little under 13 thousand words. Let me know your thoughts and any questions. If you feel that this is interesting, but is separate to the goals of this PR, just let me know and I'd be happy to move this discussion elsewhere as well.

bglw commented 4 months ago

👋

@hu0p thanks for the feedback! I think this is definitely a good case to cover. Something that has been overdue for a while is changing the ranking from tf-idf to BM25, which helps with some of these issues around term saturation and page length. But it also might make sense to add a setting to adjust this further! I'll look into it.
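To see why BM25 helps with hu0p's 14-word stub problem, here is the textbook BM25 term score (the standard formula, not Pagefind's exact implementation or parameter names): term frequency saturates via k1, and the b parameter normalizes by page length relative to the site average instead of dividing by raw length.

```javascript
// Textbook single-term BM25 score, for illustration only.
// tf: occurrences of the term in this document
// docLen / avgDocLen: this document's length vs the site average
// nDocs / docsWithTerm: corpus size and document frequency of the term
function bm25Term(tf, docLen, avgDocLen, nDocs, docsWithTerm, k1 = 1.2, b = 0.75) {
  // Inverse document frequency: rarer terms score higher.
  const idf = Math.log(1 + (nDocs - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
  // Length normalisation: b = 0 ignores length, b = 1 fully normalises.
  const norm = 1 - b + b * (docLen / avgDocLen);
  // Saturating term-frequency component.
  return (idf * (tf * (k1 + 1))) / (tf + k1 * norm);
}

// Hypothetical numbers loosely modelled on hu0p's site: a 14-word stub
// with one hit no longer trounces a 1000-word page with many hits,
// because the long page's extra hits saturate instead of being
// divided away by its length.
const avg = 500, N = 1900, n = 50;
const stub = bm25Term(1, 14, avg, N, n);
const longPage = bm25Term(20, 1000, avg, N, n);
console.log(longPage > stub); // true
```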

@dscho would you like me to take the PR over from this point and finish it off?

hu0p commented 4 months ago

@bglw Thanks for getting back to me! I looked into BM25 and wow, yeah, that sounds perfect. BM25F may be even better for sites that are semantically well-formed too.

I ended up here due to interest in activity around pagefind_web/src/lib.rs and pagefind_web/src/search.rs. You're officially responsible for coercing an American into learning the term "Fossicking", so thanks for that. Let me know if there's anything I can do to help. I've been bouncing in and out of familiarizing myself with the codebase.

dscho commented 4 months ago

> @dscho would you like me to take the PR over from this point and finish it off?

That would be wonderful!

bglw commented 4 months ago

On it! Aiming to have something this/next week; I'll just continue on this branch. I'm thinking I'll roll in the BM25 change, which will alter a few structural things but result in a much better system.

dscho commented 4 months ago

> On it! Aiming to have something this/next week; I'll just continue on this branch. I'm thinking I'll roll in the BM25 change, which will alter a few structural things but result in a much better system.

Excellent @bglw ! Thank you so much!

bglw commented 3 months ago

Long time coming, @dscho & @hu0p, but I finally got this merged! It has changed a bit: Pagefind now implements BM25 ranking, and provides roughly the previous controls plus the BM25 parameters.

It's available on the v1.1.0-rc1 version right now, if you want to take it for a spin and slip any feedback in before a release!

Documentation for the ranking configuration can be seen here: https://unreleased.pagefind.app/docs/ranking/ (it's now a top-level Pagefind option rather than a search parameter).

Need to do some housekeeping before a release, but I'll be looking to get v1.1.0 out in the coming week.

dscho commented 3 months ago

@bglw thank you so much!

I integrated this into my current work to migrate https://git-scm.com/ from a Heroku app to a static website hosted on GitHub Pages: https://git.github.io/git-scm.com/. It's pretty good! If you look at https://git.github.io/git-scm.com/search/results?search=log&language=en, you see that the git-log manual page is shown first.

I had to play a couple of tricks, though, such as adding a hidden <h3> with maximum data-pagefind-weight. It would be nice if there were a way to boost certain search terms on a given page without having to add hidden text.

Another issue I have: the first time I typed log really fast into the search box at https://git.github.io/git-scm.com/, the order of the search results was quite off; it did not show git-log as the first entry. Is it possible that debouncedSearch() is racy? Or maybe I am holding this thing wrong in https://github.com/git/git-scm.com/pull/1804/commits/68338ecf2f3edcfec06f4d918a0af52d6b898d0c?

bglw commented 3 months ago

Awesome! I'll release what we have now tomorrow.

> I had to play a couple of tricks, though

Thankfully I do still have ideas for the future, primarily the #532 and #437 issues, which will help a lot. Essentially they would give you a way to boost based on the title metadata, without it having to live in the data-pagefind-body area. No ETA on that, but it's up the list; hopefully the current state is enough of an unblocker!

> Is it possible that debouncedSearch() is racy?

It shouldn't be, at least I haven't seen it be so. The implementation is pretty simple.

Skimming through https://github.com/git/git-scm.com/commit/68338ecf2f3edcfec06f4d918a0af52d6b898d0c, I think there's a window where:

  1. You get results from debouncedSearch that are not null.
  2. Your $("#search-results").html(...) call runs, which contains asynchronous loading of results with await e.data().
  3. While those are loading, a new debouncedSearch result comes through, also not null.
  4. $("#search-results").html(...) runs again with the new result fragments. These resolve quickly, and their HTML is inserted onto the page.
  5. A slow request from step 2 finally resolves, causing the HTML from the first search to be inserted onto the page over the newer results.

Hopefully that makes sense. In essence, you'll want to assign that HTML template to a variable and, after it has awaited all of its result fragments, check that you're still handling the latest debouncedSearch results before inserting it into the page.
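One common shape for that guard (a sketch, not the git-scm.com code; `runSearch` and `render` stand in for the site's own search call and DOM insertion) is to tag each search with an id and drop any render that finishes after a newer search has started:

```javascript
// Monotonically increasing id of the most recently started search.
let latestSearchId = 0;

async function searchAndRender(term, runSearch, render) {
  const searchId = ++latestSearchId;
  const results = await runSearch(term);     // e.g. the debounced search
  const fragments = await Promise.all(
    results.map((r) => r.data())             // slow per-result loads
  );
  // A newer search started while we were awaiting: discard this render
  // instead of overwriting fresher results on the page.
  if (searchId !== latestSearchId) return;
  render(fragments);
}
```

The key point is that the staleness check happens after all the awaits, immediately before touching the DOM.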

bglw commented 3 months ago

Ranking changes released in v1.1.0 🎉

dscho commented 3 months ago

> Ranking changes released in v1.1.0 🎉

Awesome, thanks for all your hard work @bglw !