CloudCannon / pagefind

Static low-bandwidth search at scale
https://pagefind.app
MIT License

Improve ranking of exact matches in page titles #437

Open hirasso opened 11 months ago

hirasso commented 11 months ago

Hi there,

we just updated pagefind to 1.0.2 for the docs for swup and it's amazing! Thanks for all your work on this project.

Playing around with it, I noticed something that might or might not be possible to generalize: When I search for "Plugins", I'm getting the following results:

image

Intuitively I'd think that a page whose main heading (h1) exactly matches the search term ("Plugins", highlighted in the screenshot above) should be rated highest. Sure, it's possible for us to manually give the "plugins" page a very high rating, but maybe there is a way for pagefind to get smart enough to return pages with an exact match in the main heading first?

I'd be happy to hear your opinion on that before we start implementing manual ranking.

bglw commented 11 months ago

Ah! Interesting.

So first, the reasoning for why the results are in that order: headings are given very strong priority, and /api/properties/ contains an h2 element containing "plugins", which is also weighted very favorably, so that page appears high in the rankings. The other reason is that the page is so short: term frequency plays a big role in ranking, so a short page with multiple matching words will rank better than a long page with a lower density of matching words.

Intuitively I'd think that a page whose main heading (h1) exactly matches the search term should be rated highest

On an intuitive level I agree! The trouble is that currently, with the data Pagefind has on hand at the time of ranking, this isn't known. By the time that level of data is loaded into the front end, the rankings are locked in.

When ranking, all we see for these results is:

// Page A word match locations
[{
  "weight": 6,
  "location": 23
}, {
  "weight": 1,
  "location": 27
}]

// Page B word match locations
[{
  "weight": 7,
  "location": 0
}, {
  "weight": 1,
  "location": 12
}, {
  "weight": 1,
  "location": 21
},
/* -- more -- */
]

That weight: 7, location: 0 word is the h1, but we don't know that it's the only word in the h1.
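To make the limitation concrete, here is a toy model in Python of ranking from nothing but these match lists. It is not Pagefind's actual formula: the scoring rule (sum of weights divided by page length) and the page lengths of 40 and 600 words are assumptions chosen for illustration, and Page B's elided matches are left out.

```python
# Toy model only: Pagefind's real scoring is not shown in this thread.
# Match lists are copied from the example above (Page B's elided entries omitted).
page_a = [{"weight": 6, "location": 23}, {"weight": 1, "location": 27}]
page_b = [
    {"weight": 7, "location": 0},
    {"weight": 1, "location": 12},
    {"weight": 1, "location": 21},
]


def density_score(matches, page_word_count):
    """Sum of match weights, normalised by page length (assumed rule)."""
    return sum(m["weight"] for m in matches) / page_word_count


# Hypothetical lengths: Page A is short, Page B is long.
score_a = density_score(page_a, page_word_count=40)
score_b = density_score(page_b, page_word_count=600)

# The short page wins despite Page B's heavier heading match.
print(score_a > score_b)  # True
```

Nothing in either list records how many words the heading at location 0 contains, so under this kind of model an exact-title match cannot be told apart from a long heading that merely includes the term.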


I'm sure there's a creative solution here, but nothing immediately comes to mind.

One option would just be to bump h1 elements default weighting across the board to compensate, but I'd be wary of that impacting other sites in the wrong direction, if they're currently ranking well.

I do want to expose a new configuration option for mapping element selectors to custom weightings, meaning the default h1 rank could be a per-site implementation if people want to tailor their results. But this doesn't solve the "making Pagefind smart enough" goal of doing better by default.

Another option is finding some way to opt h1 elements (or generally high-ranked words) out of the term frequency penalty — but I'll need to think on that further.
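As a rough sketch of what opting high-weight words out of the term frequency penalty could look like, the Python below leaves heavy matches un-normalised and only divides the light ones by page length. Everything here is an assumption for illustration: the threshold of 5, the page lengths, and the scoring rule itself are hypothetical and not taken from Pagefind.

```python
def score_with_exempt_headings(matches, page_word_count, exempt_from=5):
    """Toy score: matches at or above `exempt_from` skip length normalisation.

    `exempt_from` is a hypothetical threshold, not a Pagefind setting.
    """
    heavy = sum(m["weight"] for m in matches if m["weight"] >= exempt_from)
    light = sum(m["weight"] for m in matches if m["weight"] < exempt_from)
    return heavy + light / page_word_count


# Match lists from the example earlier in the thread (elided entries omitted).
short_page = [{"weight": 6, "location": 23}, {"weight": 1, "location": 27}]
long_page = [
    {"weight": 7, "location": 0},
    {"weight": 1, "location": 12},
    {"weight": 1, "location": 21},
]

# Hypothetical lengths of 40 and 600 words.
short_score = score_with_exempt_headings(short_page, page_word_count=40)
long_score = score_with_exempt_headings(long_page, page_word_count=600)

# The long page with the h1 match now ranks first.
print(long_score > short_score)  # True
```

With these made-up numbers the long page with the weight-7 match overtakes the short one, which is the effect the idea is after; whether it would behave well across real sites is exactly the open question.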


Sorry for the essay! Since I have no immediate bright ideas, I think implementing manual ranking is a good first step.

(Changing your h1 elements to data-pagefind-weight="10" would definitely push Plugins into at least second place).

hirasso commented 11 months ago

No need to apologize, I love the essay! 😄 ...very interesting to get to know more about pagefind's internals.

From my experience with implementing search, it's very possible that too many built-in assumptions ("smartness") could hurt more than help. After giving it more thought I realized that maybe even the assumption that a perfect match in the h1 should be ranked highest doesn't hold true. What if we change the title of the page to "Plugin Ecosystem" or "Plugins Overview" sometime in the future? This would immediately break the ranking again.

From other search engines I know the concept of "pinning" pages to the very top for specific search terms. Maybe that could be a feature idea for pagefind, as well.

Something like data-pagefind-pin="plugin,plugins" or even with regex support: data-pagefind-pin="plugins?" could pin the page "Plugins" to the very top if users would search for "plugin" or "plugins".

It could also be a meta tag in the <head> of the page, like e.g.:

<!-- pin the page to the top when searching for "plugin" or "plugins": -->
<meta name="pagefind:pin" content="plugin,plugins">

Just an idea without knowing if this could be feasible with the architecture of pagefind 🙂
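The pinning idea can be sketched independently of Pagefind. The Python below assumes each result carries a hypothetical "pin" field holding the comma-separated list from the proposed attribute or meta tag; no such field exists in Pagefind, and the "/plugins/" URL is made up for the example. Regex support is left out.

```python
def parse_pins(content):
    """Split a comma-separated pin list into a set of lowercase terms."""
    return {term.strip().lower() for term in content.split(",") if term.strip()}


def apply_pins(query, results):
    """Move pages pinned for this query to the front, keeping order otherwise."""
    q = query.strip().lower()
    # sorted() is stable, so the existing ranking is preserved within each group.
    return sorted(results, key=lambda r: q not in parse_pins(r.get("pin", "")))


# Results in their original ranked order; "pin" is a hypothetical field.
results = [
    {"url": "/api/properties/", "pin": ""},
    {"url": "/plugins/", "pin": "plugin,plugins"},
]

reordered = apply_pins("Plugins", results)
print([r["url"] for r in reordered])  # ['/plugins/', '/api/properties/']
```

Because the pin list lives on the page rather than in the title, renaming the page to "Plugins Overview" would not change the outcome.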

daun commented 11 months ago

Given the implementation details kindly explained by @bglw, I'd opt for manual ranking here. Intuitively, for most sites, ranking pages the highest that have an exact match between search term and h1/title makes sense, but if that's not encoded in the ranking data, ranking them during display should work as well. I can imagine treating the h1 as a special title property of a page apart from the other contents, as that's how they're handled in display as well. But that's probably opening a whole different can of worms...
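Ranking during display could look roughly like the Python sketch below, which promotes any result whose title equals the query and otherwise keeps the incoming order. It works on plain dictionaries with a "title" key, which is an assumption: on a real site this logic would run in the browser on whatever title the loaded result data exposes, and the sample titles here are invented.

```python
def promote_exact_title(query, results):
    """Put results whose title equals the query first; keep order otherwise."""
    q = query.strip().casefold()
    # False sorts before True, and sorted() is stable within each group.
    return sorted(results, key=lambda r: r["title"].strip().casefold() != q)


# Hypothetical results in the order the search returned them.
results = [
    {"url": "/api/properties/", "title": "Properties"},
    {"url": "/plugins/", "title": "Plugins"},
]

reordered = promote_exact_title("plugins", results)
print([r["title"] for r in reordered])  # ['Plugins', 'Properties']
```

This only reorders results that have already been loaded, so a matching page that sits far down the list would need to be fetched before it could be promoted.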

bglw commented 11 months ago

Good directions to think about, thanks to both of ya 🙏

I need to brush up CloudCannon's documentation and search, so that might prove a good place to experiment with some pinning or special casing on a site I'm familiar with. Will update here if I land on anything!

bglw commented 10 months ago

Some new thoughts on this: I think I'm going to try to get metadata into the index in a way that it can be queried as part of a search. This would allow you to do a freeform search for the word plugins, or run a search specifically for title metadata containing the word plugin, or some combination of the two.

Still more ideation needed, but I think having that data combined with exposing more configuration on how it is used when ranking will allow people to tailor search to their site content.

clydebarrow commented 3 months ago

My request is not to exactly match titles, but filenames:

It would be nice IMHO if a search that exactly matched the name of a page would deliver that page as the top hit. E.g. searching for "I2C" would put a page called "I2C.html" at the top (without case sensitivity of course.)

Are filenames even used at present?

bglw commented 3 months ago

Filenames are currently unused apart from building URLs, but the URLs are present in the result fragments, so what you're after would still be solved by #532, as it is the precursor to matching files based on any "non-content" fields.
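Since URLs are available on the results, a filename match can in principle be approximated after the search has run. The Python sketch below is one way to do that under stated assumptions: results are plain dictionaries with a "url" key, the directory paths are invented, and this is a post-processing step rather than anything Pagefind does itself.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def filename_stem(url):
    """Return the last path segment without its extension, e.g. 'I2C'."""
    return PurePosixPath(urlparse(url).path).stem


def promote_filename_match(query, results):
    """Put results whose filename equals the query (ignoring case) first."""
    q = query.strip().casefold()
    # sorted() is stable, so the remaining results keep their ranked order.
    return sorted(results, key=lambda r: filename_stem(r["url"]).casefold() != q)


# Hypothetical results in ranked order; only the name I2C.html is from the thread.
results = [
    {"url": "/components/spi.html"},
    {"url": "/components/I2C.html"},
]

reordered = promote_filename_match("i2c", results)
print([r["url"] for r in reordered])  # ['/components/I2C.html', '/components/spi.html']
```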

quadratecode commented 3 months ago

I agree that matches in page titles are not adequately weighted. After tinkering with all available parameters I could not get exact matches in the title tag with data-pagefind-weight="10" to show up at the top of the results list. Here is an example:

image

There are also some other rankings which I do not understand - e.g. here:

image

In the above image, the first result has 6 matches while the second one has 49 matches. Changing the parameters did not seem to do much.

If you want, you can try it yourself here: https://www.zhlaw.ch/

Anyways - just also wanted to say: Amazing project, thank you so much for your work and offering this under a permissive license! 🙏

sonderant commented 13 hours ago

I agree that an exact match in the title should be given more priority. For example, on my site I have 360 articles that begin with the word "Article" and a number, e.g. Article 1, Article 206, etc. When searching with Pagefind, any article between 1 and 30 does not show up, even in the top ten. For example, "Article 1" (no quotes) returns all articles in the 100-199 range, or articles with the word "article" or the number 1 in them. But any article past number 36 shows up in the top ten.