CloudCannon / pagefind

Static low-bandwidth search at scale
https://pagefind.app
MIT License

H1 elements not being indexed #615

Closed clydebarrow closed 1 month ago

clydebarrow commented 1 month ago

I'm trying to implement Pagefind on a site to improve the search, and it does not seem to be indexing the content of H1 elements.

The site is browsable here:

https://esphome-docs.web.app/index.html

The screenshot below shows searching for "Automations and Templates" and it is clearly ignoring the H1 header on the current page.

According to the docs, H1 headers should rank well above body text.

I suspect, though have not confirmed, that other Hx elements are also not being indexed.

I'm using Pagefind 1.1.0.

[Image: Screenshot 2024-05-15 at 12 18 14 PM]

bglw commented 1 month ago

👋 hey @clydebarrow

The ranking here does seem worse than I would expect by default. It is indexing the headings, though; they're just getting a bit lost in the soup of results.

This is where tweaking the ranking parameters would help. I can see in the site source you have configuration for this, but it's currently not being applied since it isn't inside a ranking object. Currently you have:

    window.addEventListener('DOMContentLoaded', (event) => {
        new PagefindUI({
            element: "#search",
            showSubResults: true,
            pageLength: 0.0,
            termSaturation: 0.8,
            termFrequency: 0.4,
            termSimilarity: 1.0
        });
    });

But that should be:

    window.addEventListener('DOMContentLoaded', (event) => {
        new PagefindUI({
            element: "#search",
            showSubResults: true,
            ranking: {
              pageLength: 0.0,
              termSaturation: 0.8,
              termFrequency: 0.4,
              termSimilarity: 1.0
            }
        });
    });
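One way to guard against this kind of misplacement is to normalize the options before passing them on. The following is only a minimal sketch: `nestRankingOptions` and `RANKING_KEYS` are hypothetical names, not part of Pagefind, and the key list covers just the four parameters mentioned in this thread.

```javascript
// The four ranking parameters named in this thread.
const RANKING_KEYS = ["pageLength", "termSaturation", "termFrequency", "termSimilarity"];

// Hypothetical helper (not a Pagefind API): returns a copy of `options`
// with any top-level ranking keys moved under `ranking`.
function nestRankingOptions(options) {
    const fixed = { ...options, ranking: { ...(options.ranking || {}) } };
    for (const key of RANKING_KEYS) {
        if (key in options) {
            // A value already nested under `ranking` wins over a misplaced one.
            if (!(key in fixed.ranking)) {
                fixed.ranking[key] = options[key];
            }
            delete fixed[key];
        }
    }
    return fixed;
}

// Usage with the misplaced options from the site source:
const fixedOptions = nestRankingOptions({
    element: "#search",
    showSubResults: true,
    pageLength: 0.0,
    termSaturation: 0.8,
    termFrequency: 0.4,
    termSimilarity: 1.0
});
// fixedOptions.ranking now holds the four ranking values,
// and none of them remain at the top level.
```

The helper copies its input rather than mutating it, so the original object is left as it was.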

Additionally, from a quick test, these parameters seem to do a little better:

    pageLength: 0.0,
    termSaturation: 1.6, // raised this value to favor the high-density pages
    termFrequency: 0.4,
    termSimilarity: 6.0 // raised this value to trim out some shorter word stems from muddying results

clydebarrow commented 1 month ago

> but it's currently not being applied since it isn't inside a ranking object

Reminds me again why I dislike JavaScript :-(

So with that fixed and your suggestions applied, it now seems to be working well. Thanks!
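
For reference, the two changes from this thread (nesting the parameters under `ranking` and using the suggested values) might combine as in the sketch below. `searchOptions` is an illustrative name, and the guard around `PagefindUI` is an assumption added so the snippet does nothing where the Pagefind UI script has not been loaded.

```javascript
// Options with the suggested values from the thread, nested under `ranking`.
const searchOptions = {
    element: "#search",
    showSubResults: true,
    ranking: {
        pageLength: 0.0,
        termSaturation: 1.6, // favor the high-density pages
        termFrequency: 0.4,
        termSimilarity: 6.0  // trim shorter word stems out of the results
    }
};

// Only attach the UI in a browser where the Pagefind UI script is available.
if (typeof window !== "undefined" && typeof PagefindUI !== "undefined") {
    window.addEventListener("DOMContentLoaded", () => {
        new PagefindUI(searchOptions);
    });
}
```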