Search index based on headers

kylebutts commented 1 year ago

Hi there!

It's quite common to have headers with ids for linking to subsections of a documentation page. I'm wondering if it's possible to have the search index broken up by headers?

Here's an example of what I'm talking about. See how the search shows you sections within the page: https://pkgdown.r-lib.org/articles/search.html (screenshot)

bglw commented 1 year ago

Hi @kylebutts 👋

Not at the moment, but it's a great suggestion (that I have heard before, though I can't find an existing issue)

This looks like something that could definitely be implemented. I'll spitball two ways one could configure this, either in some automatic way, or using an attribute.

Automatic

This would be some sort of config like (option pending)

# pagefind.yml
split_pages_on: "h2"

Which would then do some ✨ magic ✨ to produce the desired result. The main concern here is if the ✨ magic ✨ doesn't suit a particular user, and isn't customizable enough.

Attribute

This would be a new attribute like (syntax pending)

<div data-pagefind-subpage="#getting-started">
  <h2 id="getting-started">Getting started</h2>
  <p>. . . </p>
</div>

or (also syntax pending)

<h2 data-pagefind-subpage id="getting-started">Getting started</h2>
<p>. . . </p>

This would provide more control over the indexing behavior, but doesn't suit people who can't add attributes here (e.g. the entire page content goes through a | markdownify filter in an SSG that doesn't have hooks).

Keen to hear your thoughts on configuration and how you would ideally set this up. There are also some corner cases I can foresee, but I'll let those simmer until a direction lands.

kylebutts commented 1 year ago

Hi @bglw! Pleasure to meet you.

I assume that the index looks something like: each index.html page (or data-pagefind-body attr) has a set of words that are found on that page.

If you were to look at a page and find all headers with id attributes, then you could "split" the page into chunks by content before and after each id'd header. The result.data() call could append the id to the url. I think most frameworks allow remark/rehype plugins, so it's not too hard to automatically have ids on headers at the desired level, e.g. https://github.com/rehypejs/rehype-slug.

I think it might be useful to add attributes to the JSON returned by result.data() that describe the header nesting (in the image above, the > separators show the nested h1, h2, h3 structure of the page).

Your recommended pagefind.yml config would be useful in settings where there are no ids and it's not possible to easily generate them. However, in that case, how would you jump to those headers in the url? For that reason, I think trying to adapt to this setting with no header IDs wouldn't create much value.

bglw commented 1 year ago

So a reason this is a little trickier is that the index actually works the other way around. If we craft a super simple example:

# a.html
<p>Page One</p>

# b.html
<p>Page Two</p>

Then a simplified version of the index (if it were JSON) would look something like:

{
  "pages": ["a.html", "b.html"],
  "words": {
    "one": [0],
    "page": [0, 1],
    "two": [1]
  }
}

This means if we're going to split the page, we need to do so at the time of indexing rather than the time of retrieval. So the index would allow for something like

"pages": ["a.html", "a.html#heading-one", "a.html#heading-two"]

With the words associated to each "page". There is a moment when the index is being built that we have it mapped the other way, though, so at that point we could split on headings and do what we need to do. It isn't outlandish, it will just need a careful refactor around the 1 file -> 1 page assumption baked in, and making it work with the way the HTML is parsed as a stream.
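To make the index-time split concrete, here is a minimal JavaScript sketch. It is not Pagefind's actual indexer (which, per the thread, parses HTML as a stream); `buildIndex` and the pre-split `sections` input are hypothetical, and it assumes each file has already been cut into fragments keyed by heading id.

```javascript
// Hypothetical sketch: build the simplified inverted index shown above,
// treating each heading-delimited section as its own "page".
// Assumed input shape (for illustration only):
//   [{ url, sections: [{ id: string | null, text: string }] }]
function buildIndex(files) {
  const pages = [];
  const words = Object.create(null); // avoids clashes with Object.prototype keys

  for (const file of files) {
    for (const section of file.sections) {
      // Content before the first heading keeps the bare URL.
      const pageUrl = section.id ? `${file.url}#${section.id}` : file.url;
      const pageIndex = pages.push(pageUrl) - 1;

      const tokens = section.text.toLowerCase().match(/[a-z0-9]+/g) || [];
      for (const token of new Set(tokens)) {
        if (!(token in words)) words[token] = [];
        words[token].push(pageIndex);
      }
    }
  }
  return { pages, words };
}

const splitIndex = buildIndex([
  {
    url: "a.html",
    sections: [
      { id: null, text: "Page One" },
      { id: "heading-one", text: "First section" },
      { id: "heading-two", text: "Second section of page one" },
    ],
  },
]);

console.log(splitIndex.pages);
console.log(splitIndex.words.page);
```

Here the word "page" ends up pointing at the bare `a.html` entry and at `a.html#heading-two`, so a search can land on a specific section without any work at retrieval time.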

Quirk 1: Do we split the whole page from a heading, or do we try to match it hierarchically?

<h1>My Page</h1>
<p>My page text</p>
<div>
  <h2>Next heading</h2>
  <p>Inner heading text</p>
</div>
<p>Final text</p>

A naïve approach would split everything at the h2 here, but maybe the Final text at the bottom shouldn't be included?

bglw commented 1 year ago

In any case this has some similarities to some index weighting work that's teed up, so I'll likely wind up looking at them together in the not-too-distant future 🙂

kylebutts commented 1 year ago

Very interesting! I'll keep an eye out :-)

Unrelated, but this indexing method reminds me a lot of sparse matrices where it's more space efficient to just store the index of the pages instead of a vector of 0s and 1s. Not sure how you go about it in Rust, but might save space for larger indexes!

Anyway, thanks so much for this package. Love the idea of post-SSG processing utilities.

bglw commented 1 year ago

Ah, that contrived example was a little too contrived 😅 A better example would be:

{
  "pages": ["a.html", "b.html", "c.html"],
  "words": {
    "one": [0],
    "page": [0, 1, 2],
    "two": [1],
    "three": [2]
  }
}

They are transferred as the sparse indexes, and then are turned back into bitsets clientside to search quickly
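As a rough illustration of that last step, the sketch below expands sparse page indexes into bitsets and intersects them for a two-word query. The function names are made up for this example and the layout (32 pages per `Uint32Array` word) is an assumption, not a description of Pagefind's internals.

```javascript
// Hypothetical sketch: sparse page indexes -> bitsets -> intersection.
function toBitset(pageIndexes, pageCount) {
  const bits = new Uint32Array(Math.ceil(pageCount / 32));
  for (const i of pageIndexes) {
    bits[i >>> 5] |= 1 << (i & 31); // set bit i
  }
  return bits;
}

// Pages present in both bitsets (both must cover the same page count).
function intersect(a, b) {
  const out = new Uint32Array(a.length);
  for (let i = 0; i < a.length; i++) out[i] = a[i] & b[i];
  return out;
}

// Convert a bitset back into a sorted list of page indexes.
function toPageIndexes(bits) {
  const result = [];
  for (let i = 0; i < bits.length * 32; i++) {
    if (bits[i >>> 5] & (1 << (i & 31))) result.push(i);
  }
  return result;
}

const exampleIndex = {
  pages: ["a.html", "b.html", "c.html"],
  words: { one: [0], page: [0, 1, 2], two: [1], three: [2] },
};

const pageBits = toBitset(exampleIndex.words.page, exampleIndex.pages.length);
const twoBits = toBitset(exampleIndex.words.two, exampleIndex.pages.length);
const hits = toPageIndexes(intersect(pageBits, twoBits)).map((i) => exampleIndex.pages[i]);

console.log(hits); // pages containing both "page" and "two"
```

The sparse form is smaller to transfer, while the bitset form makes combining words a matter of a few bitwise ANDs.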

bennypowers commented 1 year ago

Does it have to be headers? Can we just put the closest or previous ID in the hash when present?

Consider this <table> of design tokens:

<table>
  <thead>
    <tr>
      <th>Token name</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr id="rh-color-accent-base-on-light">
      <!-- I'm a useful search term! -->
      <td data-pagefind-filter="token"><code>--rh-color-accent-base-on-light</code></td>
      <td><code>#0066cc</code></td>
    </tr>
    <tr id="rh-color-accent-base-on-dark">
      <!-- I'm a useful search term! -->
      <td data-pagefind-filter="token"><code>--rh-color-accent-base-on-dark</code></td>
      <td><code>#73bcf7</code></td>
    </tr>
  </tbody>
</table>

Here, I want the pagefind results to link to /tokens/color/#rh-color-accent-base-on-light etc. There's no header though, because this is a table.

Another way of saying this is that I want multiple results per page.

Say the user searched for accent, I'd like to get something like


[
  {
    content: 'Some more info about --rh-color-accent-base-on-dark',
    url: '/tokens/color/#rh-color-accent-base-on-dark',
  },
  {
    content: 'Some more info about --rh-color-accent-base-on-light',
    url: '/tokens/color/#rh-color-accent-base-on-light',
  },
]

bglw commented 1 year ago

Thanks for the samples — I'll definitely look at building this to allow multiple results per page, so there will be a way to achieve what you're after there 🙂

(NB: unrelated to this issue — looking at your code sample @bennypowers, I'll need to implement a way to index those design tokens as individual words rather than a single word. Currently that will index as a single word rh-color-accent-base-on-light and due to the index chunking, a search for accent won't bring it up. Let me know if that's needed and I'll get that in soon)

bennypowers commented 1 year ago

Yes, that's correct. I will need to index each part of the token name.

If I have to specify each one in an attr I don't mind that

<td data-pagefind-filter="token"
    data-pagefind-thingies="color,accent,base,on-light">
  <code>--rh-color-accent-base-on-light</code>
</td>

bglw commented 1 year ago

The way to achieve that right now would be to use index-attrs:

<td data-pagefind-filter="token"
       data-tokens="color accent base on-light"
       data-pagefind-index-attrs="data-tokens">
  <code>--rh-color-accent-base-on-light</code>
</td>

I'll have a think on an easier way to represent this without having to duplicate the content. Perhaps Pagefind should automatically index a word like color-accent as [color-accent, color, accent]
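A minimal sketch of that idea, assuming hyphens are the only separator that matters. `expandCompound` is a hypothetical name for illustration and is not how Pagefind tokenizes today:

```javascript
// Hypothetical sketch: index a hyphenated word as the whole word plus
// each of its non-empty parts, without duplicates.
function expandCompound(word) {
  const parts = word.split("-").filter((part) => part.length > 0);
  // Words with fewer than two parts are indexed once, unchanged.
  if (parts.length < 2) return [word];
  return [...new Set([word, ...parts])];
}

console.log(expandCompound("color-accent"));
console.log(expandCompound("--rh-color-accent-base-on-light"));
```

With this expansion, a search for accent could match the design token above without duplicating its parts in a separate attribute.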

(EDIT: Created a new issue for this discussion at #225)

bennypowers commented 1 year ago

The way to achieve that right now...

Awesome thanks. Back to OP, I'll still need to link these back to the hash for the closest/previous ID

bglw commented 1 year ago

👍 I'll start implementing this fairly soon — my initial plan was for this to be part of a ✨ Pagefind 1.0 ✨ release, but I'll see how things track for whether this makes it out before that.

I also have a couple of ideas for an alternative way to implement this, so I might give those a poke and report back.

Are you using the Pagefind UI or the JS API directly?

bennypowers commented 1 year ago

JS API, so given token names and my conventions, I think I can construct URLs with your snippet. Will find out next week 🙂

bennypowers commented 1 year ago

Just a quick update: I did index the token path parts, but those tags still apply to the whole page. I haven't found a way to associate those tags with a particular element on the page, or to forward those tags to the result so I can construct a URL.

bglw commented 1 year ago

Roger — I'll do some investigation on this very soon.

bglw commented 1 year ago

Quick question for your example. If I searched for accent, and there are multiple results on the page, would you want a unique search result for every heading to be shown, or just have the result link to the hash of the first closest heading?

In other words, are your ideal results for accent:

/tokens/color/#rh-color-accent-base-on-light
/tokens/color/#rh-color-accent-base-on-dark
/another/page/

Or just

/tokens/color/#rh-color-accent-base-on-light
/another/page/

bennypowers commented 1 year ago

First example. Multiple results per page

bglw commented 1 year ago

Hi @bennypowers / @kylebutts — initial update here. I'm still planning out how to best wrap this up as a feature, but I have now implemented what I think will be the primitive backing it.

It is currently sitting on a prerelease version. If you're running via npx, you can run:

npx pagefind@0.13.0-alpha.0 --source ...

or if you're pulling the binary directly you can download it from the pagefind-beta release.

The feature implemented thus far is that a list of page anchors is returned in the result data. (Additionally, the word locations are easier to access).

let search = await pagefind.search("filter");
let result = await search.results[0].data();

Returns:

{
  // some fields omitted
  "url": "/docs/filtering/",
  "anchors": [{
      "element": "h2",
      "id": "tagging-an-element-as-a-filter",
      "location": 18
    }, {
      "element": "h2",
      "id": "tagging-an-attribute-as-a-filter",
      "location": 87
    }],
  "locations": [ 3, 6, 23, 40, 51, 65, 93, 96, 107, 116 ]
}

This should be the data required to build a header-based result list. locations here represents the index of every matched word in the content, and the anchors array contains the corresponding location of the beginning of that element. With a little bit of logic (i.e. seeing that there are matching words between locations 18 and 87), you could synthesize a result for /docs/filtering/#tagging-an-element-as-a-filter.
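The "little bit of logic" could look roughly like the sketch below. `splitByAnchors` is a hypothetical helper written for this example, not part of the Pagefind API; it only assumes the `url`, `anchors`, and `locations` fields shown above.

```javascript
// Hypothetical helper: bucket matched word locations under the closest
// preceding anchor, producing one synthetic sub-result per section.
function splitByAnchors(result) {
  const anchors = [...result.anchors].sort((a, b) => a.location - b.location);
  const groups = new Map(); // url -> { url, anchor, locations }

  for (const location of result.locations) {
    // Find the last anchor that starts at or before this word.
    let current = null;
    for (const anchor of anchors) {
      if (anchor.location > location) break;
      current = anchor;
    }
    // Matches before the first anchor stay on the bare page URL.
    const url = current ? `${result.url}#${current.id}` : result.url;
    if (!groups.has(url)) groups.set(url, { url, anchor: current, locations: [] });
    groups.get(url).locations.push(location);
  }
  return [...groups.values()];
}

const sampleResult = {
  url: "/docs/filtering/",
  anchors: [
    { element: "h2", id: "tagging-an-element-as-a-filter", location: 18 },
    { element: "h2", id: "tagging-an-attribute-as-a-filter", location: 87 },
  ],
  locations: [3, 6, 23, 40, 51, 65, 93, 96, 107, 116],
};

const subResults = splitByAnchors(sampleResult);
for (const sub of subResults) {
  console.log(sub.url, sub.locations);
}
```

For the sample data this yields three groups: the matches at 3 and 6 stay on /docs/filtering/, the four matches between 18 and 87 go to #tagging-an-element-as-a-filter, and the remaining four go to #tagging-an-attribute-as-a-filter.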

No configuration is needed for the above example, as the main search indexes are unaffected, and the page fragment size is a lesser concern for Pagefind. As such, all elements with ids that live within the main Pagefind body are now tracked by default.

My intention is for Pagefind to implement this logic in some manner, but it needs a little more consideration for how it fits into the rest of the system. For example, in this configuration, each page is still one matched result, and the fragment data must be loaded before it could be split into sub-results. I think Pagefind will also need to try and index some text alongside the anchors if possible, so that a search result could be displayed as something along the lines of Setting up filters > Tagging an attribute as a filter.

In any case, I would love it if you gave this prerelease a spin! From the sounds of your setup consuming the API directly @bennypowers, I think this would be enough to unblock you. Eager to get any feedback on this feature as it shapes up.

kylebutts commented 1 year ago

Hi Liam!

This is great; I think the search results work great. Just need to get this working into a search component now :-)

bennypowers commented 1 year ago

I found I was unable to derive the kinds of results I wanted from pagefind, but having reconsidered my problem, it seemed that pagefind was not the right tool for the job. I instead opted for Fuse.js, since I already possess a data file of my complete search results, and know ahead of time exactly what I'm searching for, and can build URLs for the search results by convention.

I'm however planning to adopt pagefind for its intended purpose, which is full-site offline search.

bglw commented 1 year ago

Note from #265 — the anchors returned should also try to include the text of the anchor, in the case of headings.

delucis commented 1 year ago

The way Algolia DocSearch does this is to chunk content into pretty small pieces and index those instead of whole pages, kind of as @bglw outlined in this comment.

Each piece of content has a metadata of heading hierarchy breadcrumbs, e.g. this HTML:

<h1 id="page-title">Page title</h1>
<h2 id="subheading">Subheading</h2>
<h3 id="details">Details</h3>
<p>Interesting stuff.</p>
<h3 id="more-details">More details</h3>
<p>Some content in a hierarchy of heading elements.</p>

Could return a result like:

{
  content: 'Some content in a hierarchy of heading elements.',
  url: '#more-details',
  hierarchy: {
    1: 'Page title',
    2: 'Subheading',
    3: 'More details',
  },
}

A nice part of heading hierarchies is they also make sense when sorting: you can decide to show <h2>My query</h2> after <h1>My query</h1> for example. Obviously, this is a bit specific to pages that have this kind of hierarchy like documentation or blog posts. Sites with other kinds of content that is more atomic may benefit more from treating each page as a single blob.

bglw commented 1 year ago

I haven't landed 100% on the implementation yet, but for now I've taken a different path than splitting the pages into separate chunks in the index. One reason is that if a search is a hit for two sections of a page, I quite like being able to show the result like:

Page Title
├─ Heading 1
│  └─ <result excerpt> 
└─ Heading 2
   └─ <result excerpt>

Due to the way Pagefind hashes and chunks content, there's no way to know that any two results are related to each other until their final fragments are loaded, which is usually lazy-loaded on scroll or pagination.

So the goal is to keep the results 1:1 with the input pages, but to mark them up such that a synthetic version of the heading split can be returned. This also has the benefit that you don't have to make any decision about this when indexing the site, it's entirely a runtime search config, which makes me happy. (It does mean that fully splitting one result into multiple will be tough if you're showing placeholders before it has loaded).

The general idea is that Pagefind will return this shape for each result (which it is currently doing, sans-header-text):

{
  // some fields omitted
  "url": "/docs/filtering/",
  "anchors": [{
      "element": "h2",
      "id": "tagging-an-element-as-a-filter",
      "text": "Tagging an element as a filter",
      "location": 18
    }, {
      "element": "h2",
      "id": "tagging-an-attribute-as-a-filter",
      "text": "Tagging an attribute as a filter",
      "location": 87
    }],
  "locations": [ 3, 6, 23, 40, 51, 65, 93, 96, 107, 116 ]
}

But you won't need to interact with that directly. Instead, the pagefind.js wrapper will be able to use this information to return a search result for only /docs/filtering/#tagging-an-element-as-a-filter, or return multiple results, or skip this behaviour altogether, depending on the options you pass in.

Let me know if you spot any glaring issues with this plan, though! Happy to hear more, but I think this strikes the right balance for Pagefind specifically 🙂

delucis commented 1 year ago

Nice! Makes sense.

Does anchors here contain only the anchors that match the search query, or every heading on the page? Thinking about your example output where you're showing a kind of breadcrumb Page title > Heading > result excerpt, it would be good to be able to show Heading for matching excerpts even if the heading itself doesn't match.

bglw commented 1 year ago

Currently it's piping through all anchors that existed on the page, the anchors array is wholly unrelated to the search query — it's then up to the consuming js (ideally internally to Pagefind) to figure out which anchor matches the locations array (which contains the locations on the page that did match the search query).

So if you see that your search was a hit on the page at location 32, and there is an h3 heading at location 28, you can choose to give the result as under that h3 (or not, maybe you want to tie it to the h2 at an earlier location, etc).

This won't present anything regarding nesting/hierarchy, so if you're wanting to build a breadcrumb of headings you would need to reconstruct that from the list of anchors, if that makes sense.
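One way to reconstruct that is sketched below, assuming anchors carry `element`, `id`, `location`, and (once populated) `text`. `withBreadcrumbs` and the sample anchors are hypothetical and only illustrate the approach:

```javascript
// Hypothetical sketch: rebuild heading breadcrumbs from the flat anchors
// list by tracking the most recent open heading at each level.
function withBreadcrumbs(anchors) {
  const stack = []; // open headings, shallowest first
  const out = [];
  const ordered = [...anchors].sort((a, b) => a.location - b.location);

  for (const anchor of ordered) {
    const match = /^h([1-6])$/.exec(anchor.element);
    if (!match) continue; // non-heading anchors (e.g. a <tr id>) are skipped here
    const level = Number(match[1]);

    // Close any headings at the same or a deeper level.
    while (stack.length > 0 && stack[stack.length - 1].level >= level) stack.pop();
    stack.push({ level, anchor });

    // Fall back to the id when no heading text is available.
    const breadcrumbs = stack.map((entry) => entry.anchor.text || entry.anchor.id);
    out.push({ ...anchor, breadcrumbs });
  }
  return out;
}

const sampleAnchors = [
  { element: "h2", id: "subheading", text: "Subheading", location: 2 },
  { element: "h3", id: "details", text: "Details", location: 3 },
  { element: "h3", id: "more-details", text: "More details", location: 6 },
  { element: "h2", id: "other", text: null, location: 20 },
];

const crumbs = withBreadcrumbs(sampleAnchors);
for (const entry of crumbs) {
  console.log(entry.id, entry.breadcrumbs.join(" > "));
}
```

A match location can then be tied to its closest preceding anchor and displayed with that anchor's breadcrumbs.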

anoopsinghbayes commented 1 year ago

I am looking forward to this feature. I tried the alpha version (I think v1.0.0-alpha.5 is the latest), but I don't get the text. What I get is something like this:

```json
{
  // some fields omitted
  "url": "/docs/filtering/",
  "anchors": [{
      "element": "h2",
      "id": "tagging-an-element-as-a-filter",
      "text": null,
      "location": 18
    }, {
      "element": "h2",
      "id": "tagging-an-attribute-as-a-filter",
      "text": null,
      "location": 87
    }],
  "locations": [ 3, 6, 23, 40, 51, 65, 93, 96, 107, 116 ]
}
```

I was expecting this:

```json
{
  // some fields omitted
  "url": "/docs/filtering/",
  "anchors": [{
      "element": "h2",
      "id": "tagging-an-element-as-a-filter",
      "text": "Tagging an element as a filter",
      "location": 18
    }, {
      "element": "h2",
      "id": "tagging-an-attribute-as-a-filter",
      "text": "Tagging an attribute as a filter",
      "location": 87
    }],
  "locations": [ 3, 6, 23, 40, 51, 65, 93, 96, 107, 116 ]
}
```

bglw commented 1 year ago

Hi @anoopsinghbayes 👋

The text field is not (yet) being populated — I'm working on this very soon so expect an update in the coming weeks 🙂

bglw commented 1 year ago

Hey @anoopsinghbayes / all,

The v1.0.0-alpha.9 release includes the text field in the anchors list. There are some notes on how this is extracted in #369.

Let me know if you take a look at it — automatic results for headings calculated by Pagefind will come soon.

anoopsinghbayes commented 1 year ago

@bglw checked, and I am able to get the text. Thanks a lot!

bglw commented 1 year ago

Hello everyone! 👋

Great news — this has landed in Pagefind v1.0.0! ✨

See the full release notes here: https://github.com/CloudCannon/pagefind/releases/tag/v1.0.0 💙

And the documentation here: https://pagefind.app/docs/sub-results/

kylebutts commented 1 year ago

Congrats @bglw! This is really exciting stuff!! There are some interesting things going on in the Astro Discord, implementing this in Starlight right now too :-)

CloudCannon / pagefind

Search index based on headers #215
