loadFragment() with given hash

indus commented 1 year ago

Is it possible to directly load a fragment for a hash that was not obtained by the search? Something like a public version of the loadFragment function: https://github.com/CloudCannon/pagefind/blob/main/pagefind_web_js/lib/coupled_search.ts#L234 ?

bglw commented 1 year ago

That could definitely be exposed, though Pagefind offers no method for finding a hash you're looking for outside of a search result. What's the use-case you're looking to fill here? (How are you planning to get the hash to pass to this function?) There might be a better way to get there 🙂

indus commented 1 year ago

I may have to give you some background on this. I'm using pagefind on geospatial datasets that come with a title, description, quicklook images, etc. (the usual :-). It does a really great job and allows me to do 98% of my search needs in a wonderful easy way (thanks for the lib btw.).

The remaining 2% of search functionality I would like to implement that Pagefind can't solve is a geospatial search like bounding-box intersection. There are other very performant libraries (like https://github.com/mourner/flatbush) to do this. My idea is to just have a small list of structs with only a bounding box and the Pagefind hash to feed one of those libraries with. I would then make my intersection and use the hash to get the metadata (title, image, url, etc.) directly from the Pagefind index.

To build the list of bbox+hash structs I planned to just query all data from all records with a search term 'null'; not at runtime but at build time (build page -> build pagefind index -> build geospatial index).
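A minimal sketch of the bbox+hash idea described above. The `GeoEntry` shape, the function names, and the hashes are made up for illustration, and a plain linear scan stands in for a real spatial index such as flatbush:

```typescript
// [minX, minY, maxX, maxY], e.g. in longitude/latitude degrees.
type BBox = [number, number, number, number];

// One entry of the small build-time list: a bounding box plus the
// Pagefind hash of the record it belongs to (hypothetical shape).
interface GeoEntry {
  bbox: BBox;
  hash: string;
}

// True when two axis-aligned boxes overlap or touch.
function intersects(a: BBox, b: BBox): boolean {
  return a[0] <= b[2] && a[2] >= b[0] && a[1] <= b[3] && a[3] >= b[1];
}

// Linear scan standing in for a spatial index; returns the hashes of
// all entries whose bounding box intersects the query box.
function hashesInBBox(entries: GeoEntry[], query: BBox): string[] {
  return entries
    .filter((entry) => intersects(entry.bbox, query))
    .map((entry) => entry.hash);
}

// Made-up data for illustration only.
const geoEntries: GeoEntry[] = [
  { bbox: [10, 45, 12, 47], hash: "en_aaaaaaa" },
  { bbox: [-5, 50, -3, 52], hash: "en_bbbbbbb" },
];

const geoHits = hashesInBBox(geoEntries, [11, 46, 13, 48]);
console.log(geoHits);
```

Each returned hash would then be handed to whatever fragment loader ends up being exposed, to fetch the title, image, and url for that record.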

I hope this makes sense.

indus commented 1 year ago

I'm not sure how relevant it would be for other applications and what it would mean for the Pagefind code, but maybe an attribute like pagefind-hash='<custom_hash>' that allows for a custom user-defined hash would make this database-like (mis-)usage of the Pagefind index even easier and more flexible. Or, as an alternative, an option to write a plain JSON file of the index at build time with the hash as key and the metadata as value?

bglw commented 1 year ago

Ah, cool! Nice use-case.

The purpose of the hashes is to eliminate any stale caching issues, so I'd be hesitant to provide a custom hash functionality. The option for Pagefind to write a plain JSON file is totally doable, though, I'll look into that. And no reason the explicit call to load a fragment can't be exposed, so I'll tackle that too.

indus commented 1 year ago

Thanks for your effort.

julbd commented 2 months ago

@bglw I have this issue with the node library. A quick solution could be to return the hash, alongside other data, when the record is created.

marcuswhybrow commented 2 months ago

A quick solution could be to return the hash, alongside other data, when the record is created.

This is exactly what I need too. In my project I have 500 indexed HTML files all of which I display in my web UI. Displaying and visually filtering this many elements forces me to await data() for every result to get the url to identify each file displayed in my UI.

Workarounds

My workaround is to create a reverse lookup table from result url (hidden by the await data()) to result id by calling pagefind.search(null) in the background on page load. Awaiting 500 calls to data() massively blocks the event loop leading to UI freeze.

To fix the UI freezing, one needs the Scheduler API or a setTimeout hack to break up the task sizes and allow UI updates some execution time.
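A rough sketch of the setTimeout variant of this workaround, assuming each search result exposes an `id` and an async `data()` that resolves to an object with a `url`. The `ResultLike` interface, the helper names, and the batch size are illustrative, not part of Pagefind:

```typescript
// Assumed shape of a search result: `id` is the fragment hash and
// `data()` loads the fragment (only `url` is needed here).
interface ResultLike {
  id: string;
  data: () => Promise<{ url: string }>;
}

// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Give the event loop a turn so the UI can update between batches.
function yieldToEventLoop(): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, 0));
}

// Build the url -> id reverse lookup table in small batches instead of
// awaiting every data() call in one long task.
async function buildUrlToId(
  results: ResultLike[],
  batchSize = 20,
): Promise<Map<string, string>> {
  const urlToId = new Map<string, string>();
  for (const batch of chunk(results, batchSize)) {
    const pairs = await Promise.all(
      batch.map(async (result) => ({
        url: (await result.data()).url,
        id: result.id,
      })),
    );
    for (const pair of pairs) {
      urlToId.set(pair.url, pair.id);
    }
    await yieldToEventLoop();
  }
  return urlToId;
}
```

The batch size trades total load time against responsiveness: smaller batches yield more often but take longer overall.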

I also tried moving await data() and Pagefind searching in general into a service worker (so it's on a separate thread and not blocking the main event loop) but Pagefind complains that window doesn't exist (see #605).

Solution?

If pagefind.addHTMLFile (in the node wrapper) returned the result id, this would solve my main issue trying to filter hundreds of results efficiently.

bglw commented 2 months ago

Hi all 👋

I'll be working on this one soon, along with #715

Both will come via a CLI flag to output a file containing information about the index — filters, fragments, etc. This will be output at the conclusion of the build.

The API will gain a matching function, something like await index.getIndexCatalogue() (name pending 🤷). This would be called between adding the last file to the index and writing content, or possibly be a return value from writing content. TBD. In any case, let me know if that sounds like it will be viable 🙂

If pagefind.addHTMLFile (in the node wrapper) returned the result id, this would solve my main issue

Unfortunately this one isn't possible without some more changes. At present, the IDs aren't allocated until the conclusion of indexing, so they aren't known at the point of responding to any of the add* functions.

julbd commented 2 months ago

Hi @bglw. Thank you for listening to our issues :)

In my use case, the best solution would be to have the record hash directly returned by addCustomRecord().

bglw commented 2 months ago

Hmm, well that needs some more thought 😅

Just to rattle off some thoughts, for context and for myself:

Pagefind uses fairly short page IDs, to reduce the size of the metadata it needs to load up front. The downside of this is that collisions can and do happen, so the IDs are allocated at the end of the indexing, and pages will adjust their hash if it would collide. One goal for this is that both pages should adjust, which means the ID of a page may need to change after it has been allocated.

So the big issue is until all files have been indexed, we don't know how short to make the page ID.

The primary purpose of these hashes is to solve caching issues when the index changes after a build, so I'm hesitant to change the strategy too significantly.

One idea that might work would be to adopt a git-ish concept of short and long IDs, and return the long ID from the add* functions. So your response would come back with a record hash like en_11badb2e36d2246bc6756b4a2f38479d3893692. Ultimately that page will be stored as en_11badb2, or en_11badb2e, or maybe even en_11badb2e3; in any case it'll be a prefix of the full page hash. Pagefind would then allow you to supply a full-length page hash and it'll find the relevant fragment.

With that:

Calling something like addCustomRecord() would return:

{
  uniqueWords: 1234,
  url: "....",
  meta: { /* ... */ },
  long_id: "en_11badb2e36d2246bc6756b4a2f38479d3893692"
}

When finished indexing, getIndexCatalogue would be able to return both the long and short ids for any given page.
The loadFragment function would accept either a long id or a short id
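To illustrate how a lookup could accept either form, here is a small sketch that resolves a long or short ID to the short ID actually stored in the index by prefix matching. `resolveFragmentId` and the list of stored IDs are hypothetical; this is not how Pagefind implements it:

```typescript
// Given an ID supplied by the caller (long or short) and the short IDs
// stored in the index, return the stored ID that is a prefix of the
// supplied one. An exact match is the special case where both are equal.
// If several stored IDs match, the longest one wins.
function resolveFragmentId(
  suppliedId: string,
  storedShortIds: string[],
): string | undefined {
  let best: string | undefined;
  for (const shortId of storedShortIds) {
    const matches = suppliedId.startsWith(shortId);
    if (matches && (best === undefined || shortId.length > best.length)) {
      best = shortId;
    }
  }
  return best;
}

// Illustrative index contents; the second ID is made up.
const storedIds = ["en_11badb2", "en_22ffee0"];

console.log(
  resolveFragmentId("en_11badb2e36d2246bc6756b4a2f38479d3893692", storedIds),
);
console.log(resolveFragmentId("en_11badb2", storedIds));
```

Both calls resolve to the same stored ID, which is the property the long/short scheme relies on.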

marcuswhybrow commented 2 months ago

Hi @bglw

Given the ID de-duplication restrictions you mention, it's fine for my use case to leave addHTMLFile as is in favor of this new index.getIndexCatalogue(). So long as I can, somewhere at compile time, determine which ultimate result id matches which result url it's fine if this happens at the end of indexing.

Cheers.

julbd commented 2 months ago

Thank you for your detailed answer. I understand the issue you are facing and why the ID is not already returned on record creation. Your solution would work but I see two issues:

If the goal of the short ID is to reduce bandwidth usage (it sounds like micro-optimization, but ok), then you'll have the same bandwidth issue with the long ID (which will be used in a client-side reverse-lookup table).
KISS (Keep it stupidly simple).

I would rather suggest, if possible, checking the ID availability (and regenerating it if duplicate) at creation time. However, that's fine, I can use getIndexCatalogue() too!

Edit: I'm thinking of the following solution that would address our use-cases more directly. In my case, I have a map with points. I need to know the location of all the points (with and without filters), but I only need the location. Currently, I am relying on a pre-generated JSON file to retrieve the location from the fragment ID, without having to fetch each fragment individually.

This issue could be resolved with the combination of:

Multiple "views", according to the data we need to fetch (all of it or a small portion of it). But this can already be achieved by generating two separate indexes.
Fragment "packs". The search client would prefer to load some pre-generated packs of fragments (by filters, common words, user-defined key, etc.) when it detects that the cost of the additional data transfer (do we care with today's traffic speed?) is lower than the cost of the number of requests (do we care with HTTP/2?).

marcuswhybrow commented 2 months ago

As a user story, my initial developer ergonomics expectation was that search(null) would return { results } that each had some id that pointed "back" to the indexed content given. Pagefind calls this the url, I think, but the url is inside the fragment. What { results } contains is an id, which, because it contains the word "unknown", led me to believe I was supposed to supply the id somewhere for each Pagefind indexed item.

After some digging, I now understand that id points "forward" to the fragment henceforth to be loaded by data().

One possibility, perhaps too piecemeal a change, is including url in { results } to open up "backward" referencing to arbitrary data. Personally, I think getIndexCatalogue() is the better alternative since, although a "fragment id to url lookup data-structure" is being transferred to the client in either case, the getIndexCatalogue() approach is an opt-in cost.

@julbd, getIndexCatalogue() alone solves my page load issue (as well as fast filtering).

@bglw however, once that's solved, arbitrary (build time) fragment splitting could indeed massively reduce my UI's search times. For hundreds of indexed items I only need { id, excerpt } from { sub_results }, but for that I'm loading every word in the index document too, since it's all in the same fragment. Currently I'm amortising this cost by loading fragments for visible results first, then loading the out of viewport stuff afterwards.

I think it's fair to say that loading hundreds of fragments could be considered out of scope for Pagefind. I'm sure 95% of applications are paginating results. Also fragment splitting sounds like a major rewrite of core functionality.

Anyway, just food for thought. And thanks for helping. getIndexCatalogue() is probably enough for me. Cheers.

bglw commented 2 months ago

👋 @julbd

you'll have the same bandwidth issue with long ID

Correct! That's the limitation. For people loading them all into a client-side bundle, the recommendation would be to use the indexCatalogue to look up the corresponding short hash — but at that point you may as well just rely on the indexCatalogue for everything.

From my side, I'll continue with the indexCatalogue idea and we'll see how it goes, but we can revisit the idea of returning hashes while indexing if it seems crucial!

I would rather suggest, if possible, to check the ID availability (and regenerate it if duplicate) at creation time.

The main blocker here is that:

IDs need to be hashes to make subsequent builds stable
Indexing should not be order-dependent

Importantly for the second one, playing through a scenario:

We have a page A with hash abcfm
We have a page B with hash abcrt
Page A is indexed first, and takes the ID abc
Page B is indexed and tries to take abc, fails, and instead takes abcr
Pagefind runs again, but this time indexes Page B first, so it succeeds in getting ID abc
Page A now ends up with the ID of abcf

Now if any user has the hash fragment for abc still cached from build 1, but they search using build 2, they'll get the fragment for Page A but it should be for Page B. Hence, the ideal situation is both pages change due to the clash, and take the IDs abcf and abcr.
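The scenario above can be sketched as follows. This is only an illustration of order-independent allocation, assuming a minimum ID length of 3 to match the example; `allocateIds` is a made-up name and this is not Pagefind's actual implementation:

```typescript
// Allocate the shortest prefix (at least `minLength` characters) of
// each full hash that no other hash shares. Because every page is
// compared against the complete set of hashes, the outcome does not
// depend on the order in which pages were indexed, and both sides of a
// clash get lengthened.
function allocateIds(hashes: string[], minLength = 3): Map<string, string> {
  const ids = new Map<string, string>();
  for (const hash of hashes) {
    let length = Math.min(minLength, hash.length);
    while (
      length < hash.length &&
      hashes.some(
        (other) => other !== hash && other.startsWith(hash.slice(0, length)),
      )
    ) {
      length += 1;
    }
    ids.set(hash, hash.slice(0, length));
  }
  return ids;
}

// Page A has hash abcfm, Page B has hash abcrt.
const buildOne = allocateIds(["abcfm", "abcrt"]); // A indexed first
const buildTwo = allocateIds(["abcrt", "abcfm"]); // B indexed first

console.log(buildOne.get("abcfm"), buildOne.get("abcrt")); // abcf abcr
console.log(buildTwo.get("abcfm"), buildTwo.get("abcrt")); // abcf abcr
```

This only works once every hash is known, which is why the IDs cannot be handed out while files are still being added.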

We are getting into micro-optimizations here! But these are also all scenarios that have been encountered with Pagefind in practice 😅

Fragments "packs"

This is an interesting idea! I like it 🤔 It feels tangential to this issue, would you mind opening a new one for that? :)

👋 @marcuswhybrow

because it contains the word "unknown", led me to believe I was supposed to supply the id somewhere for each Pagefind indexed item

Ah, the unknown prefix there is actually the language! Normally you would see the ID as en_... or fr_.... In the case you have no language attribute on your HTML element, you get the unknown_ language prefix (and webassembly). (Side point, I'd recommend setting the language! In the unknown language you'll miss out on some word stemming)
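As a small illustration of that ID layout, the sketch below reads the language prefix off a result ID. It assumes the prefix is everything before the first underscore, which matches the en_... and unknown_... examples here but is not a documented guarantee; the function names are made up:

```typescript
// Return the language prefix of a result ID, assuming the layout
// "<language>_<hash>". Returns an empty string if there is no underscore.
function languageOfId(resultId: string): string {
  const separator = resultId.indexOf("_");
  return separator === -1 ? "" : resultId.slice(0, separator);
}

// True when the page was indexed without a language attribute.
function hasUnknownLanguage(resultId: string): boolean {
  return languageOfId(resultId) === "unknown";
}

console.log(languageOfId("en_11badb2")); // en
console.log(hasUnknownLanguage("unknown_11badb2")); // true
```

A check like this could be used during development to flag pages that are missing a language attribute.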

in either case, the getIndexCatalogue() approach is an opt-in cost

Agreed! The URL not being returned is quite intentional, so I'd be resistant to adding it. (Currently all IDs are loaded up front with Pagefind, and loading the URLs at the same time would start getting heavy). I like that the indexCatalogue concept gives an extension to some of these niche use cases where it's needed without impacting the base case for bandwidth.

arbitrary (build time) fragment splitting

Can you elaborate? The two ways I can read this are:

Fragments are combined together into larger files and loaded as large masses
Many fields from the fragments are removed to make each one smaller

I'm sure 95% of applications are paginating results

Correct! Or my favorites use an IntersectionObserver to load the fragment when the result enters the viewport :)

marcuswhybrow commented 2 months ago

@bglw

the unknown prefix there is actually the language

😆 Cheers.

I've sent you PR #719 with two minor additions to the getting started docs re the lang attribute and its relationship to result ids. I think the PR prevents my bad interpretation for other new users.

arbitrary (build time) fragment splitting

Can you elaborate?

Option 2 (+ extras): Multiple fragments for each indexed file:

Currently (I think) Pagefind loads a singular fragment file for each result's data() call (which can be cached).
I believe this fragment contains all the data that Pagefind has for that result.
At build-time, the UI code may know that only a subset of that fragment data is useful.
What if one could call result.data("subset-name") to load a subset of fragment fields.
To achieve this, the node API could expose index.defineFragmentSubset("subset-name", fragment => {}) to generate an arbitrary number of "fragment subset" files for each indexed document during index.writeFiles.
For most use cases a single call to data("subset-name") could replace all calls to data(), reducing overall bandwidth.
In complex scenarios, one could call data("subset-a") and then later data("subset-b").
data() would still be available and perform as it does now, perfect for users who haven't specified any subsets.
There's also an opt-in opportunity to never generate full fragments at all, saving hosting space for those who never call data without a subset name.
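To make the proposal concrete, here is a sketch of what the build-time side could look like. Everything in it is hypothetical: `SubsetRegistry`, `define`, and `render` only mirror the suggested index.defineFragmentSubset idea and do not exist in Pagefind:

```typescript
// A fragment is treated as a plain bag of fields for this sketch.
type FragmentData = Record<string, unknown>;
type SubsetFn = (fragment: FragmentData) => FragmentData;

// Collects named subset definitions and renders one JSON payload per
// subset for a given fragment (standing in for writing subset files).
class SubsetRegistry {
  private subsets = new Map<string, SubsetFn>();

  define(subsetName: string, fn: SubsetFn): void {
    this.subsets.set(subsetName, fn);
  }

  render(fragment: FragmentData): Record<string, string> {
    const files: Record<string, string> = {};
    this.subsets.forEach((fn, subsetName) => {
      files[subsetName] = JSON.stringify(fn(fragment));
    });
    return files;
  }
}

const registry = new SubsetRegistry();
// Keep only the fields a results listing needs.
registry.define("listing", (fragment) => ({
  url: fragment.url,
  excerpt: fragment.excerpt,
}));

const subsetFiles = registry.render({
  url: "/a/",
  excerpt: "hi",
  content: "the full page text, which the listing never needs",
});
console.log(subsetFiles.listing); // {"url":"/a/","excerpt":"hi"}
```

On the client, a call like data("listing") would then fetch only that smaller payload.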

This opt-in generation of fragment subsets would allow users to make their own trade-offs between the number of HTTP requests required and (even further) reduced bandwidth (reduced search times).

I'm not overly familiar with the Pagefind code-base itself, so take my idea with a pinch of salt, but that's my conceptualisation of @julbd's idea.

my favorites use an IntersectionObserver

I think I'll give that a go!

CloudCannon / pagefind

loadFragment() with given hash #371