CloudCannon / pagefind

Static low-bandwidth search at scale
https://pagefind.app
MIT License

loadFragment() with given hash #371

Open indus opened 1 year ago

indus commented 1 year ago

Is it possible to directly load a fragment for a hash that was not obtained by the search? Something like a public version of the loadFragment function: https://github.com/CloudCannon/pagefind/blob/main/pagefind_web_js/lib/coupled_search.ts#L234 ?

bglw commented 1 year ago

That could definitely be exposed, though Pagefind offers no method for finding a hash you're looking for outside of a search result. What's the use-case you're looking to fill here? (How are you planning to get the hash to pass to this function?) There might be a better way to get there 🙂

indus commented 1 year ago

I may have to give you some background on this. I'm using pagefind on geospatial datasets that come with a title, description, quicklook images, etc. (the usual :-). It does a really great job and allows me to do 98% of my search needs in a wonderful easy way (thanks for the lib btw.).

The remaining 2% of search functionality that I would like to implement, and that Pagefind can't solve, is a geospatial search like bounding-box intersection. There are other very performant libraries (like https://github.com/mourner/flatbush) to do this. My idea is to just have a small list of structs with only a bounding box and the Pagefind hash, and to feed one of those libraries with it. I would then do my intersection and use the hash to get the metadata (title, image, url, etc.) directly from the Pagefind index.

To build the list of bbox+hash structs I planned to just query all data from all records with a search term 'null'; not at runtime but at build time (build page -> build pagefind index -> build geospatial index).

I hope this makes sense.
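A minimal sketch of the bbox+hash idea described above, assuming a hypothetical `BBoxRecord` shape and made-up hashes. The linear scan only illustrates the intersection test; a spatial index such as flatbush would take its place for large datasets.

```typescript
// Hypothetical shapes: neither type is part of Pagefind or flatbush.
interface Box {
  minX: number;
  minY: number;
  maxX: number;
  maxY: number;
}

interface BBoxRecord extends Box {
  hash: string; // Pagefind fragment hash collected at build time
}

// Two boxes intersect when they overlap on both axes.
function intersects(a: Box, b: Box): boolean {
  return a.minX <= b.maxX && a.maxX >= b.minX && a.minY <= b.maxY && a.maxY >= b.minY;
}

// Linear scan for clarity; a spatial index would replace this step.
function hashesInBox(records: BBoxRecord[], query: Box): string[] {
  return records.filter((record) => intersects(record, query)).map((record) => record.hash);
}

// Made-up example data.
const bboxRecords: BBoxRecord[] = [
  { hash: "en_aaaaaaa", minX: 0, minY: 0, maxX: 10, maxY: 10 },
  { hash: "en_bbbbbbb", minX: 20, minY: 20, maxX: 30, maxY: 30 },
];

const bboxHits = hashesInBox(bboxRecords, { minX: 5, minY: 5, maxX: 15, maxY: 15 });
```

The resulting hashes are what a public fragment-loading call would consume, if one were exposed as requested in this issue.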

indus commented 1 year ago

I'm not sure how relevant it would be for other applications and what it would mean for the pagefind code, but maybe an attribute like pagefind-hash='<custom_hash>' that allows for a custom user defined hash would make this database like (mis-)usage of the pagefind index even easier and more flexible. Or as an alternative an option to write a plain JSON file of the index at build time with the hash as key and the metadata as value?!?

bglw commented 1 year ago

Ah, cool! Nice use-case.

The purpose of the hashes is to eliminate any stale caching issues, so I'd be hesitant to provide a custom hash functionality. The option for Pagefind to write a plain JSON file is totally doable, though, I'll look into that. And no reason the explicit call to load a fragment can't be exposed, so I'll tackle that too.

indus commented 1 year ago

Thanks for your effort.

julbd commented 2 months ago

@bglw I have this issue with the node library. A quick solution could be to return the hash, alongside other data, when the record is created.

marcuswhybrow commented 2 months ago

A quick solution could be to return the hash, alongside other data, when the record is created.

This is exactly what I need too. In my project I have 500 indexed HTML files all of which I display in my web UI. Displaying and visually filtering this many elements forces me to await data() for every result to get the url to identify each file displayed in my UI.

Workarounds

My workaround is to create a reverse lookup table from result url (hidden by the await data()) to result id by calling pagefind.search(null) in the background on page load. Awaiting 500 calls to data() massively blocks the event loop leading to UI freeze.

To fix the UI freezing, one needs the Scheduler API or a setTimeout hack to break up the task sizes and allow UI updates some execution time.
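A rough sketch of the batching workaround described above, using the `setTimeout` variant. The `SearchResult` shape only mirrors what this thread describes (an `id` plus an async `data()` whose fragment contains the `url`); `chunk`, `yieldToEventLoop`, `buildUrlLookup` and the default batch size are names and values chosen for illustration.

```typescript
// Shape of a search result as described in this thread.
interface SearchResult {
  id: string;
  data: () => Promise<{ url: string }>;
}

// Split an array into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new RangeError("size must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Resolve on a later macrotask so the UI gets a chance to update in between.
const yieldToEventLoop = (): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, 0));

// Build the url -> id reverse lookup one batch at a time.
async function buildUrlLookup(
  results: SearchResult[],
  batchSize = 25,
): Promise<Map<string, string>> {
  const lookup = new Map<string, string>();
  for (const batch of chunk(results, batchSize)) {
    const pairs = await Promise.all(
      batch.map(async (result) => [(await result.data()).url, result.id] as const),
    );
    for (const [url, id] of pairs) lookup.set(url, id);
    await yieldToEventLoop();
  }
  return lookup;
}
```

It would be called with the results of `pagefind.search(null)`; the batch size trades total load time against responsiveness and would need tuning per site.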

I also tried moving await data() and Pagefind searching in general into a service worker (so it's on a separate thread and not blocking the main event loop) but Pagefind complains that window doesn't exist (see #605).

Solution?

If pagefind.addHTMLFile (in the node wrapper) returned the result id, this would solve my main issue trying to filter hundreds of results efficiently.

bglw commented 2 months ago

Hi all 👋

I'll be working on this one soon, along with #715

Both will come via a CLI flag to output a file containing information about the index (filters, fragments, etc.). This will be output at the conclusion of the build.

The API will gain a matching function, something like await index.getIndexCatalogue() (name pending 🤷). This would be called between adding the last file to the index and writing content, or possibly be a return value from writing content. TBD. In any case, let me know if that sounds like it will be viable 🙂

If pagefind.addHTMLFile (in the node wrapper) returned the result id, this would solve my main issue

Unfortunately this one isn't possible without some more changes. At present, the IDs aren't allocated until the conclusion of indexing, so they aren't known at the point of responding to any of the add* functions.

julbd commented 2 months ago

Hi @bglw. Thank you for listening to our issues :)

In my use case, the best solution would be to have the record hash directly returned by addCustomRecord().

bglw commented 2 months ago

Hmm, well that needs some more thought 😅

Just to rattle off some thoughts, for context and for myself:

Pagefind uses fairly short page IDs, to reduce the size of the metadata it needs to load up front. The downside of this is that collisions can and do happen, so the IDs are allocated at the end of the indexing, and pages will adjust their hash if it would collide. One goal for this is that both pages should adjust, which means the ID of a page may need to change after it has been allocated.

So the big issue is until all files have been indexed, we don't know how short to make the page ID.

The primary purpose of these hashes is to solve caching issues when the index changes after a build, so I'm hesitant to change the strategy too significantly.

One idea that might work would be to adopt a git-ish concept of short and long IDs, and return the long ID from the add* functions. So your response would come back with a record hash like en_11badb2e36d2246bc6756b4a2f38479d3893692. Ultimately that page will be stored as en_11badb2, or en_11badb2e, or maybe even en_11badb2e3. In any case it'll be a prefix of the full page hash. Pagefind would then allow you to supply a full-length page hash and it'll find the relevant fragment.
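The short/long ID lookup proposed above could look roughly like this. It is only a sketch of the idea, not an existing Pagefind function: `resolveShortId` is a hypothetical name, and all stored IDs other than the example from this comment are made up.

```typescript
// Return the stored short ID that is a prefix of the given long ID, if any.
// When several stored IDs match, the longest (most specific) one wins.
function resolveShortId(longId: string, storedIds: string[]): string | undefined {
  let best: string | undefined;
  for (const id of storedIds) {
    if (longId.startsWith(id) && (best === undefined || id.length > best.length)) {
      best = id;
    }
  }
  return best;
}

// Hypothetical set of short IDs as they might end up in a built index.
const storedShortIds = ["en_11badb2e", "en_7f3c9a10", "fr_11badb2e"];

const resolvedId = resolveShortId(
  "en_11badb2e36d2246bc6756b4a2f38479d3893692",
  storedShortIds,
);
```

Because the language prefix is part of the ID, `fr_11badb2e` does not match even though its hash portion is identical.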

With that:

marcuswhybrow commented 2 months ago

Hi @bglw

Given the ID de-duplication restrictions you mention, it's fine for my use case to leave addHTMLFile as is in favor of this new index.getIndexCatalogue(). So long as I can, somewhere at compile time, determine which ultimate result id matches which result url it's fine if this happens at the end of indexing.

Cheers.

julbd commented 2 months ago

Thank you for your detailed answer. I understand the issue you are facing and why the ID is not already returned on record creation. Your solution would work but I see two issues:

I would rather suggest, if possible, checking the ID availability (and regenerating it if duplicate) at creation time. However, that's fine, I can use getIndexCatalogue() too!


Edit: I'm thinking of the following solution that would address our use-cases more directly. In my case, I have a map with points. I need to know the location of all the points (with and without filters), but I only need the location. Currently, I am relying on a pre-generated JSON file to retrieve the location from the fragment ID, without having to fetch each fragment individually.

This issue could be resolved with the combination of:

marcuswhybrow commented 2 months ago

As a user story, my initial developer ergonomics expectation was that search(null) would return { results } that each had some id that pointed "back" to the indexed content given. Pagefind calls this the url, I think, but the url is inside the fragment. What { results } contains is an id, which, because it contains the word "unknown", led me to believe I was supposed to supply the id somewhere for each Pagefind indexed item.

After some digging, I now understand that id points "forward" to the fragment henceforth to be loaded by data().

One possibility, perhaps too piecemeal a change, is including url in { results } to open up "backward" referencing to arbitrary data. Personally, I think getIndexCatalogue() is the better alternative since, although a "fragment id to url lookup data-structure" is being transferred to the client in either case, the getIndexCatalogue() approach is an opt-in cost.

@julbd, getIndexCatalogue() alone solves my page load issue (as well as fast filtering).

@bglw however, once that's solved, arbitrary (build time) fragment splitting could indeed massively reduce my UI's search times. For hundreds of indexed items I only need { id, excerpt } from { sub_results }, but for that I'm loading every word in the index document too, since it's all in the same fragment. Currently I'm amortising this cost by loading fragments for visible results first, then loading the out of viewport stuff afterwards.

I think it's fair to say that loading hundreds of fragments could be considered out of scope for Pagefind. I'm sure 95% of applications are paginating results. Also fragment splitting sounds like a major rewrite of core functionality.

Anyway, just food for thought. And thanks for helping. getIndexCatalogue() is probably enough for me. Cheers.

bglw commented 2 months ago

👋 @julbd

you'll have the same bandwidth issue with long ID

Correct! That's the limitation. For people loading them all into a client-side bundle, the recommendation would be to use the indexCatalogue to look up the corresponding short hash, but at that point you may as well just rely on the indexCatalogue for everything.

From my side, I'll continue with the indexCatalogue idea and we'll see how it goes, but we can revisit the idea of returning hashes while indexing if it seems crucial!

I would rather suggest, if possible, to check the ID availability (and regenerate it if duplicate) at creation time.

The main blocker here is that:

Importantly for the second one, playing through a scenario:

Now if any user has the hash fragment for abc still cached from build 1, but they search using build 2, they'll get the fragment for Page A but it should be for Page B. Hence, the ideal situation is both pages change due to the clash, and take the IDs abcf and abcr.
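To make the clash handling above concrete, here is a simplified sketch in which every page whose short prefix collides gets a longer ID, so neither page keeps the ambiguous one. It is not Pagefind's actual allocation code: `allocateShortIds`, the minimum length of 3, and the sample hashes are assumptions for illustration.

```typescript
// Assign each full hash the shortest prefix (at least `minLength` characters)
// that no other full hash shares. Both sides of a clash end up lengthened.
// All hashes must be known up front, which is why this runs after indexing.
function allocateShortIds(fullHashes: string[], minLength = 3): Map<string, string> {
  const clashes = (prefix: string, owner: string): boolean =>
    fullHashes.some((other) => other !== owner && other.startsWith(prefix));

  const ids = new Map<string, string>();
  for (const hash of fullHashes) {
    let length = Math.min(minLength, hash.length);
    while (length < hash.length && clashes(hash.slice(0, length), hash)) {
      length++;
    }
    ids.set(hash, hash.slice(0, length));
  }
  return ids;
}

// Made-up full hashes: two share the prefix "abc", one does not clash.
const allocatedIds = allocateShortIds(["abcf01", "abcr02", "xyz903"]);
```

With these inputs the two clashing pages become `abcf` and `abcr`, and no page is left holding the bare `abc` that a stale cache might still refer to.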

We are getting into micro-optimizations here! But these are also all scenarios that have been encountered with Pagefind in practice 😅

Fragments "packs"

This is an interesting idea! I like it 🤔 It feels tangential to this issue, would you mind opening a new one for that? :)

👋 @marcuswhybrow

because it contains the word "unknown", led me to believe I was supposed to supply the id somewhere for each Pagefind indexed item

Ah, the unknown prefix there is actually the language! Normally you would see the ID as en_... or fr_.... In the case you have no language attribute on your HTML element, you get the unknown_ language prefix (and webassembly). (Side point, I'd recommend setting the language! In the unknown language you'll miss out on some word stemming)
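As a small illustration of that ID structure, the language prefix can be separated from the hash like this. `splitResultId` is a hypothetical helper, and it assumes the hash portion itself never contains an underscore.

```typescript
// Split an ID such as "en_11badb2" into its language prefix and hash.
// Assumption: the hash part contains no underscore, so the last "_" is the separator.
function splitResultId(id: string): { language: string; hash: string } {
  const separator = id.lastIndexOf("_");
  if (separator === -1) {
    return { language: "", hash: id };
  }
  return { language: id.slice(0, separator), hash: id.slice(separator + 1) };
}

const untaggedPage = splitResultId("unknown_11badb2");
const englishPage = splitResultId("en_11badb2");
```

A result whose `language` comes back as `unknown` is a hint that the page's HTML element has no language attribute.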

in either case, the getIndexCatalogue() approach is an opt-in cost

Agreed! The URL not being returned is quite intentional, so I'd be resistant to adding it. (Currently all IDs are loaded up front with Pagefind, and loading the URLs at the same time would start getting heavy). I like that the indexCatalogue concept gives an extension to some of these niche use cases where it's needed without impacting the base case for bandwidth.

arbitrary (build time) fragment splitting

Can you elaborate? The two ways I can read this are:

I'm sure 95% of applications are paginating results

Correct! Or my favorites use an IntersectionObserver to load the fragment when the result enters the viewport :)
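The IntersectionObserver approach mentioned above might be wired up along these lines in the browser. This is a sketch under assumptions: `LazyResult` only mirrors the `data()` call discussed in this thread, and `once`, `loadWhenVisible` and the `render` callback are illustrative names.

```typescript
// Minimal shape of a result whose fragment is loaded on demand.
interface LazyResult {
  data: () => Promise<unknown>;
}

// Cache the promise so a fragment is requested at most once,
// even if several triggers ask for it.
function once<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => (pending ??= load());
}

// Browser-only: start loading the fragment when the element enters the viewport.
function loadWhenVisible(
  element: Element,
  result: LazyResult,
  render: (data: unknown) => void,
): void {
  const load = once(() => result.data());
  const observer = new IntersectionObserver((entries) => {
    if (entries.some((entry) => entry.isIntersecting)) {
      observer.disconnect(); // one load per element is enough
      void load().then(render);
    }
  });
  observer.observe(element);
}
```

Each placeholder element for a result would be passed to `loadWhenVisible` together with its result, so fragments outside the viewport are never fetched until the user scrolls to them.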

marcuswhybrow commented 2 months ago

@bglw

the unknown prefix there is actually the language

😆 Cheers.

I've sent you PR #719 with two minor additions to the getting started docs regarding the lang attribute and its relationship to result ids. I think the PR prevents my bad interpretation for other new users.

arbitrary (build time) fragment splitting

Can you elaborate?

Option 2 (+ extras): Multiple fragments for each indexed file:

This opt-in generation of fragment subsets would allow users to make their own trade-offs between the number of HTTP requests required and (even further) reduced bandwidth (reduced search times).

I'm not overly familiar with the Pagefind code-base itself, so take my idea with a pinch of salt, but that's my conceptualisation of @julbd's idea.

my favorites use an IntersectionObserver

I think I'll give that a go!