Open indus opened 1 year ago
That could definitely be exposed, though Pagefind offers no method for finding a hash you're looking for outside of a search result. What's the use-case you're looking to fill here? (How are you planning to get the hash to pass to this function?) There might be a better way to get there.
I may have to give you some background on this. I'm using pagefind on geospatial datasets that come with a title, description, quicklook images, etc. (the usual :-). It does a really great job and allows me to do 98% of my search needs in a wonderful easy way (thanks for the lib btw.).
The remaining 2% of search functionality I would like to implement that Pagefind can't solve is a geospatial search like bounding-box intersection. There are other very performant libraries (like https://github.com/mourner/flatbush) to do this. My idea is to just have a small list of structs with only a bounding box and the Pagefind hash, and to feed one of those libraries with it. I would then run my intersection query and use the hash to get the metadata (title, image, url, etc.) directly from the Pagefind index.
To build the list of bbox+hash structs I planned to just query all data from all records with a search term 'null'; not at runtime but at build time (build page -> build pagefind index -> build geospatial index).
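A minimal sketch of that bbox+hash list, for illustration only. The record shape, the sample hashes and both function names are assumptions, and the linear scan stands in for a real spatial index such as Flatbush:

```javascript
// Hypothetical record shape: { hash, bbox: [minX, minY, maxX, maxY] }.
// `hash` is the Pagefind fragment hash collected at build time (sample values are made up).

/** True when two [minX, minY, maxX, maxY] boxes overlap or touch. */
function intersects(a, b) {
  return a[0] <= b[2] && a[2] >= b[0] && a[1] <= b[3] && a[3] >= b[1];
}

/** Return the hashes of all records whose bbox intersects the query box. */
function hashesInBox(records, query) {
  return records.filter((r) => intersects(r.bbox, query)).map((r) => r.hash);
}

const records = [
  { hash: "en_11badb2", bbox: [10, 45, 12, 47] },
  { hash: "en_22cafe1", bbox: [-5, 50, -3, 52] },
];

// Only the first record overlaps the query box.
const hits = hashesInBox(records, [11, 46, 20, 60]);
```

Each returned hash would then be the key for loading that record's metadata, which is the part that needs a public fragment-loading call.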
I hope this makes sense.
I'm not sure how relevant it would be for other applications and what it would mean for the Pagefind code, but maybe an attribute like `pagefind-hash='<custom_hash>'` that allows for a custom user-defined hash would make this database-like (mis-)usage of the Pagefind index even easier and more flexible.
Or as an alternative an option to write a plain JSON file of the index at build time with the hash as key and the metadata as value?!?
Ah, cool! Nice use-case.
The purpose of the hashes is to eliminate any stale caching issues, so I'd be hesitant to provide a custom hash functionality. The option for Pagefind to write a plain JSON file is totally doable, though, I'll look into that. And no reason the explicit call to load a fragment can't be exposed, so I'll tackle that too.
Thanks for your effort.
@bglw I have this issue with the node library. A quick solution could be to return the hash, alongside other data, when the record is created.
A quick solution could be to return the hash, alongside other data, when the record is created.
This is exactly what I need too. In my project I have 500 indexed HTML files, all of which I display in my web UI. Displaying and visually filtering this many elements forces me to `await data()` for every result to get the `url` to identify each file displayed in my UI.
Workarounds
My workaround is to create a reverse lookup table from result `url` (hidden by the `await data()`) to result `id` by calling `pagefind.search(null)` in the background on page load. Awaiting 500 calls to `data()` massively blocks the event loop, leading to UI freeze.
To fix the UI freezing, one needs the Scheduler API or a `setTimeout` hack to break up the task sizes and allow UI updates some execution time.
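A rough sketch of the `setTimeout` variant of this workaround. The helper names and the batch size of 20 are assumptions; the only Pagefind pieces it relies on are the ones mentioned in this thread (each result has an `id` and a `data()` method, and the loaded fragment has a `url`):

```javascript
/** Split an array into consecutive batches of at most `size` items. */
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

/** Resolve on a later macrotask so pending UI work can run between batches. */
function yieldToEventLoop() {
  return new Promise((resolve) => setTimeout(resolve, 0));
}

/**
 * Build a url -> result id lookup from Pagefind search results,
 * loading fragments in small batches instead of all at once.
 */
async function buildUrlToId(results, batchSize = 20) {
  const urlToId = new Map();
  for (const batch of chunk(results, batchSize)) {
    const fragments = await Promise.all(batch.map((r) => r.data()));
    fragments.forEach((fragment, i) => urlToId.set(fragment.url, batch[i].id));
    await yieldToEventLoop();
  }
  return urlToId;
}
```

It would be called with the `results` array from `pagefind.search(null)`. A smaller batch size gives the UI more breathing room at the cost of a longer total build time for the table.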
I also tried moving `await data()` and Pagefind searching in general into a service worker (so it's on a separate thread and not blocking the main event loop) but Pagefind complains that `window` doesn't exist (see #605).
Solution?
If `pagefind.addHTMLFile` (in the node wrapper) returned the result `id`, this would solve my main issue trying to filter hundreds of results efficiently.
Hi all
I'll be working on this one soon, along with #715
Both will come via a CLI flag to output a file containing information about the index: filters, fragments, etc. This will be output at the conclusion of the build.
The API will gain a matching function, something like `await index.getIndexCatalogue()` (name pending). This would be called between adding the last file to the index and writing content, or possibly be a return value from writing content. TBD. In any case, let me know if that sounds like it will be viable.
If pagefind.addHTMLFile (in the node wrapper) returned the result id, this would solve my main issue
Unfortunately this one isn't possible without some more changes. At present, the IDs aren't allocated until the conclusion of indexing, so they aren't known at the point of responding to any of the `add*` functions.
Hi @bglw. Thank you for listening to our issues :)
In my use case, the best solution would be to have the record hash directly returned by `addCustomRecord()`.
Hmm, well that needs some more thought.
Just to rattle off some thoughts, for context and for myself:
Pagefind uses fairly short page IDs, to reduce the size of the metadata it needs to load up front. The downside of this is that collisions can and do happen, so the IDs are allocated at the end of the indexing, and pages will adjust their hash if it would collide. One goal for this is that both pages should adjust, which means the ID of a page may need to change after it has been allocated.
So the big issue is that until all files have been indexed, we don't know how short to make the page ID.
The primary purpose of these hashes is to solve caching issues when the index changes after a build, so I'm hesitant to change the strategy too significantly.
One idea that might work would be to adopt a git-ish concept of short and long IDs, and return the long ID from the `add*` functions. So your response would come back with a record hash like `en_11badb2e36d2246bc6756b4a2f38479d3893692`. Ultimately that page will be stored as `en_11badb2`, or `en_11badb2e`, or maybe even `en_11badb2e3`; in any case it'll be a prefix of the full page hash. Pagefind would then allow you to supply a full-length page hash and it'll find the relevant fragment.
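To make the lookup half of this idea concrete, here is a small sketch of resolving a long ID to whichever short ID was finally stored. `resolveFragmentId` and the second sample ID are hypothetical, and it assumes the allocation step guarantees that no stored short ID is a prefix of another:

```javascript
/**
 * Find the stored short ID that matches the given ID.
 * Accepts either a short ID (returned unchanged) or a long ID
 * (matched by prefix). Returns null when nothing matches.
 */
function resolveFragmentId(shortIds, id) {
  if (shortIds.has(id)) return id;
  for (const shortId of shortIds) {
    if (id.startsWith(shortId)) return shortId;
  }
  return null;
}

// Short IDs as they might end up after indexing (the second one is made up).
const shortIds = new Set(["en_11badb2e", "en_7f3c91a"]);

const resolved = resolveFragmentId(
  shortIds,
  "en_11badb2e36d2246bc6756b4a2f38479d3893692"
);
```

The linear scan is only for readability; a sorted list or a trie would do the same job for a large index.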
With that:
- `addCustomRecord()` would return:
    {
      uniqueWords: 1234,
      url: "....",
      meta: { /* ... */ },
      long_id: "en_11badb2e36d2246bc6756b4a2f38479d3893692"
    }
- The `loadFragment` function would accept either a long ID or a short ID

Hi @bglw
Given the ID de-duplication restrictions you mention, it's fine for my use case to leave `addHTMLFile` as is in favor of this new `index.getIndexCatalogue()`. So long as I can, somewhere at compile time, determine which ultimate result `id` matches which result `url`, it's fine if this happens at the end of indexing.
Cheers.
Thank you for your detailed answer. I understand the issue you are facing and why the ID is not already returned on record creation. Your solution would work but I see two issues:
I would rather suggest, if possible, to check the ID availability (and regenerate it if duplicate) at creation time. However, that's fine, I can use `getIndexCatalogue()` too!
Edit: I'm thinking of the following solution that would address our use-cases more directly. In my case, I have a map with points. I need to know the location of all the points (with and without filters), but I only need the location. Currently, I am relying on a pre-generated JSON file to retrieve the location from the fragment ID, without having to fetch each fragment individually.
This issue could be resolved with the combination of:
As a user story, my initial developer ergonomics expectation was that `search(null)` would return `{ results }` that each had some `id` that pointed "back" to the indexed content given. Pagefind calls this the `url`, I think, but the `url` is inside the fragment. What `{ results }` contains is an `id`, which, because it contains the word "unknown", led me to believe I was supposed to supply the `id` somewhere for each Pagefind indexed item.
After some digging, I now understand that `id` points "forward" to the fragment henceforth to be loaded by `data()`.
One possibility, perhaps too piecemeal a change, is including `url` in `{ results }` to open up "backward" referencing to arbitrary data. Personally, I think `getIndexCatalogue()` is the better alternative since, although a "fragment id to url lookup data-structure" is being transferred to the client in either case, the `getIndexCatalogue()` approach is an opt-in cost.
@julbd, `getIndexCatalogue()` alone solves my page load issue (as well as fast filtering).
@bglw however, once that's solved, arbitrary (build time) fragment splitting could indeed massively reduce my UI's search times. For hundreds of indexed items I only need `{ id, excerpt }` from `{ sub_results }`, but for that I'm loading every word in the index document too, since it's all in the same fragment. Currently I'm amortising this cost by loading fragments for visible results first, then loading the out-of-viewport stuff afterwards.
I think it's fair to say that loading hundreds of fragments could be considered out of scope for Pagefind. I'm sure 95% of applications are paginating results. Also fragment splitting sounds like a major rewrite of core functionality.
Anyway, just food for thought. And thanks for helping. `getIndexCatalogue()` is probably enough for me. Cheers.
@julbd
you'll have the same bandwidth issue with long ID
Correct! That's the limitation. For people loading them all into a client-side bundle, the recommendation would be to use the indexCatalogue to look up the corresponding short hash, but at that point you may as well just rely on the indexCatalogue for everything.
From my side, I'll continue with the indexCatalogue idea and we'll see how it goes, but we can revisit the idea of returning hashes while indexing if it seems crucial!
I would rather suggest, if possible, to check the ID availability (and regenerate it if duplicate) at creation time.
The main blocker here is that:
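Importantly for the second one, playing through a scenario:
- Page A has the full hash `abcfm` and Page B has the full hash `abcrt`.
- Build 1: Page A takes `abc`. Page B tries to take `abc`, fails, and instead takes `abcr`.
- Build 2: Page B takes `abc`, and Page A takes `abcf`.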
Now if any user has the hash fragment for `abc` still cached from build 1, but they search using build 2, they'll get the fragment for Page A but it should be for Page B. Hence, the ideal situation is both pages change due to the clash, and take the IDs `abcf` and `abcr`.
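The "both pages change" rule can be sketched as picking, for every page, the shortest prefix that no other page shares. This is an illustration under stated assumptions, not Pagefind's actual allocation code; `allocateShortIds` and the minimum length of 3 are made up to match the toy hashes above:

```javascript
/**
 * Give every full hash the shortest prefix (at least `minLength` chars)
 * that no other full hash starts with. Because each page is compared
 * against all others, clashing pages all lengthen their IDs together,
 * and the result does not depend on indexing order.
 */
function allocateShortIds(fullHashes, minLength = 3) {
  const ids = new Map();
  for (const hash of fullHashes) {
    const clashes = (len) =>
      fullHashes.some(
        (other) => other !== hash && other.startsWith(hash.slice(0, len))
      );
    let length = Math.min(minLength, hash.length);
    while (length < hash.length && clashes(length)) length++;
    ids.set(hash, hash.slice(0, length));
  }
  return ids;
}

const firstOrder = allocateShortIds(["abcfm", "abcrt"]);
const secondOrder = allocateShortIds(["abcrt", "abcfm"]);
```

Both orders map `abcfm` to `abcf` and `abcrt` to `abcr`, so neither page ever holds the bare `abc` that a stale cache could confuse. The cost is that this can only run once every file has been indexed, which is the constraint described earlier in the thread. The pairwise comparison is quadratic and only meant to show the rule.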
We are getting into micro-optimizations here! But these are also all scenarios that have been encountered with Pagefind in practice.
Fragments "packs"
This is an interesting idea! I like it. It feels tangential to this issue, would you mind opening a new one for that? :)
@marcuswhybrow
because it contains the word "unknown", lead me to believe I was supposed to supply the id somewhere for each Pagefind indexed item
Ah, the `unknown` prefix there is actually the language! Normally you would see the ID as `en_...` or `fr_...`. If you have no language attribute on your HTML element, you get the `unknown_` language prefix (and WebAssembly).
(Side point, I'd recommend setting the language! In the unknown language you'll miss out on some word stemming)
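As a small illustration of that ID layout, the language can be read back off a result ID. `languageOfId` is a hypothetical helper, and it assumes the hash part after the last underscore never contains an underscore itself:

```javascript
/** Return the language prefix of a result ID, or null if there is none. */
function languageOfId(id) {
  const separator = id.lastIndexOf("_");
  return separator === -1 ? null : id.slice(0, separator);
}

// An ID starting with "unknown_" means the page had no lang attribute.
const hasNoLanguage = languageOfId("unknown_11badb2") === "unknown";
```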
in either case, the getIndexCatalogue() approach is an opt-in cost
Agreed! The URL not being returned is quite intentional, so I'd be resistant to adding it. (Currently all IDs are loaded up front with Pagefind, and loading the URLs at the same time would start getting heavy). I like that the indexCatalogue concept gives an extension to some of these niche use cases where it's needed without impacting the base case for bandwidth.
arbitrary (build time) fragment splitting
Can you elaborate? The two ways I can read this are:
I'm sure 95% of applications are paginating results
Correct! Or my favorites use an IntersectionObserver to load the fragment when the result enters the viewport :)
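A minimal sketch of that IntersectionObserver pattern. `lazyLoadResult` and the `render` callback are hypothetical names; the observer constructor is passed in as a parameter only so the function can be exercised outside a browser:

```javascript
/**
 * Load a result's fragment the first time its element enters the viewport.
 * `result` is a Pagefind search result with a data() method;
 * `render(element, fragment)` is a caller-supplied callback.
 */
function lazyLoadResult(element, result, render, Observer = globalThis.IntersectionObserver) {
  const observer = new Observer((entries) => {
    for (const entry of entries) {
      if (!entry.isIntersecting) continue;
      // Stop watching first so each fragment is requested only once.
      observer.unobserve(entry.target);
      result.data().then((fragment) => render(entry.target, fragment));
    }
  });
  observer.observe(element);
  return observer;
}
```

In a page this would be called once per result element, with `render` filling in the title, excerpt and link from the loaded fragment.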
@bglw
the unknown prefix there is actually the language
Cheers.
I've sent you PR #719 with two minor additions to the getting started docs re the `lang` attribute and its relationship to result ids. I think the PR prevents my bad interpretation for other new users.
arbitrary (build time) fragment splitting
Can you elaborate?
Option 2 (+ extras): Multiple fragments for each indexed file:
- `data()` call (which can be cached).
- `result.data("subset-name")` to load a subset of fragment fields.
- `index.defineFragmentSubset("subset-name", fragment => {})` to generate an arbitrary number of "fragment subset" files for each indexed document during `index.writeFiles`.
- `data("subset-name")` could replace all calls to `data()`, reducing overall bandwidth.
- `data("subset-a")` and then later `data("subset-b")`.
- `data()` would still be available and perform as is, perfect for users who haven't specified any subsets.
- `data` without a subset name.

This opt-in generation of fragment subsets would allow users to make their own trade-offs between the number of HTTP requests required and (even further) reduced bandwidth (reduced search times).
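To illustrate the build-time half of this proposal, here is a sketch of projecting one full fragment into named subsets. Neither `defineFragmentSubset` nor `data("subset-name")` exists in Pagefind; `buildFragmentSubsets`, the `listing` subset and the sample fragment values are all hypothetical:

```javascript
/**
 * Apply each named projection to a full fragment and return one
 * reduced object per subset name. In the proposal, each of these
 * would be written out as its own file next to the full fragment.
 */
function buildFragmentSubsets(fragment, subsetDefinitions) {
  const files = {};
  for (const [name, project] of Object.entries(subsetDefinitions)) {
    files[name] = project(fragment);
  }
  return files;
}

// Made-up fragment for illustration.
const fullFragment = {
  url: "/docs/intro/",
  content: "Every word of the page ...",
  meta: { title: "Intro" },
};

// A "listing" subset that keeps only what a results list needs.
const subsets = buildFragmentSubsets(fullFragment, {
  listing: (f) => ({ url: f.url, title: f.meta.title }),
});
```

The `listing` subset drops the full page content, which is where the bandwidth saving in the proposal would come from.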
I'm not overly familiar with the Pagefind code-base itself, so take my idea with a pinch of salt, but that's my conceptualisation of @julbd's idea.
my favorites use an IntersectionObserver
I think I'll give that a go!
Is it possible to directly load a fragment for a hash that was not obtained by the search? Something like a public version of the `loadFragment` function: https://github.com/CloudCannon/pagefind/blob/main/pagefind_web_js/lib/coupled_search.ts#L234 ?