rauschma opened this issue 2 years ago
👋 Hello @rauschma!
Interesting! I can definitely see a couple of ways this could work, I have a few questions about what your ideal state would look like:
npx pagefind --incremental
vs npx pagefind --reindex-files <PATH,PATH,PATH>
Regarding the Node.js API — that isn't out of the question (it would just be abstracting away the CLI-via-child-process step). The current pagefind
npm package already does the postinstall work of installing the CLI binary, so it would be fairly trivial to add a node-facing API to that package that uses the same binary.
Love the sound of a new incremental SSG, happy to help shape Pagefind into a useful state for it! 🙂
Would you want Pagefind to have its own incremental detection/handling, or would you be passing Pagefind a list of files to index that have been changed, and Pagefind should ignore the rest?
pagefind --incremental
may be useful for others. Would you be wanting Pagefind to try to incrementally modify its built index, or would you be fine with Pagefind re-generating its index files?
Whichever works for you! Pagefind could also initially regenerate everything and later optimize incremental generation (if possible). Users of Pagefind should not notice the difference(?)
In incremental mode, I’ll probably copy files that have changed and delete files that have disappeared. This approach should adapt to however Pagefind works.
Regarding the Node.js API — that isn't out of the question
If you have the time to do that then that’d be great! You having control over such an API also gives you the option to do more with it later.
Love the sound of a new incremental SSG, happy to help shape Pagefind into a useful state for it! 🙂
Cool, I’ll keep you posted!
On second thought – 3 kinds of files exist and should be supported for incremental updates:
Ah, yes, good call on the deleted files.
I'm also very interested in this. For static sites any larger than toy blogs, Pagefind just isn't scalable. Having to run npx pagefind
across a hundred thousand files every time one of them gets updated or a new one gets added currently takes forever.
And even then, Pagefind doesn't support reading gzipped files, which is a requirement at that scale. I've found a workaround involving named pipes, but it's painfully slow and requires mirroring your build folder. It's not a solution that would work in a production environment.
Seconding this.
For my use case, I'd also want updates to minimize changes to the index, because re-uploading the whole index is expensive. I'm looking to use this for a logbot, which updates every few minutes; I definitely don't want to re-index all historical logs every time.
(Right now I'm using sql.js-httpvfs and splitting the database into fixed-size chunks; since sql is already optimized to ensure the database doesn't change more than necessary, most updates leave most chunks untouched.)
I've effectively worked around this by using the sharding functionality and partitioning my files chronologically. I segment files based on created date into the last week, last month, last year, and then one index per year afterwards. Then I point pagefind at all those indexes to search everything.
For my application, since I'm only usually creating files in the last week, or sometimes the last month, those are all I update and re-upload.
As dates roll around, I eventually update the older indexes, but that's relatively infrequent.
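The chronological partitioning rule described above can be sketched as a small pure function. The bucket names and thresholds are this comment's own convention, not anything built into Pagefind:

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Map a file's created date to an index bucket: "week", "month", "year",
// or the four-digit year for anything older. `nowMs` is injectable so the
// function is deterministic and testable.
function bucketFor(createdMs, nowMs) {
  const age = nowMs - createdMs;
  if (age <= 7 * MS_PER_DAY) return "week";
  if (age <= 30 * MS_PER_DAY) return "month";
  if (age <= 365 * MS_PER_DAY) return "year";
  return String(new Date(createdMs).getUTCFullYear());
}
```

Each bucket gets its own Pagefind index; only the buckets containing new or changed files need re-indexing and re-uploading.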
Nothing yet to report here, but I've had some recent conversations about this that are promising.
When implemented, it will likely be in a state where a full rebuild will be encouraged at some frequency. Since the incremental mode will aim to not change the chunk boundaries, over time some of the chunks might grow beyond an acceptable size, and a full rebuild to re-shard everything would be needed.
It does seem very feasible that we can do an incremental build that re-parses the indexes and adds some new pages into the existing chunks, though, so there is a good path forward here.
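To illustrate the idea of keeping chunk boundaries fixed, here is a toy version of the assignment step (this is not Pagefind's actual chunking code): each chunk is described by its first key, and a new word lands in the chunk whose alphabetical range contains it, so existing boundaries never move (which is also why chunks can grow unevenly until a full re-shard).

```javascript
// chunkStarts is a sorted list of each chunk's first key, e.g. ["a", "g", "p"].
// Return the index of the last chunk whose start key is <= word, leaving the
// existing chunk boundaries untouched.
function assignToChunk(chunkStarts, word) {
  let chosen = 0;
  for (let i = 0; i < chunkStarts.length; i++) {
    if (chunkStarts[i] <= word) chosen = i;
    else break;
  }
  return chosen;
}
```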
Hey @rauschma just checking in. This work is bubbling closer to the surface, along with a possible Node API, so I'm starting to at least scope it out a bit more.
Is this something you still have a vested interest in? Any new thoughts or requirements from the above discussion?
Hi @bglw! Yes, I’m still very much interested in this! But it may take a while until I have time to add this functionality to my static site generator, which is why I don’t currently have any requirements.
@bakkot as I work on this further, I'll note that it likely won't solve much of the issue around minimizing changes to the index.
Since the indexes are sharded alphabetically on the content, adding a page to the index will likely touch many or most of the shards, causing them to get new content and filenames. The page fragment files, which are the most numerous, won't change, but the index files will likely need to be (mostly) re-uploaded, even in incremental mode.
Thanks! That makes sense. I did a web search for full text search indices that can be updated incrementally and found these papers:
But I don’t know how well these ideas work for this indexer (given that appending to files doesn’t work here).
Removing this from the 1.0 epic — the NodeJS API will be stabilized in 1.0 but incremental has not yet been worked on.
It's still on my list to do some benchmarking and see how we can do incremental with Pagefind's architecture, and whether that would even be any faster than just re-indexing anyway.
I would also be very interested in this, and in having it exposed via the Node API. I'm currently forced to use the API because I'm indexing custom files. I have ±1k files, and indexing them this way takes 3 minutes, so incremental indexing would help a lot.
I’m really impressed by Pagefind. I’ve been wishing/looking for this kind of functionality for years.
I’d like to integrate Pagefind into a static site generator (SSG) that I’m working on (it’s not yet public):