Incrementally updating static search bundles?

rauschma commented 2 years ago

I’m really impressed by Pagefind. I’ve been wishing/looking for this kind of functionality for years.

I’d like to integrate Pagefind into a static site generator (SSG) that I’m working on (it’s not yet public):

That SSG has an incremental mode where it detects which input files have changed since the last generation and only re-generates output files affected by those input files.
Is it conceivable for Pagefind to support a similar mode?
- In my case, I’d love to have a Node.js API that I can call and provide with an Array of paths of files to re-index.
- If a Node.js API is not in the cards, I can live with invoking the CLI via a child process.

bglw commented 2 years ago

👋 Hello @rauschma!

Interesting! I can definitely see a couple of ways this could work, I have a few questions about what your ideal state would look like:

Would you want Pagefind to have its own incremental detection/handling, or would you be passing Pagefind a list of files to index that have been changed, and Pagefind should ignore the rest?
- i.e. npx pagefind --incremental vs npx pagefind --reindex-files <PATH,PATH,PATH>
Would you be wanting Pagefind to try and incrementally modify its built index, or would you be fine with Pagefind re-generating its index files?
- (In its current state, adding a new unique word to the index will shift and change all index chunks)

Regarding the Node.js API — that isn't out of the question (it would just be abstracting away the CLI-via-child-process step). The current pagefind npm package already does the postinstall work of installing the CLI binary, so it would be fairly trivial to add a node-facing API to that package that uses the same binary.

Love the sound of a new incremental SSG, happy to help shape Pagefind into a useful state for it! 🙂

rauschma commented 2 years ago

Would you want Pagefind to have its own incremental detection/handling, or would you be passing Pagefind a list of files to index that have been changed, and Pagefind should ignore the rest?

In my case, I’d like to tell Pagefind which files have changed (because the SSG detects that).
pagefind --incremental may be useful for others.

Would you be wanting Pagefind to try and incrementally modify its built index, or would you be fine with Pagefind re-generating its index files?

Whichever works for you! Pagefind could also initially regenerate everything and later optimize incremental generation (if possible). Users of Pagefind should not notice the difference(?)

In incremental mode, I’ll probably copy files that have changed and delete files that have disappeared. This approach should adapt to however Pagefind works.

Regarding the Node.js API — that isn't out of the question

If you have the time to do that then that’d be great! You having control over such an API also gives you the option to do more with it later.

Love the sound of a new incremental SSG, happy to help shape Pagefind into a useful state for it! 🙂

Cool, I’ll keep you posted!

rauschma commented 2 years ago

On second thought – 3 kinds of files exist and should be supported for incremental updates:

Changed files
- New files
Deleted files

bglw commented 2 years ago

Ah, yes, good call on the deleted files.

chrisspen commented 1 year ago

I'm also very interested in this. For static sites of anything larger than toy blogs, Pagefind just isn't scalable. Having to run npx pagefind across a hundred thousand files every time a one of them gets updated or a new one gets added currently takes forever.

And even then, Pagefind doesn't support reading gzipped files, which is a requirement at that scale. I've found a workaround involving named pipes, but it's painfully slow and requires mirroring your build folder. It's not a solution that would work in a production environment.

bakkot commented 1 year ago

Seconding this.

For my use case, I'd also want updates to minimize changes to the index, because re-uploading the whole index is expensive. I'm looking to use this for a logbot, which updates every few minutes; I definitely don't want to re-index all historical logs every time.

(Right now I'm using sql.js-httpvfs and splitting the database into fixed-size chunks; since sql is already optimized to ensure the database doesn't change more than necessary, most updates leave most chunks untouched.)

chrisspen commented 1 year ago

I've effectively worked around this by using the sharding functionality and partitioning my files chronologically. I segment files based on created date into the last week, last month, last year, and then one index per year afterwards. Then I point pagefind at all those indexes to search everything.

For my application, since I'm only usually creating files in the last week, or sometimes the last month, those are all I update and re-upload.

As dates roll around, I eventually update the older indexes, but that's relatively infrequent.

bglw commented 1 year ago

Nothing yet to report here, but I've had some recent conversations about this that are promising.

When implemented, it will likely be in a state where a full rebuild will be encouraged at some frequency. Since the incremental mode will aim to not change the chunk boundaries, over time some of the chunks might grow beyond an acceptable size, and a full rebuild to re-shard everything would be needed.

It does seem very feasible that we can do an incremental build that re-parses the indexes and adds some new pages into the existing chunks, though, so there is a good path forward here

bglw commented 1 year ago

Hey @rauschma just checking in. This work is bubbling closer to the surface, along with a possible Node API, so I'm starting to at least scope it out a bit more.

Is this something you still have a vested interest in? Any new thoughts or requirements from the above discussion?

rauschma commented 1 year ago

Hi @bglw! Yes, I’m still very much interested in this! But it may take a while until I have time to add this functionality to my static site generator, which is why I don’t currently have any requirements.

bglw commented 1 year ago

@bakkot as I'm working on this further I'll note that it likely won't solve much of the issue around minimizing changes to the index.

Since the indexes are sharded alphabetically on the content, adding a page to the index will likely touch many or most of the shards, which will cause them to get new content/filenames. The existing page fragment files won't change, which are the most numerous, but the index files will likely need to be (mostly) reuploaded, even in incremental mode.

rauschma commented 1 year ago

Thanks! That makes sense. I did a web search for full text search indices that can be updated incrementally and found these papers:

But I don’t know how well these ideas work for this indexer (given that appending to files doesn’t work here).

bglw commented 1 year ago

Removing this from the 1.0 epic — the NodeJS API will be stabilized in 1.0 but incremental has not yet been worked on.

It's still on my list to do some benchmarking and see how we can do incremental with Pagefind's architecture, and whether that would even be any faster than just re-indexing anyway.

holyjak commented 11 months ago

I would also be very interested in this, and having it exposed via the Node API. I currently am forced to use the api because I am indexing custom files. I currently have ± 1k files and indexing them this way takes 3 minutes. So incremental would help a lot.

CloudCannon / pagefind

Incrementally updating static search bundles? #71