earthstar-project / earthstar

Storage for private, distributed, offline-first applications.
https://earthstar-project.org
GNU Lesser General Public License v3.0

Upsert documents in bulk #256

Closed basham closed 1 year ago

basham commented 2 years ago

I have a script that imports documents into a replica (Earthstar v9.3.2), as a way to migrate data away from an old system. This is nearly 9,000 records (5.4 MB), and it takes about 3 minutes to complete. In addition to the optimizations I've already reported, I wonder if there's a way to upsert docs in bulk.

I've also noticed that syncing this replica from this local Deno import script to a local Deno replica server also takes about the same 3 minutes. So I suspect the bottleneck in this particular case is saving the documents, not syncing them.

This feature is something that would need to be done for each replica driver.

ReplicaDriverIndexedDB.upsert() could be debounced (as is mentioned in the comments) and the accumulated documents then added in bulk in a single transaction. The idb-keyval project does this with its setMany() function, which could serve as a template for solving this in that driver.
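As a rough illustration, the core of that pattern could look something like this (a sketch assuming an already-opened IDBDatabase and object store name, not idb-keyval's actual code):

```ts
// Sketch of the setMany() pattern: one readwrite transaction for the
// whole batch, so IndexedDB commits all the puts together instead of
// once per document. All names here are illustrative.
function setMany(
  db: IDBDatabase,
  storeName: string,
  entries: [IDBValidKey, unknown][],
): Promise<void> {
  return new Promise((resolve, reject) => {
    const tx = db.transaction(storeName, "readwrite");
    const store = tx.objectStore(storeName);
    for (const [key, value] of entries) store.put(value, key);
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
    tx.onabort = () => reject(tx.error);
  });
}
```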

ReplicaDriverSqlite.upsert() could likely be approached in a similar way: debounce, accumulate the documents, and use a transaction. The better-sqlite3 project (which Earthstar uses) shows how to do this with its transaction() function.
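A sketch of what that could look like with better-sqlite3 (the table and column names here are hypothetical, not Earthstar's schema):

```ts
import Database from "better-sqlite3";

const db = new Database("replica.db");
db.exec("CREATE TABLE IF NOT EXISTS docs (path TEXT PRIMARY KEY, doc TEXT)");

const upsert = db.prepare(
  "INSERT OR REPLACE INTO docs (path, doc) VALUES (@path, @doc)",
);

// transaction() wraps the calls in BEGIN/COMMIT, so the whole batch
// costs one disk sync instead of one per document.
const upsertMany = db.transaction((docs: { path: string; doc: string }[]) => {
  for (const d of docs) upsert.run(d);
});

// e.g. the documents accumulated by a debounce/throttle window:
upsertMany([{ path: "/example/1", doc: "{}" }]);
```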

Perhaps one thing to look out for (in case it is an issue): how do debounced upserts work alongside queryDocs()? If a doc or collection of docs is being ingested and upserted, should calls to queryDocs() be delayed until that is complete? Maybe this sequencing is more of a user-land concern, though.

Also, instead of debouncing the documents, consider throttling them. With debouncing, you accumulate until there's a pause, which could result in heavy memory usage or unsaved documents if the replica is long-running or heavily used. With throttling, you would upsert a collection of documents at regular intervals. This seems like the safer way to handle it.
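A minimal sketch of that throttled approach (hypothetical names, not Earthstar's API; flushFn would wrap the driver's transactional bulk write):

```ts
class ThrottledUpserter<Doc> {
  private pending: Doc[] = [];
  private timer: ReturnType<typeof setTimeout> | undefined;

  constructor(
    private flushFn: (docs: Doc[]) => Promise<void>,
    private intervalMs = 250,
  ) {}

  upsert(doc: Doc): void {
    this.pending.push(doc);
    // Throttle, don't debounce: the timer is NOT reset by later calls,
    // so a flush happens at most intervalMs after the first queued doc.
    if (this.timer === undefined) {
      this.timer = setTimeout(() => this.flush(), this.intervalMs);
    }
  }

  async flush(): Promise<void> {
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = undefined;
    const batch = this.pending;
    this.pending = [];
    if (batch.length > 0) await this.flushFn(batch);
  }
}
```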

sgwilym commented 2 years ago

A lot of that time is probably the new documents being signed and ingested, too.

Ingestion in particular takes longer because documents are ingested one-by-one, causing backpressure. The motivation for this is that ingestion was basically impossible to reason about if documents were being written concurrently: we need to know things like what the previous latest version of the doc was.

This is also a reason we can't do bulk upserts: during ingestion we need to read the latest documents from the replica. If we happened to be bulk upserting several documents with the same path, we'd get back unexpected results.
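To illustrate (a sketch, not Earthstar's actual internals): chaining each ingest onto the previous one is what makes "what's the latest doc at this path?" answerable at all.

```ts
type Doc = { path: string; timestamp: number };

class SerialIngester {
  private tail: Promise<unknown> = Promise.resolve();
  private latestByPath = new Map<string, Doc>();

  ingest(doc: Doc): Promise<boolean> {
    const result = this.tail.then(() => this.ingestOne(doc));
    // Keep the chain alive even if an ingest rejects.
    this.tail = result.catch(() => {});
    return result;
  }

  private async ingestOne(doc: Doc): Promise<boolean> {
    // Safe: no other write is in flight while we compare versions.
    const latest = this.latestByPath.get(doc.path);
    if (latest !== undefined && latest.timestamp >= doc.timestamp) {
      return false; // obsolete; nothing happened
    }
    this.latestByPath.set(doc.path, doc);
    return true;
  }
}
```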

These big imports you do might just need to be something you need to grab a cup of tea for.

sgwilym commented 2 years ago

I have numbers now. They're pretty damning!

[Screenshot: benchmark numbers, 2022-05-05]

Apparently the Sqlite lib we use for Deno is known to have slow performance, especially on a Mac:

https://github.com/dyedgreen/deno-sqlite/issues/179
https://github.com/dyedgreen/deno-sqlite/issues/165

There's another Sqlite library which uses native bindings (https://github.com/denodrivers/sqlite3), but FFI in Deno is behind the unstable flag. I don't want to require all Earthstar users on Deno to use that flag, so the best we could do is offer it as an alternative Sqlite driver. Which I think wouldn't be too hard to add.

basham commented 2 years ago

I'm doing some performance measurements on my import script, regarding Replica.set(). I'm finding some interesting results, using different drivers and numbers of docs. Each row in this table represents the worst-timed result in my samples; most of the time, subsequent runs executed slightly faster.

| Driver | Doc count | Total set() time | Avg set() time |
| ------ | --------- | ---------------- | -------------- |
| Memory | 370 | 1048ms | 3ms |
| Sqlite | 370 | 5600ms | 15ms |
| Memory | 1056 | 3420ms | 3ms |
| Sqlite | 1056 | 14970ms | 14ms |
| Memory | 8888 | 267720ms | 30ms |
| Sqlite | 8888 | 177942ms | 19ms |
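(For reference, a sketch of how timings like these can be gathered, assuming Earthstar v9's replica.set(keypair, docToSet) shape; replica, keypair, and docs are set up elsewhere.)

```ts
const timings: number[] = [];
for (const doc of docs) {
  const start = performance.now();
  await replica.set(keypair, { path: doc.path, content: doc.content });
  timings.push(performance.now() - start);
}
const total = timings.reduce((sum, t) => sum + t, 0);
console.log(
  `total: ${Math.round(total)}ms, avg: ${(total / timings.length).toFixed(1)}ms`,
);
```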

Thoughts:

  1. The import script runs in Deno, so I can't test it with IndexedDB, which is browser-only. But because the IndexedDB driver extends the memory driver, the IndexedDB driver's performance should be somewhat worse than the memory driver's.
  2. The Sqlite driver's timing increases slightly at scale. I wonder if the timing will be even more stable (around 14ms to 15ms) once queryDocs() returns paths in constant time (#255).
  3. Perhaps Sqlite could be faster if the upsert queries were throttled and run inside a database transaction. If so, queryDocs() may need to be delayed until an active throttle/transaction completes, so the new docs can be included in the results.
  4. The memory driver eventually becomes slower than the Sqlite driver. However, the memory driver's upsert function doesn't reveal anything that looks concerning. With my import script, all doc paths are unique, so there isn't a history to manage. Given that, I would expect docsByPath.filter() and docsByPath.sort() to take effectively no execution time, so I don't know why this scaling issue is happening.

These big imports you do might just need to be something you need to grab a cup of tea for.

Because this is a lengthy process, there should be some easy way to monitor the status of any sync operation. For example, the user should know that there are 800 documents being downloaded and 120 of them have been fully processed.

I don't mind a lengthy import script, as that's a one-time process. But I do mind that it will take a long time to sync thousands of documents (that total up to just a few MBs) in browser clients.

Ingestion in particular takes longer because documents are ingested one-by-one, causing backpressure.

I understand that there needs to be more caution when ingesting documents into an already populated replica. However, I'm more concerned about the time required for the initial (big) sync than subsequent syncs. If you sync documents into an empty replica (the initial sync), couldn't Earthstar just ignore the ingestion precautions and go straight to the upsert? And if the upserts are throttled into database transactions, that could also save time. That would effectively be like importing a database from a SQL dump file.

It just seems like there has to be a way to ingest documents at a faster rate than the network can send them. I'm just not sure where the most costly parts of the process are, so I don't know where to focus. I'm happy to help research this problem; I just need guidance.

sgwilym commented 2 years ago

Because this is a lengthy process, there should be some easy way to monitor the status of any sync operation (https://github.com/earthstar-project/replica-server/issues/10). For example, the user should know that there are 800 documents being downloaded and 120 of them have been fully processed.

Totally. This is going to be a lot easier for me to implement soon, and the next major version will have something like this.

If you sync documents into an empty replica (the initial sync), couldn't Earthstar just ignore the ingestion precautions and go straight to the upsert?

This would skip the previous document checks, but it would also skip document validation, which would open the door to invalid documents being stored in the replica. You could upsert a document claiming to be written by someone else, or change the contents of a doc written by someone else. For me this is too big of a harassment vector to even contemplate.

And if the upserts are throttled into database transactions, that could also save time.

I'm afraid that the existing state of a replica is too important to the process of setting documents for this to be possible.

When you set a document:

  - The latest existing version of the doc at that path has to be read from the replica.
  - The new doc's timestamp has to be set ahead of that latest version's.
  - The doc has to be signed with the author's keypair and validated.
  - Older versions of the doc at that path have to be found and replaced.

All of these things would be way, way harder to do if upserts were batched. And these things have to be done for Earthstar to work the way it does.
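To make that concrete, here's a sketch of the per-document pipeline (every name is illustrative, not Earthstar's actual API; the point is that each step depends on the replica's current state):

```ts
interface DocInput { path: string; content: string }
interface Doc extends DocInput { author: string; timestamp: number; signature: string }
interface Keypair { address: string; secret: string }
interface ReplicaLike {
  getLatestDocAtPath(path: string): Promise<Doc | undefined>;
  ingest(doc: Doc): Promise<void>;
}
// Assumed externals for the sketch:
declare function sign(keypair: Keypair, doc: Omit<Doc, "signature">): Promise<string>;
declare function validate(doc: Doc): Promise<void>;

async function setDoc(replica: ReplicaLike, keypair: Keypair, input: DocInput) {
  // 1. Read the latest existing version at this path.
  const latest = await replica.getLatestDocAtPath(input.path);
  // 2. Set the new doc's timestamp ahead of the latest version's.
  const now = Date.now() * 1000; // Earthstar timestamps are microseconds
  const timestamp = latest ? Math.max(now, latest.timestamp + 1) : now;
  // 3. Sign with the author's keypair, then validate.
  const unsigned = { ...input, author: keypair.address, timestamp };
  const doc: Doc = { ...unsigned, signature: await sign(keypair, unsigned) };
  await validate(doc);
  // 4. Ingest serially, so the next set() sees consistent state.
  await replica.ingest(doc);
}
```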

I think the best way to proceed is to make the current process faster, with faster crypto and replica drivers, and optimisations like the ones you've helpfully suggested in #255.

We have good leads for the drivers, too. I've already got a new crypto driver for Deno which makes signing and validation 15x faster. I think it's worth shipping an unstable FFI sqlite driver to see what the speed benefits are, too.

basham commented 2 years ago

I think the best way to proceed is to make the current process faster, with faster crypto and replica drivers, and optimisations like the ones you've helpfully suggested in #255.

Thanks for outlining your concerns, providing guidance on where to focus, and taking time to explore options. Feel free to close this issue.

sgwilym commented 2 years ago

Just returning quickly to add something to this: I tried putting together a Sqlite FFI driver.

The good news:

[Screenshot: benchmark results (the good news), 2022-05-18]

and the bad news:

[Screenshot: benchmark results (the bad news), 2022-05-18]

You can't win sometimes...

basham commented 2 years ago

Wow, that's unfortunate. I would have expected the C-based sqlite3 to be faster than the WASM-based sqlite library. I wonder if the slowdown is happening in the Deno FFI and if the binding performance can be improved in future Deno updates.

sgwilym commented 2 years ago

It has now been suggested to me twice that I use WASM sqlite to read and FFI to write, which sounds completely insane, but now I'm curious whether it'd even work.
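A sketch of what that hybrid might look like (interfaces entirely hypothetical; whether two connections to the same file behave sanely around locking and cache coherence is exactly the open question):

```ts
interface SqliteConn {
  query(sql: string, params?: unknown[]): unknown[][];
  execute(sql: string, params?: unknown[]): void;
}

class HybridSqliteDriver {
  constructor(
    private wasmConn: SqliteConn, // e.g. dyedgreen/deno-sqlite
    private ffiConn: SqliteConn, // e.g. denodrivers/sqlite3
  ) {}

  queryDocs(sql: string, params?: unknown[]): unknown[][] {
    return this.wasmConn.query(sql, params); // WASM was faster at reads
  }

  upsert(sql: string, params?: unknown[]): void {
    this.ffiConn.execute(sql, params); // FFI was faster at writes
  }
}
```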

sgwilym commented 2 years ago

Some positives though...

basham commented 2 years ago

I think the best way to proceed is to make the current process faster, with faster crypto and replica drivers, and optimisations like the ones you've helpfully suggested in #255.

I've been researching this week, and I came across the absurd-sql project. It is described in the article A future for SQL on the web. It provides a persistence layer for sqlite in the browser, through IndexedDB.

It implements a backend for sql.js (sqlite3 compiled for the web) that treats IndexedDB like a disk and stores data in blocks there. That means your sqlite3 database is persisted. And not in the terrible way of reading and writing the whole image at once -- it reads and writes your db in small chunks.

You'd think that the performance of such an abstraction would be slow. But actually "it consistently beats IndexedDB performance up to 10x." The article elaborates on the techniques it uses to make this possible.

I really like that it provides a way to avoid loading the entire database into memory, unlike ReplicaDriverIndexedDB (if I'm recalling correctly).

I wonder if there are any approaches that could be adopted into Earthstar based on this work.
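For reference, wiring absurd-sql up looks roughly like this (adapted from its README, so worth double-checking against the project itself; it has to run inside a Web Worker):

```ts
import initSqlJs from "@jlongster/sql.js";
import { SQLiteFS } from "absurd-sql";
import IndexedDBBackend from "absurd-sql/dist/indexeddb-backend";

async function init() {
  const SQL = await initSqlJs({ locateFile: (file: string) => file });
  // Mount IndexedDB as a block-based filesystem for sqlite.
  const sqlFS = new SQLiteFS(SQL.FS, new IndexedDBBackend());
  SQL.register_for_idb(sqlFS);
  SQL.FS.mkdir("/sql");
  SQL.FS.mount(sqlFS, {}, "/sql");
  // The database file now lives in IndexedDB, read and written in chunks.
  const db = new SQL.Database("/sql/db.sqlite", { filename: true });
  db.exec("PRAGMA journal_mode=MEMORY;");
  return db;
}
```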

sgwilym commented 2 years ago

Yeah, I've been hoping for this approach to develop further. There's experimental support for this in the Deno Sqlite module we use, too: https://github.com/dyedgreen/deno-sqlite#browser-version-experimental

Hurdles with this approach which I've noticed:

  - It has to run inside a Web Worker, because it does synchronous, blocking reads and writes.
  - It depends on SharedArrayBuffer, which browsers only enable when the page is served with cross-origin isolation (COOP/COEP) headers.

The next milestone will have a 'real' IndexedDB driver which does not store stuff in memory, but will be slower. Whether some creative solutions need to be found depends on how much slower.

sgwilym commented 1 year ago

Here's where replica performance stands now that the development cycle on v10 is closing:

[Screenshot: replica benchmark results, 2022-12-14]

Sqlite FFI querying performance has improved so it is at parity with non-FFI Sqlite.

Strangely enough, write performance for the non-FFI version has improved so that it's at parity with FFI Sqlite.

So they're... just about the same now?