A lot of that time is probably also the new documents being signed and ingested, too.
Ingestion in particular takes longer because documents are ingested one-by-one, causing backpressure. The motivation for this is that ingestion was basically impossible to reason about if documents were being written concurrently: we need to know things like what the previous latest version of the doc was.
This is also a reason we can't do bulk upserts: during ingestion we need to read the latest documents from the replica. If we happened to be bulk upserting several documents with the same path, we'd get back unexpected results.
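To make that constraint concrete, here is a minimal sketch of serialised ingestion. `Doc` and `SerialIngester` are hypothetical names, not Earthstar's actual API; the point is the promise chain, which guarantees each ingest sees the replica state left by the previous one.

```typescript
// Illustrative sketch only: `Doc` and `SerialIngester` are hypothetical names,
// not Earthstar's actual API. Each ingest waits for the previous one, so the
// "previous latest version" lookup always sees a settled replica state.
type Doc = { path: string; timestamp: number; content: string };

class SerialIngester {
  // Stand-in for the replica's per-path latest-version index.
  private latest = new Map<string, Doc>();
  // Tail of the chain; every new ingest is appended here.
  private tail: Promise<unknown> = Promise.resolve();

  ingest(doc: Doc): Promise<boolean> {
    const result = this.tail.then(() => this.ingestOne(doc));
    // Keep the chain alive even if an ingest rejects.
    this.tail = result.catch(() => undefined);
    return result;
  }

  private async ingestOne(doc: Doc): Promise<boolean> {
    const prev = this.latest.get(doc.path);
    // A doc older than the one we already hold is not ingested.
    if (prev && prev.timestamp >= doc.timestamp) return false;
    this.latest.set(doc.path, doc);
    return true;
  }
}
```

Bulk upserting several documents with the same path would bypass exactly this `prev` check, which is the hazard described above.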
These big imports you do might just need to be something you need to grab a cup of tea for.
I have numbers now. They're pretty damning!
Apparently the Sqlite lib we use for Deno is known to have slow performance, especially on a Mac:
https://github.com/dyedgreen/deno-sqlite/issues/179 https://github.com/dyedgreen/deno-sqlite/issues/165
There's another Sqlite library which uses native bindings (https://github.com/denodrivers/sqlite3), but FFI in Deno is behind the `--unstable` flag. I don't want to require all Earthstar users on Deno to use that flag, so the best we could do is offer it as an alternative Sqlite driver. Which I think wouldn't be too hard to add.
I'm doing some performance measurements on my import script, regarding `Replica.set()`. I'm finding some interesting results using different drivers and document counts. Each row in this table represents the worst-timed results in my samples; most of the time, subsequent runs executed slightly faster.
| Driver | Doc count | Total `set()` time | Avg `set()` time |
|---|---|---|---|
| Memory | 370 | 1048ms | 3ms |
| Sqlite | 370 | 5600ms | 15ms |
| Memory | 1056 | 3420ms | 3ms |
| Sqlite | 1056 | 14970ms | 14ms |
| Memory | 8888 | 267720ms | 30ms |
| Sqlite | 8888 | 177942ms | 19ms |
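For reference, the shape of the measurement is roughly this sketch (my own, not the actual import script): time a sequence of writes and report the total and average. The `set` parameter stands in for `Replica.set()`, so the sketch is self-contained.

```typescript
// Rough sketch of the measurement: time each write sequentially (matching
// one-by-one ingestion) and report total and average. `set` stands in for
// Replica.set().
async function timeSets(
  docs: { path: string; content: string }[],
  set: (doc: { path: string; content: string }) => Promise<void>,
): Promise<{ totalMs: number; avgMs: number }> {
  const start = performance.now();
  for (const doc of docs) {
    await set(doc);
  }
  const totalMs = performance.now() - start;
  return { totalMs, avgMs: totalMs / docs.length };
}
```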
Thoughts:

- `queryDocs()` returns paths in constant time (#255).
- `queryDocs()` may need to be delayed until an active throttle/transaction completes, so the new docs can be utilized in the results.
- `docsByPath.filter()` and `docsByPath.sort()` would result in effectively no execution time. So, I don't know why this scaling issue is happening.

> These big imports you do might just need to be something you need to grab a cup of tea for.
Because this is a lengthy process, there should be some easy way to monitor the status of any sync operation. For example, the user should know that there are 800 documents being downloaded and 120 of them have been fully processed.
I don't mind a lengthy import script, as that's a one-time process. But I do mind that it will take a long time to sync thousands of documents (that total up to just a few MBs) in browser clients.
> Ingestion in particular takes longer because documents are ingested one-by-one, causing backpressure.
I understand that there needs to be more caution when ingesting documents into an already populated replica. However, I'm more concerned about the time required for the initial (big) sync than subsequent syncs. If you sync documents into an empty replica (the initial sync), couldn't Earthstar just ignore the ingestion precautions and go straight to the upsert? And if the upserts are throttled into database transactions, that could also save time. That would effectively be like importing a database from a SQL dump file.
It just seems like there's got to be a way to ingest documents at a faster rate than the network can send them. I am just not confident where the most costly parts of the process are, to know where to focus. I'm happy to help research this problem. I just need guidance.
> Because this is a lengthy process, there should be some easy way to [monitor the status of any sync operation](https://github.com/earthstar-project/replica-server/issues/10). For example, the user should know that there are 800 documents being downloaded and 120 of them have been fully processed.
Totally. This is going to be a lot easier for me to implement soon, and the next major version will have something like this.
> If you sync documents into an empty replica (the initial sync), couldn't Earthstar just ignore the ingestion precautions and go straight to the upsert?
This would skip the previous document checks, but it would also skip document validation, which would open the door to invalid documents being stored in the replica. You could upsert a document claiming to be written by someone else, or change the contents of a doc written by someone else. For me this is too big of a harassment vector to even contemplate.
> And if the upserts are throttled into database transactions, that could also save time.
I'm afraid the existing state of a replica matters too much when setting documents for this to be possible.
When you set a document:
All of these things would be way, way harder to do if upserts were batched. And these things have to be done for Earthstar to work the way it does.
I think the best way to proceed is to make the current process faster, with faster crypto and replica drivers, and optimisations like the ones you've helpfully suggested in #255.
We have good leads for the drivers, too. I've already got a new crypto driver for Deno which makes signing and validation 15x faster. I think it's worth shipping an unstable FFI sqlite driver to see what the speed benefits are, too.
> I think the best way to proceed is to make the current process faster, with faster crypto and replica drivers, and optimisations like the ones you've helpfully suggested in https://github.com/earthstar-project/earthstar/issues/255.
Thanks for outlining your concerns, providing guidance on where to focus, and taking time to explore options. Feel free to close this issue.
Just returning quickly to add something to this: I tried putting together a Sqlite FFI driver.
The good news:
and the bad news:
You can't win sometimes...
Wow, that's unfortunate. I would have expected the C-based sqlite3 to be faster than the WASM-based sqlite library. I wonder if the slowdown is happening in the Deno FFI and if the binding performance can be improved in future Deno updates.
It has now been suggested to me twice that I use WASM sqlite to read and FFI to write, which sounds completely insane, but now I'm curious whether it'd even work.
Some positives though...
> I think the best way to proceed is to make the current process faster, with faster crypto and replica drivers, and optimisations like the ones you've helpfully suggested in #255.
I've been researching this week, and I came across the absurd-sql project. It is described in the article A future for SQL on the web. It provides a persistence layer for sqlite in the browser, through IndexedDB.
It implements a backend for sql.js (sqlite3 compiled for the web) that treats IndexedDB like a disk and stores data in blocks there. That means your sqlite3 database is persisted. And not in the terrible way of reading and writing the whole image at once -- it reads and writes your db in small chunks.
You'd think that the performance of such an abstraction would be slow. But actually "it consistently beats IndexedDB performance up to 10x." The article elaborates on the techniques it uses to make this possible.
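As a toy illustration of the block-addressing idea (my own sketch, not absurd-sql's actual code): a byte range in the database file maps onto a run of fixed-size blocks, each of which would be stored under an integer key in IndexedDB, so a read or write only touches the blocks it overlaps.

```typescript
// Toy sketch of block addressing (assumptions, not absurd-sql's code):
// a byte range in the "file" maps to a run of fixed-size blocks, each of
// which would be stored under an integer key in IndexedDB.
const BLOCK_SIZE = 4096;

function blockRange(offset: number, length: number) {
  const first = Math.floor(offset / BLOCK_SIZE);
  const last = Math.floor((offset + length - 1) / BLOCK_SIZE);
  return {
    first, // index of the first block the range touches
    last, // index of the last block the range touches
    startWithinFirst: offset % BLOCK_SIZE, // where the range begins in it
  };
}
```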
I really like that it provides a way to avoid loading the entire database into memory, unlike ReplicaDriverIndexedDB (if I'm recalling correctly).
I wonder if there are any approaches that could be adopted into Earthstar based on this work.
Yeah, I've been hoping for this approach to develop further. There's experimental support for this in the Deno Sqlite module we use, too: https://github.com/dyedgreen/deno-sqlite#browser-version-experimental
Hurdles with this approach which I've noticed:
The next milestone will have a 'real' IndexedDB driver which does not store stuff in memory, but will be slower. Whether some creative solutions need to be found depends on how much slower.
Here's where replica performance stands now that the development cycle on v10 is closing:
Sqlite FFI querying performance has improved so it is at parity with non-FFI Sqlite.
Strangely enough, write performance for the non-FFI version has improved so that it's at parity with FFI Sqlite.
So they're... just about the same now?
I have a script that imports documents into a replica (Earthstar v9.3.2), as a way to migrate data away from an old system. This is nearly 9,000 records (5.4 MB), and it takes about 3 minutes to complete. In addition to the optimizations I've already reported, I wonder if there's a way to upsert docs in bulk.
I've also noticed that when syncing this replica from the local Deno import script to a local Deno replica server, it also takes about the same 3 minutes. So, I suspect the bottleneck in this particular case is saving the documents, not syncing them.
This feature is something that would need to be done for each replica driver.

- `ReplicaDriverIndexedDB.upsert()` could be debounced (as is mentioned in the comments) and then bulk added in a single transaction. The `idb-keyval` project does this with its `setMany()` function. That could be used as a template for how to solve this for that driver.
- `ReplicaDriverSqlite.upsert()` could likely be approached in a similar way: debounce, accumulate the documents, and use a transaction. The `better-sqlite` project (which Earthstar uses) shows how to do this with its `transaction()` function.

Perhaps one thing to look out for (in case it is an issue): how do debounced upserts work alongside `queryDocs()`? If a doc or collection of docs is being ingested and upserted, should calls to `queryDocs()` be delayed until that is complete? Maybe this sequencing is more of a user-land concern, though.

Also, instead of debouncing the documents, consider throttling them. With debouncing, you accumulate until there's a pause, which could result in heavy memory usage or unsaved documents if the process is long-running or heavily used. With throttling, you would upsert a collection of documents at regular intervals. This seems like a safer way to handle it.