Closed basham closed 1 year ago
Update: I was able to optimize my code. Now, the 8 seconds of loading is down to 100ms. The slowdown happened because when I called queryDocs()
, I extracted only the paths I needed from that query. Then I passed those paths to another function that got and parsed the individual docs at those paths. This resulted in 686 calls to getLatestDocAtPath()
. Now, I pass the paths and the docs from the original query to the parsing function, to skip the calls to the getLatestDocAtPath()
. Again, ideally, getLatestDocAtPath()
should be constant time, so this sort of optimization shouldn't be as noticeable.
These optimizations should also improve saving documents into the replica. Within Replica.set()
, there is a call for getLatestDocAtPath()
and getAllDocsAtPath()
. Both of those should return at constant time, but the underlying queryDocs()
needs to be optimized for that to happen.
I have a script that imports documents into a replica, as a way to migrate data away from an old system. This is nearly 9,000 records, and it takes about 3 minutes to complete. I'd imagine some of this time could be reduced by making these optimizations.
Thank you for these pointers! Some of them seem like they won't be difficult to apply.
Glad you were able to optimise in the meantime. I've had to do similar things to get the letterbox layer to be more performant.
I've made optimisations 1, 3, and 4, and seen a test which ingests 8000 documents into a memory replica go from 42 seconds to 2 seconds. Optimising getLatestDocAtPath
and getAllDocsAtPath
seem to have made the biggest difference there. :)
Still figuring out #2
The thing about not specifying order not ordering anything would be a breaking change. Right now when no order is provided, it falls back to path ASC
.
There is a big improvement: the different between sorting by path ASC
and not sorting is 1ms vs 12 microseconds.
But I'm wondering which way to go:
NONE
option.Those are some amazing results!
My recommendation:
NONE
option, and clearly state whatever default you settle on in the documentation. You can do that in a minor release.ASC
, then just wait to do that particular change with an upcoming major release. Making it NONE
by default feels like it makes fewer assumptions about user-land, so that seems like it may be the right thing to do from a library perspective.
I'm profiling my browser app with the latest Earthstar 9.3.2. I have over 8,000 docs in a replica, and I'm noticing lots of performance issues with ReplicaDriverMemory.queryDocs(). On a single page, it takes about 8 seconds to run a loading script. I found some ways to optimize my code, but there are a lot of things that Earthstar can do to improve this as well:
{ historyMode: 'latest' }
, it calls_getLatestDocs()
. This function loops through all the documents, returning a new array. Instead, cache this result in a Map, like is done withdocByPathAndAuthor
anddocsByPathNewstFirst
. Update this Map duringupsert()
. That will make_getLatestDocs()
as efficient as_getAllDocs()
, when there are multiple calls toqueryDocs()
in the same session.orderBy
option was not explicitly declared in the query options. As a result, calls todocComparePathASCthenNewestFirst
resulted in a third of 8 seconds of computational time during this loading sequence. A particular sort based on the path may not be needed, because the sorting may be handled later or may be based on the doc contents. Sorting may be desired when also using thelimit
option, but this seems like a user concern. Sorting should be skipped if the sort option is not explicitly declared.path
, like in the case ofReplica.getLatestDocAtPath()
. Also, in this case, if you make a Map of the latest docs (as suggested in Item 1), then you can immediately return the doc in question in constant time (O(1)
). No need for loops anywhere in that query.O(n log n)
complexity, while filtering isO(n)
. It would be more efficient to Filter→Sort→Limit, than Sort→Filter→Limit.