Performance issues with queryDocs()

basham commented 2 years ago

I'm profiling my browser app with the latest Earthstar 9.3.2. I have over 8,000 docs in a replica, and I'm noticing lots of performance issues with ReplicaDriverMemory.queryDocs(). On a single page, it takes about 8 seconds to run a loading script. I found some ways to optimize my code, but there are a lot of things that Earthstar can do to improve this as well:

With the option { historyMode: 'latest' }, it calls _getLatestDocs(). This function loops through all the documents, returning a new array. Instead, cache this result in a Map, like is done with docByPathAndAuthor and docsByPathNewstFirst. Update this Map during upsert(). That will make _getLatestDocs() as efficient as _getAllDocs(), when there are multiple calls to queryDocs() in the same session.
Docs are sorted every time, even if the orderBy option was not explicitly declared in the query options. As a result, calls to docComparePathASCthenNewestFirst resulted in a third of 8 seconds of computational time during this loading sequence. A particular sort based on the path may not be needed, because the sorting may be handled later or may be based on the doc contents. Sorting may be desired when also using the limit option, but this seems like a user concern. Sorting should be skipped if the sort option is not explicitly declared.
Sorting should also be skipped when the query is being filtered by a particular path, like in the case of Replica.getLatestDocAtPath() . Also, in this case, if you make a Map of the latest docs (as suggested in Item 1), then you can immediately return the doc in question in constant time (O(1)). No need for loops anywhere in that query.
Array sorting is O(n log n) complexity, while filtering is O(n). It would be more efficient to Filter→Sort→Limit, than Sort→Filter→Limit.

basham commented 2 years ago

Update: I was able to optimize my code. Now, the 8 seconds of loading is down to 100ms. The slowdown happened because when I called queryDocs(), I extracted only the paths I needed from that query. Then I passed those paths to another function that got and parsed the individual docs at those paths. This resulted in 686 calls to getLatestDocAtPath(). Now, I pass the paths and the docs from the original query to the parsing function, to skip the calls to the getLatestDocAtPath(). Again, ideally, getLatestDocAtPath() should be constant time, so this sort of optimization shouldn't be as noticeable.

basham commented 2 years ago

These optimizations should also improve saving documents into the replica. Within Replica.set(), there is a call for getLatestDocAtPath() and getAllDocsAtPath(). Both of those should return at constant time, but the underlying queryDocs() needs to be optimized for that to happen.

I have a script that imports documents into a replica, as a way to migrate data away from an old system. This is nearly 9,000 records, and it takes about 3 minutes to complete. I'd imagine some of this time could be reduced by making these optimizations.

sgwilym commented 2 years ago

Thank you for these pointers! Some of them seem like they won't be difficult to apply.

Glad you were able to optimise in the meantime. I've had to do similar things to get the letterbox layer to be more performant.

sgwilym commented 2 years ago

I've made optimisations 1, 3, and 4, and seen a test which ingests 8000 documents into a memory replica go from 42 seconds to 2 seconds. Optimising getLatestDocAtPath and getAllDocsAtPath seem to have made the biggest difference there. :)

Still figuring out #2

sgwilym commented 2 years ago

The thing about not specifying order not ordering anything would be a breaking change. Right now when no order is provided, it falls back to path ASC.

There is a big improvement: the different between sorting by path ASC and not sorting is 1ms vs 12 microseconds.

But I'm wondering which way to go:

Make it so that no order is applied when no order is given (which makes some sense)
Add an explicit NONE option.

basham commented 2 years ago

Those are some amazing results!

My recommendation:

Add an explicit NONE option, and clearly state whatever default you settle on in the documentation. You can do that in a minor release.
If you choose to change the default away from ASC, then just wait to do that particular change with an upcoming major release. Making it NONE by default feels like it makes fewer assumptions about user-land, so that seems like it may be the right thing to do from a library perspective.

earthstar-project / earthstar

Performance issues with queryDocs() #255