Performance when pulling large data sets

cloudant / CDTDatastore

Cloudant Sync iOS datastore library.

Apache License 2.0

175 stars 53 forks source link

Performance when pulling large data sets #460

Closed davidperrenoud closed 2 years ago

davidperrenoud commented 4 years ago

I benchmarked pulling 100 000 simple documents using datastore.pull(). It took 146 seconds in the iOS simulator.

By comparison, inserting 100 000 documents (batches of 1000) takes 12 seconds. Querying them back at once by calling /_all_docs?include_docs=true takes 19 seconds. The remote database in /opt/couchdb/data weighs 16 Mb. The local SQLite database in my_datastore.touchdb similarly weighs 18.8 Mb.

The pull replication seems to be either disk or CPU bound, as it is writing to the disk at 50 MB/s and uses 75% of the CPU in TDReplicator. I have not profiled this further.

I have tried changing the following settings without success:

initialRevsCapacity: 1000
kInboxCapacity: 1000
initWithCapacity: 2000
kChangesFeedLimit: 1000u
kMaxRevsToGetInBulk: 500u
kMaxNumberOfAttsSince: 500u

I guess that these settings are only related to the network performance, which is good enough, and not the writing to disk of the many documents.

What can be tweaked on the remote CouchDB database or in my iOS app to improve the pull replication?

Or should I just switch to Core Data?

ricellis commented 4 years ago

I guess that these settings are only related to the network performance

Yes, in general CouchDB's replication protocol is chatty- there is a lot of back and forth between the client and server. Initial replications are often slow because there are usually more docs to pull - later replications go from a checkpoint (clearly this depends on circumstance, if your initial pull is from a DB with 0 docs and it then gets populated with 100000 before your second pull having the checkpoint isn't going to help).

Typically speeding up replication performance is done by reducing the number of request round trips needed (e.g. increasing the bulk batch size and/or changes feed limit - which it sounds like you've tried) and/or reducing the set of documents that need to replicated (e.g. using filters to only pull the necessary documents to any single device).

There is a fair amount of CPU work involved in processing the response bodies, breaking them down into the individual docs and extracting the necessary metadata for the SQL tables and so on. The local datastore maintains a revision tree so it also has to run through checks for new winning revisions etc. The documents are written into the local SQLite datastore in batches, but there is still transactional overhead around those. I've not seen these things be slower than the network before.

By comparison, inserting 100 000 documents (batches of 1000) takes 12 seconds. Querying them back at once by calling /_all_docs takes 4 seconds.

I guess these numbers are talking directly to the remote from the device without the local datastore. This sync library's primary use-case is keeping local and remote datastores in sync. I think the questions for the sync use case are:

Do you need to make offline document modifications and replicate them back to the server?
Do you need to resolve conflicts on device?
Do you need offline "mango" style queries?
Do you later want to pull updates only for documents that have changed remotely?

If the answers to these questions are "no" then some variation of pulling documents from the remote _all_docs (or a remote view or query) and caching them on device may well perform better than using the sync solution.

davidperrenoud commented 4 years ago

Thank you very much!

I now benchmarked indexing the 100 000 simple documents. It took 30 seconds.

With large datatsets is it correct that it is more efficient to have just one big index than many smaller indexes? At least that is what the performance example in query.md seems to imply. For example the following would be more efficient:

datastore.ensureIndexed(["type" "test" "other"], withName: "everything_index")

And the following would be slower:

datastore.ensureIndexed(["type"], withName: "type_index")
datastore.ensureIndexed(["test"], withName: "test_index")
datastore.ensureIndexed(["other"], withName: "other_index")

Right?

Alternatively, is it possible to defines views in CDTDatastore and generate indexes only on these views?

ricellis commented 4 years ago

With large datatsets is it correct that it is more efficient to have just one big index than many smaller indexes?

Yes, because from the perspective of building or updating the index it effectively reads each document once instead of 3 times (for your example).

ricellis commented 4 years ago

Alternatively, is it possible to defines views in CDTDatastore and generate indexes only on these views?

No, we don't have either local MR views or partial indexes in our sync libraries.

davidperrenoud commented 4 years ago

Is it correct to put all the field names together? For example, will the first index above be used when querying only on test or only on other?

From my tests, the following SQLite queries are made:

INSERT INTO _t_cloudant_sync_query_metadata (index_name, index_type, field_name, last_sequence)
VALUES ('everything_index', 'json', '_id', 0);
INSERT INTO _t_cloudant_sync_query_metadata (index_name, index_type, field_name, last_sequence)
VALUES ('everything_index', 'json', '_rev', 0);
INSERT INTO _t_cloudant_sync_query_metadata (index_name, index_type, field_name, last_sequence)
VALUES ('everything_index', 'json', 'type', 0);
INSERT INTO _t_cloudant_sync_query_metadata (index_name, index_type, field_name, last_sequence)
VALUES ('everything_index', 'json', 'test', 0);
INSERT INTO _t_cloudant_sync_query_metadata (index_name, index_type, field_name, last_sequence)
VALUES ('everything_index', 'json', 'other', 0);

CREATE TABLE "_t_cloudant_sync_query_index_everything_index
( "_id" NONE, "_rev" NONE, "type" NONE, "test" NONE, "other" NONE );

CREATE INDEX "_t_cloudant_sync_query_index_everything_index_index"
ON "_t_cloudant_sync_query_index_everything_index"
( "_id", "_rev", "type", "test", "other" );

Therefore, is the order of the field names important?

davidperrenoud commented 4 years ago

I did a DROP INDEX on this table and it does not seem to be relevant for performance as all the queries still take < 10 ms. What seems very relevant though is that this table should contain all the field names used in queries in order to avoid the worst case scenario of having to iterate on all documents. Is that correct?

ricellis commented 4 years ago

Is it correct to put all the field names together?

Sorry I should have been more explixit about this. It depends on the queries. If you need to query them independently then you may want 3 separate indexes.

For example, will the first index above be used when querying only on test or only on other?

The multi-field index would be used if you were querying:

type alone
type and test
type and test and other

If you queried test or other alone, or $or queries of the fields then the multi-field index won't help.

Therefore, is the order of the field names important?

Yes, the field order is important for index creation because it controls which queries will be able to utilize the index.

At query time the order of the fields doesn't matter, but the indexes that are available does. Basically you need to create indexes that match the queries you need to do most often.

If you need to query the fields independently then a separate index for each field will perform better at query time (at the expense of slower index builds ofc)
If you need to query using $and of all 3 fields then the multi-field index will be better
If you need to do both then you could create the 3 indexes:
- type test other
- test
- other

The SQLite query planner itself will choose only a single index to answer a query. Our translation between Cloudant style queries and SQL allows us to break a query apart to utilize multiple indexes to answer a query (e.g. if using $or conditions with multiple fields) so for those cases indexing the fields independently works better.

What seems very relevant though is that this table should contain all the field names used in queries in order to avoid the worst case scenario of having to iterate on all documents. Is that correct?

You definitely want to avoid it falling back to reading all documents - it logs a warning if it does that. Hopefully my answers above address the ordering situation.

davidperrenoud commented 4 years ago

If you queried test or other alone, or $or queries of the fields then the multi-field index won't help.

The index on the _t_cloudant_sync_query_index table would not help but the table itself would still be used to do a linear scan in SQLite, right?

If that is correct having a monster index with all the field names would allow to query in O(log n) on type, type and test, type and test and other and in O(n) on all the other combinations... which can still be fast enough for 100 000 elements (< 10 ms).

ricellis commented 4 years ago

The index on the _t_cloudant_sync_query_index table would not help but the table itself would still be used to do a linear scan in SQLite, right?

Yes, quite right, I was not clear about the distinction between the SQL index vs CDTDatastore index. As you point out, creating the CDTDatastore index does create an additional table with columns of extracted field values. If all the fields in a query are available from a CDTDatastore index then that table (and ofc the associated SQL indexes where possible) will be used.

having a monster index with all the field names

I think this should be OK for simple documents. The down-side of the monster index is that because it contains all fields it will be able to answer every query so there is no flexibility to gain performance on the "other combination" queries. It sounds like you are comfortable with the _O_(n) performance there though.