cloudant / sync-android

A JSON-based document datastore for Android applications
Apache License 2.0

DataStore batch insert #240

Closed. aralizer closed this issue 8 years ago.

aralizer commented 8 years ago

Hi, I want to incorporate the cloudant/sync-android library in my Android application to store sensor data. I am using the following procedure:

  1. Collect the data in a temp file (JSON or SQLite DB) after each collection interval.
  2. Transfer the lines to the Cloudant DataStore for encrypted storage.
  3. Upload the DataStore to a remote Cloudant DB at certain intervals.

The problem I've encountered is that insertion times of records into the local DataStore are very long. Using the createDocumentFromRevision insertion method with the byte array parameter, it takes about 3-4 minutes to insert 2-3K records one by one. Is there a way to 'batch-insert' a large number of records? Or is there a way to 'merge-insert' a whole SQLite DB file?
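For reference, this is roughly what my insertion loop looks like (a minimal sketch against the pre-1.0 `DatastoreManager`/`MutableDocumentRevision` API; the datastore name and record source here are illustrative, and exact signatures may differ by version):

```java
import com.cloudant.sync.datastore.Datastore;
import com.cloudant.sync.datastore.DatastoreManager;
import com.cloudant.sync.datastore.DocumentBodyFactory;
import com.cloudant.sync.datastore.MutableDocumentRevision;

import java.io.File;
import java.util.List;

public class SensorInserter {

    // Inserts one document per JSON record; this is the slow path described above.
    public static void insertAll(File datastoreDir, List<byte[]> jsonRecords) throws Exception {
        DatastoreManager manager = new DatastoreManager(datastoreDir);
        Datastore datastore = manager.openDatastore("sensor_data"); // name is illustrative
        try {
            for (byte[] json : jsonRecords) {
                MutableDocumentRevision rev = new MutableDocumentRevision();
                rev.body = DocumentBodyFactory.create(json); // byte[] containing one JSON record
                datastore.createDocumentFromRevision(rev);   // one insert per record
            }
        } finally {
            datastore.close();
        }
    }
}
```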

tomblench commented 8 years ago

@Aralizer can you tell us what version of sync-android you are using? We have looked in the past at increasing performance and this may be an area where we can improve.

So we can understand your workload better, can you describe what your documents typically look like? We would like to know the typical document size in bytes, whether you have any attachments, and how regularly you intend to log data from the sensor (once per second, once per minute?)

Also, you mention encryption: are you using the on-device encryption offered by our library? This will definitely impact performance compared to non-encrypted usage.

We currently don't intend to support any batch or merge APIs, as these are quite low-level and not in keeping with the current focus of the library. That said, other libraries or strategies might be more appropriate for an embedded situation like yours, especially if you don't intend to use the full two-way replication abilities of our library. For instance, if your application only ever sends sensor readings and never receives data from a remote CouchDB/Cloudant instance, I can suggest some alternative strategies.

aralizer commented 8 years ago

@tomblench I'm experimenting with the version I get from the Gradle import with `version: 'latest.release'`, as mentioned in the library's README.

My documents are JSON records with the following structure, with no attachments: `{"sensor-name": "gyro", "timestamp": 1234567, "value": {"att1": 0, "att2": 1, ...}}`. My tests were made with 3 sensors, a probing duration of 1 second, and an interval of 30 seconds. Each reading is inserted as the JSON above, directly into a temp file, as text. This kind of sensor logging results in around 1-1.5K distinct records per probing interval, taking around 1MB of storage (as string data).

Every 60 seconds I run a service that reads this file as a list of strings, converts each of them to a byte array, and inserts them into the DataStore with the createDocumentFromRevision method. The net insert rate (excluding reading the file and the byte array conversion) is about 10 records per second, without encryption (I haven't tried encryption yet).

You are correct, currently I intend to use the library only for one-way replication. Which alternatives are available?

tomblench commented 8 years ago

Hi @Aralizer - see https://gist.github.com/tomblench/da3709236282c52b18fb for an example of calling the _bulk_docs REST endpoint with only the Android built-in classes. In testing I could achieve over 1000 records/second on my account. This may suit your performance demands better than using the sync library.
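In outline, the approach looks something like the sketch below (simplified, not the gist verbatim; ACCOUNT, mydb, and the auth header are placeholders, and error handling is omitted):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class BulkUploader {

    // POST a whole batch of JSON documents to the _bulk_docs endpoint in one request.
    // "https://ACCOUNT.cloudant.com" and "mydb" are placeholders for your account/database.
    public static int bulkInsert(List<String> jsonDocs, String authHeader) throws Exception {
        StringBuilder body = new StringBuilder("{\"docs\":[");
        for (int i = 0; i < jsonDocs.size(); i++) {
            if (i > 0) body.append(',');
            body.append(jsonDocs.get(i)); // each entry is already a JSON object string
        }
        body.append("]}");

        URL url = new URL("https://ACCOUNT.cloudant.com/mydb/_bulk_docs");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setRequestProperty("Authorization", authHeader); // e.g. basic auth
            OutputStream out = conn.getOutputStream();
            out.write(body.toString().getBytes(StandardCharsets.UTF_8));
            out.close();
            return conn.getResponseCode(); // 201 indicates the batch was accepted
        } finally {
            conn.disconnect();
        }
    }
}
```

Because a single _bulk_docs POST carries the whole batch, the per-request overhead is paid once rather than once per record.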

falmanna commented 7 years ago

Hi @tomblench,

I need to do the same thing but store the data locally, and I can't find a bulk-insert API. Is it possible to do this somehow?

tomblench commented 7 years ago

@firasalmanna sorry but there currently aren't any APIs to do bulk insert/update of local documents. Depending on demand for this feature and other priorities, we may consider implementing it in the future.

falmanna commented 7 years ago

@tomblench I did it by looping and inserting one at a time, but it is very slow: it took around 10 seconds to insert 50 objects of 6KB each.
How do you handle bulk inserting in pull replication?

tomblench commented 7 years ago

@firasalmanna the short answer is that a list of revisions is inserted into the database by calling the underlying SQLite database with a set of SQL commands grouped into one transaction for the entire batch (see the sketch at the end of this comment).

You can see the start of that process here, and if you follow the code through all of the Callables you can see all the gory logic for deciding whether to insert into a new or existing rev tree, dealing with attachments, etc.

Theoretically the logic would be a little bit simpler for a batch insert/update on the local database (because we don't have to deal with conflicts - which we do have to consider when pull replicating), but it wouldn't be trivial to write.
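As a general illustration of the transaction-grouping idea (plain android.database.sqlite, not this library's internals; the table and column names are made up):

```java
import android.content.ContentValues;
import android.database.sqlite.SQLiteDatabase;

import java.util.List;

public class BatchWriter {

    // Group many inserts into a single transaction so SQLite commits once per
    // batch instead of once per row. Table/column names here are illustrative.
    public static void insertBatch(SQLiteDatabase db, List<byte[]> jsonBodies) {
        db.beginTransaction();
        try {
            for (byte[] body : jsonBodies) {
                ContentValues values = new ContentValues();
                values.put("json", body);
                db.insertOrThrow("docs", null, values);
            }
            db.setTransactionSuccessful(); // commit everything at once
        } finally {
            db.endTransaction(); // rolls back if setTransactionSuccessful wasn't called
        }
    }
}
```

Committing once per batch means SQLite only has to sync to disk once for the whole set of rows, which is typically where most of the per-insert cost goes.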