cloudant / sync-android

A JSON-based document datastore for Android applications
Apache License 2.0

AttachmentNotSavedException - length mismatch #575

Closed ThomWright closed 6 years ago

ThomWright commented 6 years ago

I can test this out on v2 as well if that's useful, since we're in the process of migrating.

Problem description

We're seeing replication errors when syncing image attachments. This seems to be caused by the length property of the attachment stored in Couch not matching the number of bytes in the image returned over the network.

Originally this was caused by Cloudflare optimising images. We've turned this off, and confirmed that the images are now being sent untouched, but we still see errors.

I've deleted the local database and reinstalled the app in case anything was being cached, and I'm using a proxy to monitor the HTTP traffic.

I'm running out of ideas about what could be causing this :confused:

Exception from cloudant:

03-14 07:59:52.975: E/PullStrategy(3221): There was a problem downloading an attachment to the datastore, terminating replication
03-14 07:59:52.975: E/PullStrategy(3221): com.cloudant.sync.datastore.AttachmentNotSavedException:
  Actual length of 15273 does not equal expected length of 15482

From my proxy, confirming the correct body size of 15482 bytes:

Size
Request - 477 bytes
Response - 14.93 KB (15,284 bytes)
TLS Handshake - -
Header - 0 bytes
Cookies - -
Body - 14.93 KB (15,284 bytes)
Uncompressed Body - 15.12 KB (15,482 bytes)
Compression - 1.3% (gzip)
Total - 15.39 KB (15,761 bytes)

A curl command for the attachment, again confirming the length:

curl -s -H 'Host: couch-staging.candideapp.com' --compressed 'https://couch-staging.candideapp.com/knowledge-base-images/8cee16e3-438a-403c-a1ad-893578216ecf/image.jpg?rev=6-53ba3f969a4d74a5d8416521c3a26376' | wc -c

If there's any other useful information I can provide, let me know :slightly_smiling_face:

tomblench commented 6 years ago

@ThomWright it's due to gzip encoding. Is this something you can arrange to be turned off?

15:52:57 in java-cloudant$ curl -H accept-encoding:gzip  'https://couch-staging.candideapp.com/knowledge-base-images/8cee16e3-438a-403c-a1ad-893578216ecf/image.jpg' 2> /dev/null | wc -c
   15273
15:52:59 in java-cloudant$ curl 'https://couch-staging.candideapp.com/knowledge-base-images/8cee16e3-438a-403c-a1ad-893578216ecf/image.jpg' 2> /dev/null | wc -c
   15482

We check against the length property from the _attachments metadata, which is 15482, because CouchDB doesn't know that the data is being transparently compressed somewhere between it and the client. We send accept-encoding:gzip because CouchDB itself can compress some media types, and it signals this in the metadata with "encoding":"gzip" and an encoded_length field. Note that jpg is not one of these types by default.
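The length check described above can be illustrated with a minimal sketch. This is not the sync-android implementation (which is Java); the function names `expected_length` and `check_download` and the stand-in payload are hypothetical, and the sketch only assumes the metadata fields named in this thread (`length`, `encoding`, `encoded_length`).

```python
import gzip


def expected_length(att_meta: dict) -> int:
    """Pick the byte count the client expects on the wire for one attachment.

    Assumption: CouchDB-side compression is signalled by "encoding": "gzip"
    together with "encoded_length"; otherwise only "length" is available.
    """
    if att_meta.get("encoding") == "gzip":
        return att_meta["encoded_length"]
    return att_meta["length"]


def check_download(att_meta: dict, body_as_received: bytes) -> None:
    """Compare the received byte count with the expected one (hypothetical helper)."""
    expected = expected_length(att_meta)
    actual = len(body_as_received)
    if actual != expected:
        raise ValueError(
            f"Actual length of {actual} does not equal expected length of {expected}"
        )


# Stand-in for an image body; the real attachment is not reproduced here.
original = bytes(range(256)) * 40
meta = {"content_type": "image/jpeg", "length": len(original)}

# Body delivered untouched: the check passes.
check_download(meta, original)

# Body gzipped in flight by something CouchDB does not know about:
# the metadata still says "length", so the check fails.
in_flight = gzip.compress(original)
try:
    check_download(meta, in_flight)
except ValueError as err:
    print(err)
```

The failure in the last step mirrors the situation in this issue: the metadata carries no `encoded_length`, so the compressed byte count is compared with the uncompressed `length`.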

ThomWright commented 6 years ago

Thanks, that'll be it. I'll have a look at what we can do.

I assumed the data would have been transparently decompressed before it got to cloudant.

If an attachment is returned gzipped without the encoded_length property, can't it be decompressed and compared with length instead? This seems like a sensible thing to do since so many things might decide to helpfully (!) gzip in between the client and the couchdb instance.

tomblench commented 6 years ago

@ThomWright what you suggested would be a nice idea, but to save time and disk space we don't decompress the attachment on the client side. We store it in its compressed form and decompress the attachment stream on demand when the user wants to read it on the client side.

In order to implement your plan we'd have to decompress to /dev/null just to check the length which seems a bit wasteful to me.
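For illustration, "decompressing to /dev/null" could look like the sketch below: the gzip stream is decoded chunk by chunk, and only the decoded byte count is kept. The function name `decoded_length` and the chunk size are hypothetical; this is a sketch of the idea being discussed, not code from the library.

```python
import gzip
import zlib


def decoded_length(chunks) -> int:
    """Count the decompressed size of a gzip stream without keeping the output."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    total = 0
    for chunk in chunks:
        # Each decoded piece is measured and then discarded.
        total += len(decoder.decompress(chunk))
    total += len(decoder.flush())
    return total


# Stand-in payload with the same size as the "length" value in this issue.
payload = b"x" * 15482
compressed = gzip.compress(payload)

# Feed the compressed bytes in 4 KiB pieces, as a network stream might arrive.
pieces = [compressed[i:i + 4096] for i in range(0, len(compressed), 4096)]
print(decoded_length(pieces))
```

The cost is one full decompression pass per attachment, which is the overhead being weighed here.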

For reference, the default list of compressible types is here, although if you are mostly storing images I don't suppose you are too concerned with tweaking that.

Did you manage to get the compression between CouchDB and the client switched off?

ThomWright commented 6 years ago

Yes thanks, I did manage to fix it, but I'd rather have the gzipping turned back on if possible.

I understand the desire to store the compressed version. However, to me it makes more sense to simply store the attachment as CouchDB sent it. If the encoded_length property exists then CouchDB compressed it, so store it compressed. Otherwise, the compression happened in-flight, so decompress it and store it, again just as CouchDB sent it. The length being checked can then be the length of the actual attachment being stored.

Assuming the encoded_length property only exists when CouchDB has compressed the attachment, we can know for sure whether we need to decompress and avoid needless decompression.
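The proposal above could be sketched as follows. The function name `bytes_to_store` and the `content_encoding` parameter (standing for the response's Content-Encoding header) are hypothetical, and the sketch relies on the same assumption stated above: `encoded_length` is present only when CouchDB itself compressed the attachment.

```python
import gzip
from typing import Optional


def bytes_to_store(att_meta: dict, body: bytes,
                   content_encoding: Optional[str] = None) -> bytes:
    """Return the attachment bytes as CouchDB sent them, after a length check."""
    if "encoded_length" in att_meta:
        # CouchDB compressed the attachment: keep it compressed.
        expected = att_meta["encoded_length"]
        stored = body
    else:
        # Any gzip seen here was applied in flight: undo it before storing.
        expected = att_meta["length"]
        stored = gzip.decompress(body) if content_encoding == "gzip" else body
    if len(stored) != expected:
        raise ValueError(
            f"Actual length of {len(stored)} does not equal expected length of {expected}"
        )
    return stored


# Stand-in attachment body, gzipped by an intermediary rather than by CouchDB.
image = bytes(range(256)) * 40
meta = {"content_type": "image/jpeg", "length": len(image)}
stored = bytes_to_store(meta, gzip.compress(image), content_encoding="gzip")
print(len(stored) == meta["length"])
```

Decompression only happens on the branch without `encoded_length`, so attachments that CouchDB compressed are never decoded just to verify their size.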

tomblench commented 6 years ago

Closing - stale issue and user has workaround