cloudant / sync-android

A JSON-based document datastore for Android applications
Apache License 2.0
267 stars 91 forks source link

Attachments Pull Test Failure #381

Closed rhyshort closed 7 years ago

rhyshort commented 7 years ago

Slightly different failure this time:

com.cloudant.sync.replication.AttachmentsPullTest > pullRevisionsWithBigAttachments[0] FAILED
    java.lang.AssertionError
        at org.junit.Assert.fail(Assert.java:86)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.junit.Assert.assertTrue(Assert.java:52)
        at com.cloudant.sync.replication.AttachmentsPullTest.pull(AttachmentsPullTest.java:318)
        at com.cloudant.sync.replication.AttachmentsPullTest.pullRevisionsWithBigAttachments(AttachmentsPullTest.java:103)
rhyshort commented 7 years ago

https://travis-ci.org/cloudant/sync-android/builds/170408761

ricellis commented 7 years ago

The assertion that failed was for whether the replication completed. Knowing it doesn't complete is not as helpful as seeing what error prevented the completion. I propose changing replication tests that assert replication finished/did not error to instead throw on error before asserting that finish was called.

ricellis commented 7 years ago

Although I merged that extra debug it turns out that some of the other tests decide to catch the exceptions and hide the causes e.g.:

com.cloudant.sync.replication.AttachmentsPullTest > pullRevisionsWithBigAttachments[0] FAILED java.lang.AssertionError: Create/pull error java.lang.Exception: Replication errored at org.junit.Assert.fail(Assert.java:88) at com.cloudant.sync.replication.AttachmentsPullTest.pullRevisionsWithBigAttachments(AttachmentsPullTest.java:105)

This means we'll need some more test changes to try and expose the real errors.

ricellis commented 7 years ago

Despite the changes from #375 the database exists issue turned up again in https://travis-ci.org/cloudant/sync-android/builds/170758346:

com.cloudant.sync.replication.AttachmentsPullTest > dontPullAttachmentAlreadyPulled[0] FAILED CouchException: error: file_exists, reason: The database could not be created, the file already exists., statusCode: 412, msg: Precondition Failed, cause: java.io.IOException: Server returned HTTP response code: 412 for URL: http://127.0.0.1:5984/attachments_pull_test1477485249807 Caused by: java.io.IOException: Server returned HTTP response code: 412 for URL: http://127.0.0.1:5984/attachments_pull_test1477485249807

We have tended to consider this a different failure mode to that reported in this issue, however, the deletion of a database with a duplicate name at some point after a subsequent test started replicating could cause the replication error/assertion failure seen here (since we don't know the exact error) so the two issues may still be related.

I guess it is possible that System.currentTimeMillis() has an insufficient level of uniqueness for the Travis runs maybe being impacted by wall clock time changes or something. I can't see any other obvious reason why the DB name would not be unique within the test scope.

ricellis commented 7 years ago

With the additional debug (and increased uniqueness of database naming) I saw this failure:

com.cloudant.sync.replication.AttachmentsPullTest > pullRevisionsWithBigAttachments[0] FAILED java.lang.Exception: Replication errored at com.cloudant.sync.util.TestEventListener.assertReplicationCompletedOrThrow(TestEventListener.java:31) at com.cloudant.sync.replication.AttachmentsPullTest.pull(AttachmentsPullTest.java:291) at com.cloudant.sync.replication.AttachmentsPullTest.pullRevisionsWithBigAttachments(AttachmentsPullTest.java:94) Caused by: java.lang.RuntimeException: Problem getting response from queue because the original request threw an exception: at com.cloudant.sync.replication.GetRevisionTaskThreaded$GetRevisionTaskIterator.next(GetRevisionTaskThreaded.java:193) at com.cloudant.sync.replication.GetRevisionTaskThreaded$GetRevisionTaskIterator.next(GetRevisionTaskThreaded.java:154) at com.cloudant.sync.replication.PullStrategy.processOneChangesBatch(PullStrategy.java:323) at com.cloudant.sync.replication.PullStrategy.replicate(PullStrategy.java:252) at com.cloudant.sync.replication.PullStrategy.run(PullStrategy.java:177) at com.cloudant.sync.replication.AttachmentsPullTest.pull(AttachmentsPullTest.java:290) ... 1 more Caused by: java.lang.RuntimeException: java.net.SocketException: Socket closed Caused by: java.net.SocketException: Socket closed

If we merge the additional debug, we can see if all the different failure modes are backed by this type of exception.

ricellis commented 7 years ago

com.cloudant.sync.replication.AttachmentsPullTest > pullRevisionsWithBigTextAttachments[0] FAILED

CouchException: error: file_exists, reason: The database could not be created, the file already exists., statusCode: 412, msg: Precondition Failed, cause: java.io.IOException: Server returned HTTP response code: 412 for URL: http://127.0.0.1:5984/attachments_pull_test14786923116128259212773175

    Caused by:

    java.io.IOException: Server returned HTTP response code: 412 for URL: http://127.0.0.1:5984/attachments_pull_test14786923116128259212773175
rhyshort commented 7 years ago

Another failure mode:

com.cloudant.sync.replication.AttachmentsPullTest > pullRevisionsWithBigAttachments[0] FAILED
    java.lang.AssertionError: Attachment is null
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.junit.Assert.assertNotNull(Assert.java:621)
        at com.cloudant.sync.replication.AttachmentsPullTest.pullRevisionsWithBigAttachments(AttachmentsPullTest.java:108)

Seen again on Travis

ricellis commented 7 years ago

Saw another socket closed replication error on a Jenkins build.

rhyshort commented 7 years ago

Another failure on jenkins build

ricellis commented 7 years ago

I'm reasonably convinced now that these intermittent errors are due to network/connectivity issues. As described in #456 the retry logic only extends as far as getting HTTP response headers and status code, but does not apply during the consumption of the response body stream. I have a proposal for extending the scope of the retry in that PR, but it may be that we want to extract the retry logic into the cloudant-http library.

ricellis commented 7 years ago

The 412 precondition failed database already exists failure mode of this test is caused by a different problem with our retry logic. There is a failure window where it is possible that the server receives the request, but we fail to receive the response. In the database create case the subsequent retry receives a 412 since the server has already received the request and created the database. The PUT request is idempotent, but returns a different response code for cases where it exists vs was actually created.

At this point I don't understand why this test is so much more vulnerable than others to this type of problem.

Nonetheless there are at least two things to address:

  1. Given that the remote database create is not API for this library I think the test code should handle the 412 response and proceed normally.
  2. Validate that our current behaviour does not incorrectly retry requests that it should not (e.g. non-idempotent methods) and that we appropriately handle alternate response codes in cases where we retry idempotent requests.
ricellis commented 7 years ago

I think the changes in #456 are covering this reasonably well now. After it merges and runs for a while we'll have a better idea if there are still problems.