Closed rhyshort closed 7 years ago
The assertion that failed was for whether the replication completed. Knowing it doesn't complete is not as helpful as seeing what error prevented the completion. I propose changing replication tests that assert replication finished/did not error to instead throw on error before asserting that finish was called.
Although I merged that extra debug it turns out that some of the other tests decide to catch the exceptions and hide the causes e.g.:
com.cloudant.sync.replication.AttachmentsPullTest > pullRevisionsWithBigAttachments[0] FAILED java.lang.AssertionError: Create/pull error java.lang.Exception: Replication errored at org.junit.Assert.fail(Assert.java:88) at com.cloudant.sync.replication.AttachmentsPullTest.pullRevisionsWithBigAttachments(AttachmentsPullTest.java:105)
This means we'll need some more test changes to try and expose the real errors.
Despite the changes from #375 the database exists issue turned up again in https://travis-ci.org/cloudant/sync-android/builds/170758346:
com.cloudant.sync.replication.AttachmentsPullTest > dontPullAttachmentAlreadyPulled[0] FAILED CouchException: error: file_exists, reason: The database could not be created, the file already exists., statusCode: 412, msg: Precondition Failed, cause: java.io.IOException: Server returned HTTP response code: 412 for URL: http://127.0.0.1:5984/attachments_pull_test1477485249807 Caused by: java.io.IOException: Server returned HTTP response code: 412 for URL: http://127.0.0.1:5984/attachments_pull_test1477485249807
We have tended to consider this a different failure mode to that reported in this issue, however, the deletion of a database with a duplicate name at some point after a subsequent test started replicating could cause the replication error/assertion failure seen here (since we don't know the exact error) so the two issues may still be related.
I guess it is possible that System.currentTimeMillis()
has an insufficient level of uniqueness for the Travis runs maybe being impacted by wall clock time changes or something. I can't see any other obvious reason why the DB name would not be unique within the test scope.
With the additional debug (and increased uniqueness of database naming) I saw this failure:
com.cloudant.sync.replication.AttachmentsPullTest > pullRevisionsWithBigAttachments[0] FAILED java.lang.Exception: Replication errored at com.cloudant.sync.util.TestEventListener.assertReplicationCompletedOrThrow(TestEventListener.java:31) at com.cloudant.sync.replication.AttachmentsPullTest.pull(AttachmentsPullTest.java:291) at com.cloudant.sync.replication.AttachmentsPullTest.pullRevisionsWithBigAttachments(AttachmentsPullTest.java:94) Caused by: java.lang.RuntimeException: Problem getting response from queue because the original request threw an exception: at com.cloudant.sync.replication.GetRevisionTaskThreaded$GetRevisionTaskIterator.next(GetRevisionTaskThreaded.java:193) at com.cloudant.sync.replication.GetRevisionTaskThreaded$GetRevisionTaskIterator.next(GetRevisionTaskThreaded.java:154) at com.cloudant.sync.replication.PullStrategy.processOneChangesBatch(PullStrategy.java:323) at com.cloudant.sync.replication.PullStrategy.replicate(PullStrategy.java:252) at com.cloudant.sync.replication.PullStrategy.run(PullStrategy.java:177) at com.cloudant.sync.replication.AttachmentsPullTest.pull(AttachmentsPullTest.java:290) ... 1 more Caused by: java.lang.RuntimeException: java.net.SocketException: Socket closed Caused by: java.net.SocketException: Socket closed
If we merge the additional debug, we can see if all the different failure modes are backed by this type of exception.
com.cloudant.sync.replication.AttachmentsPullTest > pullRevisionsWithBigTextAttachments[0] FAILED
CouchException: error: file_exists, reason: The database could not be created, the file already exists., statusCode: 412, msg: Precondition Failed, cause: java.io.IOException: Server returned HTTP response code: 412 for URL: http://127.0.0.1:5984/attachments_pull_test14786923116128259212773175
Caused by:
java.io.IOException: Server returned HTTP response code: 412 for URL: http://127.0.0.1:5984/attachments_pull_test14786923116128259212773175
Another failure mode:
com.cloudant.sync.replication.AttachmentsPullTest > pullRevisionsWithBigAttachments[0] FAILED
java.lang.AssertionError: Attachment is null
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertNotNull(Assert.java:621)
at com.cloudant.sync.replication.AttachmentsPullTest.pullRevisionsWithBigAttachments(AttachmentsPullTest.java:108)
Seen again on Travis
Saw another socket closed replication error on a Jenkins build.
Another failure on jenkins build
I'm reasonably convinced now that these intermittent errors are due to network/connectivity issues. As described in #456 the retry logic only extends as far as getting HTTP response headers and status code, but does not apply during the consumption of the response body stream. I have a proposal for extending the scope of the retry in that PR, but it may be that we want to extract the retry logic into the cloudant-http library.
The 412 precondition failed database already exists
failure mode of this test is caused by a different problem with our retry logic. There is a failure window where it is possible that the server receives the request, but we fail to receive the response. In the database create case the subsequent retry receives a 412
since the server has already received the request and created the database. The PUT
request is idempotent, but returns a different response code for cases where it exists vs was actually created.
At this point I don't understand why this test is so much more vulnerable than others to this type of problem.
Nonetheless there are at least two things to address:
I think the changes in #456 are covering this reasonably well now. After it merges and runs for a while we'll have a better idea if there are still problems.
Slightly different failure this time: