couchbase / couchbase-lite-android-ce

The community edition of Couchbase Lite for Android
Apache License 2.0

Replicator does not handle "connection refuse" errors #11

Closed MatFl closed 5 years ago

MatFl commented 5 years ago

Using CBL 2.5 we observed some crashes and strange behavior with continuous replication.

  1. When the server itself is reachable but the service is stopped, the client receives "Connection refused" errors:
     E/CouchbaseLite/REPLICATOR: {Repl#1} Got LiteCore error: POSIX error 111 "Connection refused"
     The replicator then never enters the STOPPED state, but keeps trying to reconnect without any backoff time. Is the application supposed to stop the replication itself when the error is received and the replicator is, for example, in the OFFLINE state?

  2. When the replicator is in this state for a while, it creates new connections faster and faster.

The log is spammed with this:

2019-05-13 14:19:37.228 8468-8540/com.app W/C4Socket: C4Socket.open() clazz -> com.couchbase.lite.internal.replicator.CBLWebSocket
2019-05-13 14:19:37.228 8468-8540/com.app E/CouchbaseLite/NETWORK: CBLWebSocket.socket_open()
2019-05-13 14:19:37.232 8468-9413/com.app W/CouchbaseLite/NETWORK: WebSocketListener.onFailure() response -> null: java.net.ConnectException: Failed to connect to /10.0.2.2:4984
2019-05-13 14:19:37.232 8468-8543/com.app W/C4Socket: C4Socket.dispose() handle -> 3295512024
2019-05-13 14:19:37.233 8468-8543/com.app E/CouchbaseLite/REPLICATOR: {Repl#707}==> N8litecore4repl10ReplicatorE /data/user/0/com.app/files/db.cblite2/ ->ws://10.0.2.2:4984/db/_blipsync @0xcb9d8ac0
2019-05-13 14:19:37.233 8468-8543/com.app E/CouchbaseLite/REPLICATOR: {Repl#707} Got LiteCore error: POSIX error 111 "Connection refused"
2019-05-13 14:19:37.237 8468-8540/com.app W/C4Socket: C4Socket.open() socket -> 3295513752
2019-05-13 14:19:37.237 8468-8540/com.app W/C4Socket: C4Socket.open() clazz -> com.couchbase.lite.internal.replicator.CBLWebSocket
2019-05-13 14:19:37.237 8468-8540/com.app E/CouchbaseLite/NETWORK: CBLWebSocket.socket_open()
2019-05-13 14:19:37.241 8468-9414/com.app W/CouchbaseLite/NETWORK: WebSocketListener.onFailure() response -> null: java.net.ConnectException: Failed to connect to /10.0.2.2:4984
2019-05-13 14:19:37.241 8468-8542/com.app W/C4Socket: C4Socket.dispose() handle -> 3295513752
2019-05-13 14:19:37.242 8468-8542/com.app E/CouchbaseLite/REPLICATOR: {Repl#708}==> N8litecore4repl10ReplicatorE /data/user/0/com.app/files/db.cblite2/ ->ws://10.0.2.2:4984/db/_blipsync @0xc50139c0
2019-05-13 14:19:37.243 8468-8542/com.app E/CouchbaseLite/REPLICATOR: {Repl#708} Got LiteCore error: POSIX error 111 "Connection refused"
2019-05-13 14:19:37.247 8468-8540/com.app W/C4Socket: C4Socket.open() socket -> 3295517400
2019-05-13 14:19:37.247 8468-8540/com.app W/C4Socket: C4Socket.open() clazz -> com.couchbase.lite.internal.replicator.CBLWebSocket
2019-05-13 14:19:37.247 8468-8540/com.app E/CouchbaseLite/NETWORK: CBLWebSocket.socket_open()
2019-05-13 14:19:37.253 8468-9415/com.app W/CouchbaseLite/NETWORK: WebSocketListener.onFailure() response -> null: java.net.ConnectException: Failed to connect to /10.0.2.2:4984
2019-05-13 14:19:37.254 8468-8543/com.app W/C4Socket: C4Socket.dispose() handle -> 3295517400
2019-05-13 14:19:37.254 8468-8543/com.app E/CouchbaseLite/REPLICATOR: {Repl#709}==> N8litecore4repl10ReplicatorE /data/user/0/com.app/files/db.cblite2/ ->ws://10.0.2.2:4984/db/_blipsync @0xc5013c40
2019-05-13 14:19:37.255 8468-8543/com.app E/CouchbaseLite/REPLICATOR: {Repl#709} Got LiteCore error: POSIX error 111 "Connection refused"

As you can see, it makes hundreds of retries in a very short amount of time.

Stopping the replication at this point does not really stop the process: it keeps trying to create new connections. After a while it crashes with this exception:

    java.lang.NullPointerException: Attempt to read from field 'com.couchbase.litecore.C4Replicator com.couchbase.lite.AbstractReplicator.c4repl' on a null object reference
        at com.couchbase.lite.AbstractReplicator.access$400(AbstractReplicator.java:74)
        at com.couchbase.lite.AbstractReplicator$4.statusChanged(AbstractReplicator.java:637)
        at com.couchbase.litecore.C4Replicator.statusChangedCallback(C4Replicator.java:184)

bmeike commented 5 years ago

> When the server itself is reachable but the service is stopped, the client receives "Connection refused" errors:
> E/CouchbaseLite/REPLICATOR: {Repl#1} Got LiteCore error: POSIX error 111 "Connection refused"
> The replicator then never enters the STOPPED state, but keeps trying to reconnect without any backoff time.

Yes. This is intended behavior. It should, however, back off exponentially.
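
For readers unfamiliar with the term, here is a minimal sketch of capped exponential backoff: each retry waits twice as long as the previous one, up to a ceiling. This is not LiteCore's actual retry code. The class name `RetryBackoff`, the base delay, and the cap are all assumptions made for illustration.

```java
/**
 * Hypothetical helper (not part of Couchbase Lite or LiteCore):
 * computes capped exponential retry delays.
 */
final class RetryBackoff {
    private final long baseMillis;
    private final long maxMillis;

    RetryBackoff(long baseMillis, long maxMillis) {
        if (baseMillis <= 0 || maxMillis < baseMillis) {
            throw new IllegalArgumentException("require 0 < baseMillis <= maxMillis");
        }
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
    }

    /** Delay before retry number {@code attempt} (0-based): base * 2^attempt, capped at max. */
    long delayMillis(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        long delay = baseMillis;
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            // Double the delay, clamping to the cap without risking overflow.
            delay = (delay > maxMillis / 2) ? maxMillis : delay * 2;
        }
        return delay;
    }
}
```

With an assumed base of 1 second and a cap of 60 seconds, the delays grow as 1 s, 2 s, 4 s, 8 s and so on, then stay at 60 s. The log above shows the opposite pattern: new attempts every few milliseconds.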

> Is the application supposed to stop the replication itself when the error is received and the replicator is, for example, in the OFFLINE state?

The application can do so if desired. The replication should stop after 2 retries.

... unless, of course, the replicator is configured continuous.
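
If an application does want to stop a continuous replication itself, one possible approach is to count consecutive connection failures and stop once a threshold is reached. The sketch below shows only that counting policy. `ConnectionFailureGuard` and its methods are hypothetical names, not Couchbase Lite API, and wiring it to the replicator's status callbacks is left out because it depends on the library version in use.

```java
/**
 * Hypothetical app-side policy (not part of Couchbase Lite):
 * tells the app when to give up after repeated connection failures.
 */
final class ConnectionFailureGuard {
    private final int maxConsecutiveFailures;
    private int consecutiveFailures;

    ConnectionFailureGuard(int maxConsecutiveFailures) {
        if (maxConsecutiveFailures <= 0) {
            throw new IllegalArgumentException("maxConsecutiveFailures must be > 0");
        }
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /** Records one failed attempt; returns true once the app should stop the replicator. */
    synchronized boolean onFailure() {
        consecutiveFailures++;
        return consecutiveFailures >= maxConsecutiveFailures;
    }

    /** Resets the counter after a successful connection. */
    synchronized void onConnected() {
        consecutiveFailures = 0;
    }
}
```

The methods are synchronized on the assumption that status callbacks may arrive on different threads, as the varying thread IDs in the log suggest.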

> When the replicator is in this state for a while, it creates new connections faster and faster.

Your log shows several different replicators running. I'm curious as to why you are running so many at once...

MatFl commented 5 years ago

> > Is the application supposed to stop the replication itself when the error is received and the replicator is, for example, in the OFFLINE state?
>
> The application can do so if desired. The replication should stop after 2 retries.
>
> ... unless, of course, the replicator is configured continuous.

It is a continuous replication.

> > When the replicator is in this state for a while, it creates new connections faster and faster.
>
> Your log shows several different replicators running. I'm curious as to why you are running so many at once...

I only have one continuous push and pull replicator started. Whatever is happening here is done internally in the replicator.

MatFl commented 5 years ago

I just noticed that on an actual device this issue is not as critical. The enormous number of connections came from tests on an emulator, but I don't understand why this would make a difference.

However, even on the phone, the replicator sometimes keeps trying to create new connections when it is supposed to be stopped.

bmeike commented 5 years ago

@MatFl Opened https://issues.couchbase.com/browse/CBL-131 to track this. Please follow it there.