convergencelabs / convergence-project

The project used for Convergence Project Management and Issue Reporting
https://convergence.io
42 stars 5 forks source link

Occasional "Connection reset" and 500 Internal Server Error #190

Open earshinov opened 3 years ago

earshinov commented 3 years ago

Hi!

We occasionally have "Connection reset" errors in Convergence server logs and 500 Internal Server Error in response to a REST API call.

GET domains/convergence/default/models/ffreport-3605: 500 Server Error
b'{"body":{"error_code":"internal_server_error"},"ok":false}'

Convergence server logs:

INFO: Caught Network I/O errors on odb.pl-grt-orientdb:2424/7048553260737367023, trying an automatic reconnection... (error: Connection reset)
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:186) ~[?:?]
    at java.net.SocketInputStream.read(SocketInputStream.java:140) ~[?:?]
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:252) ~[?:?]
    at java.io.BufferedInputStream.read(BufferedInputStream.java:271) ~[?:?]
    at java.io.DataInputStream.readByte(DataInputStream.java:270) ~[?:?]
    at com.orientechnologies.orient.enterprise.channel.binary.OChannelBinary.readByte(OChannelBinary.java:75) ~[com.orientechnologies.orientdb-core-3.0.32.jar:3.0.32]
    at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.beginResponse(OChannelBinaryAsynchClient.java:176) ~[com.orientechnologies.orientdb-client-3.0.32.jar:3.0.32]
    at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.beginResponse(OChannelBinaryAsynchClient.java:160) ~[com.orientechnologies.orientdb-client-3.0.32.jar:3.0.32]
    at com.orientechnologies.orient.client.remote.OStorageRemote.beginResponse(OStorageRemote.java:1891) ~[com.orientechnologies.orientdb-client-3.0.32.jar:3.0.32]
    at com.orientechnologies.orient.client.remote.OStorageRemote.lambda$networkOperationRetryTimeout$2(OStorageRemote.java:248) ~[com.orientechnologies.orientdb-client-3.0.32.jar:3.0.32]
    at com.orientechnologies.orient.client.remote.OStorageRemote.baseNetworkOperation(OStorageRemote.java:307) ~[com.orientechnologies.orientdb-client-3.0.32.jar:3.0.32]
    ... 42 more
16:34:35 ERROR RealtimeModelActor - Unexpected error getting model
com.orientechnologies.common.io.OIOException: Connection reset
    at com.orientechnologies.orient.client.remote.OStorageRemote.baseNetworkOperation(OStorageRemote.java:351) ~[com.orientechnologies.orientdb-client-3.0.32.jar:3.0.32]
    at com.orientechnologies.orient.client.remote.OStorageRemote.networkOperationRetryTimeout(OStorageRemote.java:228) ~[com.orientechnologies.orientdb-client-3.0.32.jar:3.0.32]
    at com.orientechnologies.orient.client.remote.OStorageRemote.networkOperationNoRetry(OStorageRemote.java:261) ~[com.orientechnologies.orientdb-client-3.0.32.jar:3.0.32]
    at com.orientechnologies.orient.client.remote.OStorageRemote.command(OStorageRemote.java:947) ~[com.orientechnologies.orientdb-client-3.0.32.jar:3.0.32]
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentRemote.indexQuery(ODatabaseDocumentRemote.java:397) ~[com.orientechnologies.orientdb-client-3.0.32.jar:3.0.32]
    at com.orientechnologies.orient.core.index.OIndexRemoteOneValue.get(OIndexRemoteOneValue.java:48) ~[com.orientechnologies.orientdb-core-3.0.32.jar:3.0.32]
    at com.orientechnologies.orient.core.index.OIndexRemoteOneValue.get(OIndexRemoteOneValue.java:39) ~[com.orientechnologies.orientdb-core-3.0.32.jar:3.0.32]
    at com.orientechnologies.orient.core.index.OIndexAbstractDelegate.get(OIndexAbstractDelegate.java:59) ~[com.orientechnologies.orientdb-core-3.0.32.jar:3.0.32]
    at com.orientechnologies.orient.core.index.OIndexTxAwareOneValue.get(OIndexTxAwareOneValue.java:231) ~[com.orientechnologies.orientdb-core-3.0.32.jar:3.0.32]
    at com.orientechnologies.orient.core.index.OIndexTxAwareOneValue.get(OIndexTxAwareOneValue.java:38) ~[com.orientechnologies.orientdb-core-3.0.32.jar:3.0.32]
    at com.convergencelabs.convergence.server.backend.datastore.OrientDBUtil$.$anonfun$findDocumentFromSingleValueIndex$1(OrientDBUtil.scala:189) ~[com.convergencelabs.convergence-server-1.0.0-rc.7.jar:1.0.0-rc.7]
    at scala.util.Try$.apply(Try.scala:210) ~[org.scala-lang.scala-library-2.13.2.jar:?]
    at com.convergencelabs.convergence.server.backend.datastore.OrientDBUtil$.findDocumentFromSingleValueIndex(OrientDBUtil.scala:185) ~[com.convergencelabs.convergence-server-1.0.0-rc.7.jar:1.0.0-rc.7]
    at com.convergencelabs.convergence.server.backend.datastore.domain.model.ModelStore$.com$convergencelabs$convergence$server$backend$datastore$domain$model$ModelStore$$findModelDocument(ModelStore.scala:333) ~[com.convergencelabs.convergence-server-1.0.0-rc.7.jar:1.0.0-rc.7]
    at com.convergencelabs.convergence.server.backend.datastore.domain.model.ModelStore.$anonfun$getModel$1(ModelStore.scala:198) ~[com.convergencelabs.convergence-server-1.0.0-rc.7.jar:1.0.0-rc.7]
    at com.convergencelabs.convergence.server.backend.db.PooledDatabaseProvider.$anonfun$withDatabase$3(PooledDatabaseProvider.scala:58) ~[com.convergencelabs.convergence-server-1.0.0-rc.7.jar:1.0.0-rc.7]
    at com.convergencelabs.convergence.server.util.TryWithResource.tryWithResolvedResource(TryWithResource.scala:50) ~[com.convergencelabs.convergence-server-1.0.0-rc.7.jar:1.0.0-rc.7]
    at com.convergencelabs.convergence.server.util.TryWithResource.com$convergencelabs$convergence$server$util$TryWithResource$$tryWithResource(TryWithResource.scala:36) ~[com.convergencelabs.convergence-server-1.0.0-rc.7.jar:1.0.0-rc.7]
    at com.convergencelabs.convergence.server.util.TryWithResource$.apply(TryWithResource.scala:78) ~[com.convergencelabs.convergence-server-1.0.0-rc.7.jar:1.0.0-rc.7]
    at com.convergencelabs.convergence.server.backend.db.PooledDatabaseProvider.$anonfun$withDatabase$1(PooledDatabaseProvider.scala:56) ~[com.convergencelabs.convergence-server-1.0.0-rc.7.jar:1.0.0-rc.7]
    at scala.util.Success.flatMap(Try.scala:258) ~[org.scala-lang.scala-library-2.13.2.jar:?]
    at com.convergencelabs.convergence.server.backend.db.PooledDatabaseProvider.withDatabase(PooledDatabaseProvider.scala:55) ~[com.convergencelabs.convergence-server-1.0.0-rc.7.jar:1.0.0-rc.7]
    at com.convergencelabs.convergence.server.backend.datastore.AbstractDatabasePersistence.withDb(AbstractDatabasePersistence.scala:32) ~[com.convergencelabs.convergence-server-1.0.0-rc.7.jar:1.0.0-rc.7]
    at com.convergencelabs.convergence.server.backend.datastore.domain.model.ModelStore.getModel(ModelStore.scala:196) ~[com.convergencelabs.convergence-server-1.0.0-rc.7.jar:1.0.0-rc.7]
    at com.convergencelabs.convergence.server.backend.services.domain.model.RealtimeModelActor.$anonfun$retrieveModel$3(RealtimeModelActor.scala:325) ~[com.convergencelabs.convergence-server-1.0.0-rc.7.jar:1.0.0-rc.7]
    at com.convergencelabs.convergence.server.backend.services.domain.model.RealtimeModelActor.$anonfun$retrieveModel$3$adapted(RealtimeModelActor.scala:323) ~[com.convergencelabs.convergence-server-1.0.0-rc.7.jar:1.0.0-rc.7]
    at scala.util.Success.flatMap(Try.scala:258) ~[org.scala-lang.scala-library-2.13.2.jar:?]
    at com.convergencelabs.convergence.server.backend.services.domain.model.RealtimeModelActor.$anonfun$retrieveModel$1(RealtimeModelActor.scala:323) ~[com.convergencelabs.convergence-server-1.0.0-rc.7.jar:1.0.0-rc.7]
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) ~[org.scala-lang.scala-library-2.13.2.jar:?]
    at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:671) ~[org.scala-lang.scala-library-2.13.2.jar:?]
    at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:430) ~[org.scala-lang.scala-library-2.13.2.jar:?]
    at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:56) ~[com.typesafe.akka.akka-actor_2.13-2.6.6.jar:2.6.6]
    at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:93) ~[com.typesafe.akka.akka-actor_2.13-2.6.6.jar:2.6.6]
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) [org.scala-lang.scala-library-2.13.2.jar:?]
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:94) [org.scala-lang.scala-library-2.13.2.jar:?]
    at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:93) [com.typesafe.akka.akka-actor_2.13-2.6.6.jar:2.6.6]
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:48) [com.typesafe.akka.akka-actor_2.13-2.6.6.jar:2.6.6]
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:48) [com.typesafe.akka.akka-actor_2.13-2.6.6.jar:2.6.6]
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) [?:?]
    at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) [?:?]
    at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) [?:?]
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) [?:?]
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:177) [?:?]

More details:

Maybe you could give us some hints for what might go wrong and tips on what Convergence server settings to tweak. Since GET should be a safe method, should we implement retries on the side of our application that makes REST API calls?

Versions

Describe the Bug

REST API client sometimes gets 500 Internal Server Error from Convergence.

Expected Behavior

If Convergence server has some kind of an OrientDB connection pool which might contain stale connections, it should reconnect when hitting connection reset.

Alternatively, if retries are considered a caller's duty, Convergence server should probably return 503 instead of 500 to signify that the client should make a retry.

mmacfadden commented 3 years ago

Thanks for reporting. This does seem like a bug. I do believe we are using the built in OrientDB connection pooling capability. In general, we would expect that the connection pool would be keeping the connections alive or fresh.

mmacfadden commented 3 years ago

Finally getting to look at this error. I did confirm that the error is inside OrientDB client code. Here specifically:

https://github.com/orientechnologies/orientdb/blob/3c1a6e6a9a18ecc0312058ac9f2ba86cd2365594/client/src/main/java/com/orientechnologies/orient/client/remote/OStorageRemote.java#L348

We are using the OrientDB connection pooling, so this seems like there is an error in the connection pool logic on the OrientDB side. We are in the process of upgrading from 3.0.x to 3.2.x of OrientDB so I am hoping that this will resolve the situation.

I do agree on the 503 vs 500 error.

earshinov commented 2 years ago

As a matter of fact, after we reduced the number of convergence-server instances we run in the Akka cluster from 3 to 1 we basically stopped receiving these "Connection reset" errors. They only happen occasionally (a couple of times in a month) and no longer bother us.

But as you were going to upgrade OrientDB client libraries to 3.2.x (and it seems it hasn't been done yet in 1.0.0-rc.12), I think we should keep this issue open.