JanusGraph / janusgraph

JanusGraph: an open-source, distributed graph database
https://janusgraph.org
Other
5.3k stars 1.17k forks source link

JanusGraph image stop responding after query timeout #2120

Open doryosef opened 4 years ago

doryosef commented 4 years ago

I'm using the default docker image janusgraph/janusgraph:latest (Berkeley and Lucene) and connecting with gremlin console.

When JanusGraph server exceeded his 'evaluationTimeout' the server stop responding

server error:

java.util.concurrent.TimeoutException: Evaluation exceeded the configured 'evaluationTimeout' threshold of 30000 ms or evaluation was otherwise cancelled directly for request [g.V()]
        at org.apache.tinkerpop.gremlin.groovy.engine.GremlinExecutor.lambda$eval$1(GremlinExecutor.java:316)
        at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
        at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.lang.Thread.run(Thread.java:748)
1318978 [pool-6-thread-1] WARN  org.janusgraph.diskstorage.log.kcvs.KCVSLog  - Could not read messages for timestamp [2020-05-24T10:12:30.449Z] (this read will be retried)
org.janusgraph.core.JanusGraphException: Could not execute operation due to backend exception
        at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:56)
        at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:158)
        at org.janusgraph.diskstorage.log.kcvs.KCVSLog$MessagePuller.run(KCVSLog.java:725)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.janusgraph.diskstorage.PermanentBackendException: Could not start BerkeleyJE transaction
        at org.janusgraph.diskstorage.berkeleyje.BerkeleyJEStoreManager.beginTransaction(BerkeleyJEStoreManager.java:163)
        at org.janusgraph.diskstorage.berkeleyje.BerkeleyJEStoreManager.beginTransaction(BerkeleyJEStoreManager.java:47)
        at org.janusgraph.diskstorage.keycolumnvalue.keyvalue.OrderedKeyValueStoreManagerAdapter.beginTransaction(OrderedKeyValueStoreManagerAdapter.java:68)
        at org.janusgraph.diskstorage.log.kcvs.KCVSLog.openTx(KCVSLog.java:319)
        at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:145)
        at org.janusgraph.diskstorage.util.BackendOperation$1.call(BackendOperation.java:161)
        at org.janusgraph.diskstorage.util.BackendOperation.executeDirect(BackendOperation.java:68)
        at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:54)
        ... 9 more
Caused by: com.sleepycat.je.ThreadInterruptedException: (JE 18.3.12) Environment must be closed, caused by: com.sleepycat.je.ThreadInterruptedException: Environment invalid because of previous exception: (JE 18.3.12) /var/lib/janusgraph/data java.lang.InterruptedException THREAD_INTERRUPTED: InterruptedException may cause incorrect internal state, unable to continue. Environment is invalid and must be closed.
        at com.sleepycat.je.ThreadInterruptedException.wrapSelf(ThreadInterruptedException.java:105)
        at com.sleepycat.je.dbi.EnvironmentImpl.checkIfInvalid(EnvironmentImpl.java:1835)
        at com.sleepycat.je.dbi.EnvironmentImpl.checkOpen(EnvironmentImpl.java:1844)
        at com.sleepycat.je.Environment.checkOpen(Environment.java:2697)
        at com.sleepycat.je.Environment.beginTransactionInternal(Environment.java:1409)
        at com.sleepycat.je.Environment.beginTransaction(Environment.java:1383)
        at org.janusgraph.diskstorage.berkeleyje.BerkeleyJEStoreManager.beginTransaction(BerkeleyJEStoreManager.java:146)
        ... 16 more
Caused by: com.sleepycat.je.ThreadInterruptedException: Environment invalid because of previous exception: (JE 18.3.12) /var/lib/janusgraph/data java.lang.InterruptedException THREAD_INTERRUPTED: InterruptedException may cause incorrect internal state, unable to continue. Environment is invalid and must be closed.
        at com.sleepycat.je.latch.LatchImpl.acquireExclusive(LatchImpl.java:67)
        at com.sleepycat.je.tree.IN.latch(IN.java:547)
        at com.sleepycat.je.dbi.CursorImpl.latchBIN(CursorImpl.java:402)
        at com.sleepycat.je.dbi.CursorImpl.cloneCursor(CursorImpl.java:230)
        at com.sleepycat.je.Cursor.beginMoveCursor(Cursor.java:5252)
        at com.sleepycat.je.Cursor.beginMoveCursor(Cursor.java:5259)
        at com.sleepycat.je.Cursor.retrieveNextNoDups(Cursor.java:3550)
        at com.sleepycat.je.Cursor.retrieveNext(Cursor.java:3312)
        at com.sleepycat.je.Cursor.getInternal(Cursor.java:1313)
        at com.sleepycat.je.Cursor.get(Cursor.java:1244)
        at com.sleepycat.je.Cursor.getNext(Cursor.java:1512)

after the query been sent to server and timeout exceeded other queries which worked before gets same response

Evaluation exceeded the configured 'evaluationTimeout' threshold of 30000 ms or evaluation was otherwise cancelled directly for request [g.V().limit(4).valueMap()]: null - try increasing the timeout with the :remote command

cbobed commented 3 years ago

I've found the same behaviour in 0.5.3 submitting scripts both from the console and from a connection. Once the server launches a timeout, it stops answering and tells you that it's always a timeout.

I can confirm that it happens with Berkeley + ES, versions 0.5.2 and 0.5.3 (when using Cassandra + ES in those versions, this doesn't happen).

Omig12 commented 3 years ago

I've found the same behaviour in 0.5.3 submitting scripts both from the console and from a connection. Once the server launches a timeout, it stops answering and tells you that it's always a timeout.

I can confirm that it happens with Berkeley + ES, versions 0.5.2 and 0.5.3 (when using Cassandra + ES in those versions, this doesn't happen).

We've encountered the same issue using the full release version 0.5.3 with the Cassandra + ES backend, connecting through a JavaScript Driver, a Python Driver, and a Gremlin.sh Groovy console.

mohamad-haddad-tribo commented 2 years ago

I faced same issue on 0.6 (latest) + Cassandra + ES.

Have no idea why, any update on how to overcome it? I had to remove all my datas then re-run the engine to get it working, does it mean that the data is corrupted?

farodin91 commented 2 years ago

@mohamad-haddad-tribo The exception from above clear comes from berkeley. I'm sure you did get a berkeley exception in cassandra setup.

javiramos1 commented 2 years ago

I'm also running into the same issue with Cassandra during data ingestion using concurrent inserts

jldevezas commented 2 years ago

I am using JanusGraph 0.6.0 and I confirm this is still an issue with BerkeleyDB. Once this error occurs, the server won't be able to recover from it. (P.S.: I know 0.6.1 has been released, but I was encountering issues with it, so I stick with 0.6.0).

delenius commented 1 year ago

Same for us on the in-memory backend :(

m-thirumal commented 11 months ago

The same issue with the Cassandra backend in 1.0.0

tien commented 6 months ago

I've just ran into this issue with version 1.0.0.

porunov commented 4 days ago

The PR #4425 has been reverted by the PR #4702 Thus, I'm re-opening this issue.