JanusGraph / janusgraph-foundationdb

FoundationDB storage adapter for JanusGraph
Other
53 stars 18 forks source link

Closing JanusGraph intermittently gets into an endless loop if there is an FDB error #59

Open davisdk opened 3 years ago

davisdk commented 3 years ago

We have been using this adapter with JanusGraph for quite a while and have not seen issues, until we started encountering this. We have a job that usually succeeds and terminates, but occasionally runs forever with this error output over and over:

[pool-37-thread-1] WARN org.janusgraph.diskstorage.log.kcvs.KCVSLog - Could not read messages for timestamp [2021-05-11T02:21:21.872Z] (this read will be retried)
org.janusgraph.core.JanusGraphException: Could not execute operation due to backend exception
    at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:56)
    at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:158)
    at org.janusgraph.diskstorage.log.kcvs.KCVSLog$MessagePuller.run(KCVSLog.java:725)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.janusgraph.diskstorage.PermanentBackendException: Permanent exception while executing backend operation messageReading@0:0
    at org.janusgraph.diskstorage.util.BackendOperation.executeDirect(BackendOperation.java:81)
    at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:54)
    ... 9 more
Caused by: java.util.concurrent.CompletionException: com.apple.foundationdb.FDBException: Broken promise
    at java.util.concurrent.CompletableFuture.reportJoin(CompletableFuture.java:375)
    at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
    at com.apple.foundationdb.RangeQuery$AsyncRangeIterator.hasNext(RangeQuery.java:250)
    at org.janusgraph.diskstorage.foundationdb.FoundationDBRecordIterator.fetchNext(FoundationDBRecordIterator.java:64)
    at org.janusgraph.diskstorage.foundationdb.FoundationDBRecordAsyncIterator.fetchNext(FoundationDBRecordAsyncIterator.java:66)
    at org.janusgraph.diskstorage.foundationdb.FoundationDBRecordIterator.hasNext(FoundationDBRecordIterator.java:47)
    at org.janusgraph.diskstorage.util.StaticArrayEntryList.of(StaticArrayEntryList.java:390)
    at org.janusgraph.diskstorage.util.StaticArrayEntryList.ofStaticBuffer(StaticArrayEntryList.java:345)
    at org.janusgraph.diskstorage.keycolumnvalue.keyvalue.OrderedKeyValueStoreAdapter.convert(OrderedKeyValueStoreAdapter.java:145)
    at org.janusgraph.diskstorage.keycolumnvalue.keyvalue.OrderedKeyValueStoreAdapter.getSlice(OrderedKeyValueStoreAdapter.java:69)
    at org.janusgraph.diskstorage.log.kcvs.KCVSLog$MessagePuller$1.call(KCVSLog.java:791)
    at org.janusgraph.diskstorage.log.kcvs.KCVSLog$MessagePuller$1.call(KCVSLog.java:788)
    at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:147)
    at org.janusgraph.diskstorage.util.BackendOperation$1.call(BackendOperation.java:161)
    at org.janusgraph.diskstorage.util.BackendOperation.executeDirect(BackendOperation.java:68)
    ... 10 more
Caused by: com.apple.foundationdb.FDBException: Broken promise
    at com.apple.foundationdb.NativeFuture.Future_getError(Native Method)
    at com.apple.foundationdb.FutureResults.getIfDone_internal(FutureResults.java:38)
    at com.apple.foundationdb.FutureResults.getIfDone_internal(FutureResults.java:25)
    at com.apple.foundationdb.NativeFuture.marshalWhenDone(NativeFuture.java:63)
    ... 3 more

This happens after our class performs a System.exit. In succeeded runs, the StandardJanusGraph closes all the instances of the adapter successfully, and in this case it closes all but one.

We have tried quite a few things with no success, so wanted to reach out here to see if anyone else has seen it or has any insights. With isolation level read_committed_with_write, the adapter retries only three times as expected, but then this error is thrown and it starts again.

Our current Janus storage settings are:

storage.backend=org.janusgraph.diskstorage.foundationdb.FoundationDBStoreManager
storage.lock.retries=5
storage.lock.wait-time=250
storage.fdb.cluster-file-path=/etc/foundationdb/fdb.cluster
storage.fdb.isolation-level=serializable
storage.fdb.get-range-mode=iterator

and we are using JanusGraph 0.5.3 and FDB 6.2.22. We also tried isolation-level read_committed_with_write and range mode list.

rngcntr commented 3 years ago

This sounds to me like if FDB reacts to System.exit by throwing an exception which would otherwise be thrown for retriable operations. But it's been a while since I worked with FDB so I can't say anything for sure...