FoundationDB / fdb-record-layer

A record-oriented store built on FoundationDB
Apache License 2.0
572 stars 102 forks source link

SizeStatisticsCollectorTest.tryStaleRead sometimes fails with "version not valid" #1431

Closed alecgrieser closed 2 years ago

alecgrieser commented 2 years ago

This has happened in I think two PRBs now, one for #1399 and one for #1373. In those cases, the SizeStatisticsCollectorTest has failed with a "version not valid" error and the following stack trace:

SizeStatisticsCollectorTest > tryStaleRead() FAILED
    com.apple.foundationdb.record.provider.foundationdb.FDBExceptions$FDBStoreException: Version not valid
        at com.apple.foundationdb.record.provider.foundationdb.FDBExceptions.wrapException(FDBExceptions.java:189)
        at com.apple.foundationdb.record.provider.foundationdb.FDBDatabase.lambda$new$0(FDBDatabase.java:169)
        at com.apple.foundationdb.record.provider.foundationdb.FDBDatabase.mapAsyncToSyncException(FDBDatabase.java:1058)
        at com.apple.foundationdb.record.provider.foundationdb.SizeStatisticsCollectorTest$SizeStatisticsCollector.lambda$collectAsync$1(SizeStatisticsCollectorTest.java:358)
        at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
        at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
        at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
        at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
        at com.apple.foundationdb.async.AsyncUtil$LoopPartial.apply(AsyncUtil.java:346)
        at com.apple.foundationdb.async.AsyncUtil$LoopPartial.apply(AsyncUtil.java:332)
        at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
        at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
        at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
        at com.apple.foundationdb.async.TaskNotifyingExecutor$Notifier.run(TaskNotifyingExecutor.java:77)
        at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
        at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)

        Caused by:
        com.apple.foundationdb.FDBException: Version not valid
            at com.apple.foundationdb.NativeFuture.Future_getError(Native Method)
            at com.apple.foundationdb.FutureResults.getIfDone_internal(FutureResults.java:47)
            at com.apple.foundationdb.FutureResults.getIfDone_internal(FutureResults.java:28)
            at com.apple.foundationdb.NativeFuture.marshalWhenDone(NativeFuture.java:63)
            ... 7 more

This appears to happen intermittently, so it will take some investigation to figure out what's going on here

MMcM commented 2 years ago

Does that error indicate that commitVersion - 101_000_000 was negative?

alecgrieser commented 2 years ago

That does seem like a fairly plausible explanation.... The clearest place in the FDB codebase that I could find version_invalid being thrown was in a check to make sure the version passed to setVersion is non-negative:

https://github.com/apple/foundationdb/blob/e54357034339be79834330606004ee73332e64ff/fdbclient/NativeAPI.actor.cpp#L4262

alecgrieser commented 2 years ago

I posted a PR that changes the test to check if the version is in the right range (and skips the test if it's not): #1434. That being said, it's possible a more nuanced solution that tries to mock some more things and inject an error would be better, though it's a bit difficult to insert that mocking logic into the test as it's currently implemented. But this would at least stop the test from causing builds to fail.