JanusGraph / janusgraph-foundationdb

FoundationDB storage adapter for JanusGraph
Other
53 stars 18 forks source link

Fix remove bug in FoundationDBTx.getMultiRange() #15

Closed rngcntr closed 4 years ago

rngcntr commented 4 years ago

This bug caused queries to be retried even if they were successful. Fixes #14

mbrukman commented 4 years ago

Tests failed on master after the merge of this PR (they seem to be hanging?).

@rngcntr — could you please take a look into this?

FWIW, future PRs (and pushes to existing PRs) will all trigger Travis, but this PR was opened before we had Travis CI integration, so no tests automatically ran for it.

twilmes commented 4 years ago

My bad, I overlooked that the Travis build hadn't run on this. @rngcntr I took a quick look and don't have a solution at the moment but have attached a thread dump I took that shows txCtr.incrementAndGet(); calls in FoundationDBTx.restart() as blocked. It may be a clue. jgfdb-stack-dump.txt

mbrukman commented 4 years ago

If this isn't a quick fix, would it be OK to revert this PR for now to get the build back to a green state, and reapply it with the fix, once it's discovered? We have a number of other PRs that have been waiting for us to add Travis CI before we would review them, so it would be nice to get back to a green build state ASAP.

rngcntr commented 4 years ago

@mbrukman Thanks for pointing that out. I will probably have a closer look into it on monday or tuesday. It's absolutely OK to revert the PR as long as it's not working as intended ;)

rngcntr commented 4 years ago

I checked out the current master branch with this PR already merged and ran a quick mvn verify:

Tests run: 252, Failures: 0, Errors: 0, Skipped: 26

So everything looks ok locally. I suspect there's something non-deterministic. What I don't know is whether or not this PR actually has an influence on that. Edit: Ran the tests three times without any issues

twilmes commented 4 years ago

Yeah, I think you're right @rngcntr, looking closer at the thread dump, there is a race condition that can occur. In the case I'm seeing, getMultiRange (which is synchronized) is being called which is locking the transaction. Then, a transaction reset is being triggered, likely by the time limit being exceeded by the FDB driver, on that same tx which triggers a call to restart(), which is also synchronized. It can't get the lock so it's stuck and then that blocks everything. I do not believe this is related to your commits so I'm entering a new issue.