JanusGraph / janusgraph

JanusGraph: an open-source, distributed graph database
https://janusgraph.org
Other
5.26k stars 1.16k forks source link

fix: handle BerkeleyJE DB interruption [tp-tests] #4425

Closed tien closed 3 months ago

tien commented 4 months ago

Currently, any interruption to BerkeleyJE DB will cause Janusgraph to stop working entirely & require a complete restart of Janusgraph. This PR will make it so interruption to BerkeleyJE DB will close & reopen the Berkeley's environment instead.

Fixes #2120

linux-foundation-easycla[bot] commented 4 months ago

CLA Signed

The committers listed above are authorized under a signed CLA.

tien commented 4 months ago

@porunov @li-boxuan struggling on how to test this, the test that I copied from #3990 pass but when I check it actually doesn't do anything.

tien commented 4 months ago

Update: I was able to write a test to consistently trigger the error n prove that my change should work.

tien commented 4 months ago

Have made a change to only re/initialize on interruption.

tien commented 4 months ago

Oops, forgot to enable my test since it works now, have done so.

tien commented 4 months ago

@li-boxuan @porunov all tests passed, can you guys take a look 👀? Keen to get this merged soon 💪.

li-boxuan commented 4 months ago

I am sorry but I seem to have missed some context. Do you know why @mad 's previous PR didn't pass the test, and what did you do differently to solve the problem?

tien commented 4 months ago

I am sorry but I seem to have missed some context. Do you know why @mad 's previous PR didn't pass the test, and what did you do differently to solve the problem?

@li-boxuan I'm not sure why that one didn't pass the test. I'm using a completely different approach in this PR that involves a lot less changes.

mad commented 4 months ago

Thanks for your contribution!

For proper testing, you need to disable test filtering here, here and here

To be completely correct, you still need to enabld this (supportsInterruption)

tien commented 4 months ago

Thank you @mad, I have committed the suggested changes.

tien commented 4 months ago

Thanks for your contribution!

For proper testing, you need to disable test filtering here, here and here

To be completely correct, you still need to enabld this (supportsInterruption)

@mad Look like doing this will actually fail the tests.

I think it's because BerkeleyJE DB technically still does not support interruption, as in it will throw an error upon being interrupted. We are only reopening the environment here so that future requests won't all fail.

tien commented 4 months ago

@mad @li-boxuan I have implemented the interruption retry logic.

tien commented 4 months ago

tp-test are failing, is there anyway I can run n debug these locally?

li-boxuan commented 4 months ago

is there anyway I can run n debug these locally?

I never figured out a great way to debug tp-tests. Others might be able to help. The way I do is to find the corresponding test suite in tinkerpop repository, and run all the tests from there. An adhoc way is to copy that test suite class (.java) file to janusgraph repo on your local machine, and run tests from there. This way, you can remove other unrelated test classes...

tien commented 4 months ago

I've integrated some of the other changes from @mad, so the interruption test suite should pass now. cc @mad @li-boxuan

tien commented 4 months ago

Hmm, look like there was one last flaky test in the interruption test.

This test parameter in the TraversalInterruptionTest & TraversalInterruptionComputerTest is failing sometime locally for me, which look like it did fail on the CI:

{"g_E_properties", (Function<GraphTraversalSource, GraphTraversal<?,?>>) g -> g.E(), (UnaryOperator<GraphTraversal<?,?>>) t -> t.properties()}

Will investigate, getting close, just 1 case left that need to pass 💪.

tien commented 4 months ago

Okay, I've made further changes, after which I haven't been able to reproduce any failure when running the test locally. Hopefully that means the flakiness has gone away 😅.

tien commented 4 months ago

Oops, forgot to check other tests beside interruption, made I mistake, have pushed a fix.

tien commented 4 months ago

Have made the best effort implementation, so BerkeleyJE DB environment will be reset after interruption & also that race condition around thread are covered so that the interruption test from tp-test will always works.

tien commented 4 months ago

@li-boxuan @porunov all tests passed :D

There's the coverage check, but I don't think I can get it any higher though, cuz a lot of the uncovered code are catch clauses. Which are impossible to trigger consistently.

li-boxuan commented 4 months ago

Codecov sometimes seems buggy and we don't know why. Don't worry about that. Congrats on passing all tests! @mad would you be able to take another look?

tien commented 4 months ago

Awesome, thanks for all the helps and supports guys 🙏

li-boxuan commented 3 months ago

I am gonna merge this following lazy merge consensus. Thank you @tien for fixing this!

janusgraph-automations commented 3 months ago

💚 All backports created successfully

Status Branch Result
v1.0

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation and see the Github Action logs for details

porunov commented 3 months ago

After merging this PR tp-tests are not failing in master branch for berkleyje.

Error:  Errors: 
Error:  org.apache.tinkerpop.gremlin.process.traversal.TraversalInterruptionTest.shouldRespectThreadInterruptionInVertexStep[expectInterruption(g_V_outE)]
[INFO]   Run 1: PASS
Error:    Run 2: TraversalInterruptionTest>AbstractGremlinTest.tearDown:155 » LockTimeout (JE 18.3.12) Lock expired. Locker 1229014061 2150_main_Txn: waited for lock on database=_jeNameMap LockAddr:617727493 LSN=0x0/0x1167 type=WRITE grant=WAIT_NEW timeoutMillis=500 startTime=1715693374163 endTime=1715693374663

I will try to re-run those tests again. If they don't pass we should revert this commit.

tien commented 3 months ago

After merging this PR tp-tests are not failing in master branch for berkleyje.

Error:  Errors: 
Error:  org.apache.tinkerpop.gremlin.process.traversal.TraversalInterruptionTest.shouldRespectThreadInterruptionInVertexStep[expectInterruption(g_V_outE)]
[INFO]   Run 1: PASS
Error:    Run 2: TraversalInterruptionTest>AbstractGremlinTest.tearDown:155 » LockTimeout (JE 18.3.12) Lock expired. Locker 1229014061 2150_main_Txn: waited for lock on database=_jeNameMap LockAddr:617727493 LSN=0x0/0x1167 type=WRITE grant=WAIT_NEW timeoutMillis=500 startTime=1715693374163 endTime=1715693374663

I will try to re-run those tests again. If they don't pass we should revert this commit.

I'm pretty sure this is just a flaky error due to Berkeley DB can't handle many concurrent reads/writes, which now happen cuz we no longer exclude it from some of the tests.

Like I can see that this pass https://github.com/JanusGraph/janusgraph/actions/runs/9075171201/job/24965523610 But this run got the time-out error https://github.com/JanusGraph/janusgraph/actions/runs/9075171201/job/24965524860

Look like a whole other issue: #1623

porunov commented 3 months ago

After merging this PR tp-tests are not failing in master branch for berkleyje.

Error:  Errors: 
Error:  org.apache.tinkerpop.gremlin.process.traversal.TraversalInterruptionTest.shouldRespectThreadInterruptionInVertexStep[expectInterruption(g_V_outE)]
[INFO]   Run 1: PASS
Error:    Run 2: TraversalInterruptionTest>AbstractGremlinTest.tearDown:155 » LockTimeout (JE 18.3.12) Lock expired. Locker 1229014061 2150_main_Txn: waited for lock on database=_jeNameMap LockAddr:617727493 LSN=0x0/0x1167 type=WRITE grant=WAIT_NEW timeoutMillis=500 startTime=1715693374163 endTime=1715693374663

I will try to re-run those tests again. If they don't pass we should revert this commit.

I'm pretty sure this is just a flaky error due to Berkeley DB can't handle many concurrent reads/writes, which now happen cuz we no longer exclude it from some of the tests.

Like I can see that this pass https://github.com/JanusGraph/janusgraph/actions/runs/9075171201/job/24965523610 But this run got the time-out error https://github.com/JanusGraph/janusgraph/actions/runs/9075171201/job/24965524860

Look like a whole other issue: #1623

Yes, tests for berkeleyje with Java 8 passed after restart, but failed again for Java 11. I restarted the job again for Java 11. We should probably increase those timeouts as I see timeoutMillis=500 which is probably isn't enough for berkeleyje in GitHub Actions sometimes.

tien commented 3 months ago

@porunov I can see that it failed again :(

It's like a flaky race condition where the test try to create an entirely new Berkeley DB graph when the previous one is still invalid/re-intializing (Probably due to the lock issue). This isn't handled by this PR: the BerkeleyDB store doesn't have control over other instances created before or after it.

This won't happen in production because you'll most likely & should have only one Janusgraph with Berkeley DB instance up and running.

Maybe we should just continue what we did before and disable this particular test for BerkeleyDB?

Or push on with the flaky test, It has always pass for me when I run this locally :p

tien commented 3 months ago

Though like you said @porunov increasing the lock time-out might just make this problem go away :p