diennea / herddb

A JVM-embeddable Distributed Database
https://herddb.org
Apache License 2.0
316 stars 46 forks source link

BRIN fix load/unload with locks, fixes #820 #821

Closed diegosalvi closed 8 months ago

diegosalvi commented 9 months ago

This PR fixes #820 Changes load/unload BRIN pages deferring unload until the unlock of currently handled BRIN block, doing so we don't keep current block locked until other block has been fully unload (that needs a lock too). Changed block loading during blocks merge too: now we don't load in the page memory blocks that aren't needed anymore because the will deleted at merge end. The neat result is that there is one more page in thread memory that will discarded/cleaned at the end.

Following this checklist to help us incorporate your contribution quickly and easily:

To make clear that you license your contribution under the Apache License Version 2.0, January 2004 you have to acknowledge this by using the following check-box.

eolivelli commented 9 months ago

As we have a trace for the deadlock, would it be possible to add a test case the reproduces the problem ?

diegosalvi commented 9 months ago

@eolivelli it had my first concern but I haven't been able do reproduce in a test 'till now

diegosalvi commented 9 months ago

Do not integrate! I found some strange behaviours while attempting to reproduce the deadlock.

diegosalvi commented 8 months ago

Added a testcase that reproduced the deadlock

diegosalvi commented 8 months ago

Now can be merged

eolivelli commented 8 months ago

The patch is not trivial, I will take a look this week

Thanks

diegosalvi commented 8 months ago

@eolivelli Yes I had to change many things to make it working. There were many bugs in page and lock handling. Biggest changes are:

  1. Fix load/unload deadlock. When loading a page the page that should be unloaded (due to page replacement policy) isn't unloaded as fast as possible but only when lock are released. This avoid some "random" deadlock on load&unload. The net effect is that we have one (possibly full) page in memory for the current thread. This can be improved adding a "try unload" mechanic to pages (so... try unload if you can take the lock, abort unload and signal otherwise) but it should be supported by all pages types (brin, blink and datapages): the patch would be too big. We'll reserve the improvement for the future.
  2. Fix phantom node modifications. Blocks now have a new flag "detached". It is set only during merging to signal that a node block is detached from the tree. We need to have this information somehow due to concurrent search, add and delete: they keep a pointer to a block that can be merged and removed from the tree between a concurrent node find and node lock. Without such information we could remove or add data to a dead node. We could avoid the flag doing a lookup on the tree after lock checking if the node is still there but it is much more expensive. When a concurrent search, add or delete land to a detached node needs to restart his work from the previous node (data is always merged with previous blocks), we don't have a direct pointer and handling it is too expensive: in such cases (they arise only with concurrent merges/checkpoints) a new search will be started with the current key to continue the operation
  3. Avoid deleted node checkpoint. During merge we checkpointed merged node too. It isn't really necessary: they are dead node that are discarded from the tree, they aren't needed anymore, no need to checkpoint them. To avoid load&unload bugs (the page cannot be reloaded because the page on disk now represent an old snapshot) we keep the page in thread stack memory, this isn't a problem: the sum of all pages deleted per merge sum at a full page at maximum and the will be discarded as soon as all page references are lost (concurrent lookup handle the detached page and skip it)
  4. Fix add/delete/search with checkpoint deadlock. Actually is more search with checkpoint deadlock because we don't execute add or deletes concurrently with checkpoints but one change fixes them all. Checkpoint procedure order has been reversed. Before the patch we started from the tail looking for nodes to merge, now we start from the head. Metadata are still written in reverse order to be back compatible with existing metadata (and we avoid to rewrite the recover procedure too, procedure that is simpler written in a reverse order). Reversing checkpoint procedure order, every lock on the tree is taken in the same direction thus avoiding deadlock.
eolivelli commented 8 months ago

it would be great to run the long running performance tests with this change before cutting a release. nasty things may happen in the long run

diegosalvi commented 8 months ago

I did a first quick look.

I am surprised that we are touching so many method but we are adding only one test case. Is there any corner case that we are touching that is not covered by tests ?

I am asking because we are touching a critical part of the system

There is only one test case to force the old deadlock situation. Other cases are covered by existing tests (and failed many times during development). The biggest change, the checkpoint order change, is covered by a whole test suite about chekpointing (and testing metadata ordering and produced tree) and restoring/restarting.

eolivelli commented 8 months ago

@dmercuriali @aluccaroni @diegosalvi we can merge this patch. My point is about not cutting a release before running appropriate load testing.

diegosalvi commented 8 months ago

Currently I'm running a test instance with some load, let's see in some days if it doesn't result in some errors

diegosalvi commented 8 months ago

With @hamadodene we have run a working instance for a week with load without any issue.

eolivelli commented 8 months ago

Great work!