Concordium / concordium-node

The main concordium node implementation.
GNU Affero General Public License v3.0
45 stars 22 forks source link

Block parent is sometimes deemed missing #179

Closed abizjak closed 3 years ago

abizjak commented 3 years ago

Bug Description

2021-10-03T12:23:07.861279649Z: INFO: Skov: Finalization statistics: finalizationCount=26585 lastFinalizedTime=1.6332637878612645e9 finalizationPeriodEMA=Just 1.4626803686410766e-2 finalizationPeriodEMSD=Just 4.948750310804371e-2
2021-10-03T12:23:07.861645200Z: DEBUG: Skov: Block 5533cb484c6ffa4a09060dad06aeac8a50fa30cb04fd4ae901fba8b30684764b marked finalized
2021-10-03T12:23:07.861660720Z: ERROR: TreeState: Failed to find block d3b55ae6d0fd21e889fa3d6798da704324d355dfa54d64c3e168b7f6a3a1f5ee
2021-10-03T12:23:07.861667390Z: ERROR: TreeState: Block status: Just (BlockAlive d3b55ae6d0fd21e889fa3d6798da704324d355dfa54d64c3e168b7f6a3a1f5ee)
2021-10-03T12:23:07.861672100Z: ERROR: TreeState: Couldn't find parent block in disk
concordium-node: user error (Couldn't find parent block in disk)
concordium-node: warning: too many hs_exit()s

(this is on testnet during normal "start node and catch up")

Steps to Reproduce

This bug is not reliably reproducible. I have not seen it on my workstation at work, but trying to catch up on this machine (at home) triggers it somewhat consistently (after a few minutes of catchup, depending on what peers I get).

The bug seems to depend on

  1. the vagaries of the Haskell runtime and when it decides to run finalizers for weak pointers. We seem to be using them incorrectly since we have weak pointers to compound types, and documentation explicitly states that
    WARNING: weak pointers to ordinary non-primitive Haskell types are particularly fragile, because the compiler is free to optimise away or duplicate the underlying data structure. Therefore attempting to place a finalizer on an ordinary Haskell type may well result in the finalizer running earlier than you expected. This is not a problem for caches and memo tables where early finalization is benign.
  2. the order in which blocks are marked as finalized. If multiple blocks are marked as finalized at the same time then they are marked so from most recent backwards. (See #135 as well). This means that we can be in a situation where a parent of a block that is already marked as finalized is not yet finalized. Combined with (1) and the fact that in https://github.com/Concordium/concordium-node/blob/5204e08188d1c0cf1954f1c3cc551d45bcb3fca5/concordium-consensus/src/Concordium/GlobalState/Persistent/TreeState.hs#L425 we only consider the cases where the weak pointer is not null, or the block is finalized, means we sometimes try to look up a live, but non-finalized block while trying to get the block parent.

Expected Result

The node does not crash.

Actual Result

The node sometimes crashes during catchup.

Versions

td202 commented 3 years ago

I think it's generally true that outside of the write lock, calling bpParent is not safe, since a live block may become dead. In particular, the following uses may not be safe:

These should be made safe. For getBranches and getBlockInfo, only the hash is required, so we could add a safe function for getting the parent hash. For getAncestors, we could have a bpParentSafe that returns a Maybe, and simply conclude that the block is dead when we get Nothing. leavesBranches can also probably rely on just the hashes.

abizjak commented 3 years ago

While I agree that bpParent in the current incarnation is not safe and changes should be made, I am not so sure all uses you list are unsafe since (1) execution of each of those queries takes a snapshot of SkovPersistentData which in principle should be consistent and (2) LMDB database only ever has data written to, never removed and (3) finalized blocks are never rolled back.

But whether this is the case depends a bit on the semantics of IORef which seems hard to pin down.

abizjak commented 3 years ago

I ran a testnet node overnight while querying

making 10 requests in parallel as quickly as possible (the node by increasing block height (in bathes of 10) while the node was catching up (on testnet).

No failed queries and no crashes were observed.

The node was running with -N8 for the Haskell runtime flags and average CPU usage was 300%-400%.

So it seems that the query part is pretty stable.