Closed by abizjak 3 years ago
I think it's generally true that outside of the write lock, calling bpParent is not safe, since a live block may become dead. In particular, the following uses may not be safe:
Concordium.Queries.getBranches
Concordium.Queries.getBlockInfo
Concordium.Queries.getAncestors
Concordium.Skov.Query.leavesBranches
These should be made safe. For getBranches and getBlockInfo, only the hash is required, so we could add a safe function for getting the parent hash. For getAncestors, we could have a bpParentSafe that returns a Maybe, and simply conclude that the block is dead when we get Nothing. leavesBranches can also probably rely on just the hashes.
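A minimal sketch of the bpParentSafe idea, on a toy model rather than the actual Concordium types. The BlockPointer record, its fields, and getAncestorsSafe are hypothetical and only illustrate the shape of the change: the parent hash is stored by value and is always safe to read, while the parent pointer itself may be gone.

```haskell
import Data.IORef

-- Hypothetical, simplified types; these are NOT the real Concordium definitions.
newtype BlockHash = BlockHash Int deriving (Eq, Show)

data BlockPointer = BlockPointer
    { bpHash       :: BlockHash
    , bpParentHash :: BlockHash                  -- stored by value, always safe to read
    , bpParentRef  :: IORef (Maybe BlockPointer) -- set to Nothing once the parent is dropped
    }

-- Build a block; the genesis block is modelled with no parent pointer.
mkBlock :: Int -> BlockHash -> Maybe BlockPointer -> IO BlockPointer
mkBlock h parentHash parent = BlockPointer (BlockHash h) parentHash <$> newIORef parent

-- Safe variant: Nothing means the parent is no longer available.
bpParentSafe :: BlockPointer -> IO (Maybe BlockPointer)
bpParentSafe = readIORef . bpParentRef

-- Collect up to n ancestor hashes, stopping (instead of crashing)
-- as soon as a parent link turns out to be missing.
getAncestorsSafe :: Int -> BlockPointer -> IO [BlockHash]
getAncestorsSafe n bp
    | n <= 0 = pure []
    | otherwise = do
        mParent <- bpParentSafe bp
        case mParent of
            Nothing -> pure []
            Just p  -> (bpHash p :) <$> getAncestorsSafe (n - 1) p
```

In this model, queries that only need the parent hash (as suggested for getBranches and getBlockInfo) would read bpParentHash and never touch the pointer at all.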
While I agree that bpParent in its current incarnation is not safe and that changes should be made, I am not so sure that all the uses you list are unsafe, since (1) each of those queries takes a snapshot of SkovPersistentData, which in principle should be consistent, (2) the LMDB database only ever has data written to it, never removed, and (3) finalized blocks are never rolled back.
But whether this is the case depends a bit on the semantics of IORef, which seem hard to pin down.
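To make point (1) concrete, here is a small sketch of the snapshot argument, using a hypothetical SkovState record in place of the real SkovPersistentData. It shows only that an immutable value obtained with readIORef is not affected by later writes to the same reference. It does not settle the cross-thread memory-ordering question, and it says nothing about mutable data reachable from the snapshot.

```haskell
import Data.IORef

-- Hypothetical stand-in for the node state; not the real SkovPersistentData.
data SkovState = SkovState
    { lastFinalizedHeight :: Int
    , liveBlockHeights    :: [Int]
    } deriving (Eq, Show)

-- Take a snapshot, let a "writer" update the shared reference,
-- and return both the snapshot and the new current state.
snapshotThenUpdate :: IORef SkovState -> IO (SkovState, SkovState)
snapshotThenUpdate ref = do
    snapshot <- readIORef ref
    -- Simulated writer: finalize one more block and drop it from the live set.
    atomicModifyIORef' ref $ \s ->
        ( s { lastFinalizedHeight = lastFinalizedHeight s + 1
            , liveBlockHeights    = drop 1 (liveBlockHeights s)
            }
        , () )
    current <- readIORef ref
    pure (snapshot, current)
```

The snapshot keeps describing the state as it was when it was read, so a query that works purely from it sees a consistent view even if the writer makes progress in the meantime.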
I ran a testnet node overnight while querying the node by increasing block height (in batches of 10), making 10 requests in parallel as quickly as possible, while the node was catching up (on testnet).
No failed queries and no crashes were observed.
The node was running with -N8 for the Haskell runtime flags and average CPU usage was 300%-400%.
So it seems that the query part is pretty stable.
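For anyone who wants to repeat this kind of stress test, the loop can be sketched roughly as follows. The query function is a hypothetical parameter (the comment above does not say which client or endpoint was used), and the helper names are made up for illustration.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException, try)

-- Run an action and capture any exception instead of letting it escape.
tryAny :: IO a -> IO (Either SomeException a)
tryAny = try

-- Issue one query per height concurrently and wait for all of them.
runBatch :: (Int -> IO a) -> [Int] -> IO [Either SomeException a]
runBatch query heights = do
    vars <- mapM spawn heights
    mapM takeMVar vars
  where
    spawn h = do
        var <- newEmptyMVar
        _ <- forkIO (tryAny (query h) >>= putMVar var)
        pure var

-- Walk the heights in increasing order, batchSize at a time,
-- and count how many queries failed with an exception.
countFailures :: Int -> (Int -> IO a) -> [Int] -> IO Int
countFailures batchSize query = go 0
  where
    size = max 1 batchSize -- guard against a non-positive batch size
    go acc [] = pure acc
    go acc hs = do
        let (batch, rest) = splitAt size hs
        results <- runBatch query batch
        let failed = length [() | Left _ <- results]
        go (acc + failed) rest
```

With a batch size of 10 this gives 10 requests in flight at a time. A result of 0 corresponds to the "no failed queries" observation, although a crash of the node itself would have to be detected separately.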
Bug Description
(this is on testnet during normal "start node and catch up")
Steps to Reproduce
This bug is not reliably reproducible. I have not seen it on my workstation at work, but trying to catch up on this machine (at home) triggers it somewhat consistently (after a few minutes of catchup, depending on what peers I get).
The bug seems to depend on
Expected Result
The node does not crash.
Actual Result
The node sometimes crashes during catchup.
Versions