Concordium / concordium-node

The main concordium node implementation.
GNU Affero General Public License v3.0

The node does not shut down gracefully when queries are in flight #183

Closed abizjak closed 2 years ago

abizjak commented 3 years ago

Bug Description

If queries are active when the node receives the signal to shut down (SIGINT, to be precise), it generally terminates with a segmentation fault.

2021-10-04T19:02:58.078279751Z: DEBUG: Drained the Consensus outbound low priority queue for 1 element(s)
2021-10-04T19:02:58.078294841Z: DEBUG: Drained the Consensus outbound high priority queue for 0 element(s)
2021-10-04T19:02:58.078299221Z: DEBUG: Drained the Consensus inbound low priority queue for 1 element(s)
2021-10-04T19:02:58.078302781Z: DEBUG: Drained the Consensus inbound high priority queue for 0 element(s)
concordium-node: concordium-node: getBlockSummary: interruptedconcordium-node: getBlockSummary: interrupted
getBlockSummary: interrupted
concordium-node: getBlockSummary: interrupted

concordium-node: getBlockSummary: interrupted
FATAL: exception not rethrown
FATAL: exception not rethrown
FATAL: exception not rethrown
FATAL: exception not rethrown
FATAL: exception not rethrown
[1]    467136 abort (core dumped)  cargo run --release -- --bootstrap-node bootstrap.testnet.concordium.com:8888

The reason the "interrupted" messages are there is that the Haskell runtime is shut down (via hs_exit) while there are active Haskell computations (the queries). This is fine, albeit not pretty.

The segmentation fault seems to happen because some Haskell functions are called (some queries) after hs_exit is called, which is a violation.

To fix this, we need to delay shutting down the Haskell runtime until after the RPC server has been shut down (if it was started in the first place).
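The ordering proposed above can be sketched roughly as follows, using only the Rust standard library. This is an illustration under stated assumptions, not the node's actual code: `Runtime`, `query`, `exit`, and `shutdown_in_order` are hypothetical stand-ins (the real node reaches the Haskell runtime over FFI and shuts it down via hs_exit), and the worker threads stand in for RPC handlers.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the Haskell runtime.
struct Runtime {
    alive: AtomicBool,
    /// Counts calls made after `exit`, i.e. the violation described above.
    calls_after_exit: AtomicUsize,
}

impl Runtime {
    fn new() -> Self {
        Runtime {
            alive: AtomicBool::new(true),
            calls_after_exit: AtomicUsize::new(0),
        }
    }

    /// Hypothetical stand-in for a slow query such as getBlockSummary.
    fn query(&self) {
        if !self.alive.load(Ordering::SeqCst) {
            // In the real node this would be a call into a dead runtime.
            self.calls_after_exit.fetch_add(1, Ordering::SeqCst);
            return;
        }
        thread::sleep(Duration::from_millis(20));
    }

    /// Hypothetical stand-in for hs_exit.
    fn exit(&self) {
        self.alive.store(false, Ordering::SeqCst);
    }
}

/// Runs the shutdown sequence and returns how many calls hit the runtime
/// after it was shut down.
fn shutdown_in_order() -> usize {
    let rt = Arc::new(Runtime::new());
    let stop = Arc::new(AtomicBool::new(false));

    // Worker threads play the role of RPC handlers serving queries.
    let workers: Vec<_> = (0..4)
        .map(|_| {
            let rt = Arc::clone(&rt);
            let stop = Arc::clone(&stop);
            thread::spawn(move || {
                while !stop.load(Ordering::SeqCst) {
                    rt.query();
                }
            })
        })
        .collect();

    // Let some queries get in flight before the "signal" arrives.
    thread::sleep(Duration::from_millis(50));

    // 1. Stop the RPC side from starting new queries.
    stop.store(true, Ordering::SeqCst);
    // 2. Wait for in-flight queries to finish.
    for w in workers {
        w.join().expect("worker thread panicked");
    }
    // 3. Only now shut down the runtime.
    rt.exit();

    rt.calls_after_exit.load(Ordering::SeqCst)
}

fn main() {
    let late = shutdown_in_order();
    println!("calls after runtime exit: {}", late);
}
```

Because every worker is joined before `exit` is called, no query can observe a dead runtime in this sketch. Swapping steps 2 and 3 reintroduces the race the issue describes.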

Steps to Reproduce

1. Run a node.
2. Make queries against the node, e.g., via concordium-client or similar.
3. Shut down the node while a query is in progress. This is easiest to achieve with a slow query such as block summary.

Expected Result

Queries are cancelled or completed, and the node shuts down normally.

Actual Result

The node triggers a segmentation fault in most cases.

Versions

abizjak commented 3 years ago

It is not only queries: the same happens when a node is interrupted while it is receiving transactions.