Executing debug.writeMemProfile() every 15 minutes for 2 months showed that, among the go tool pprof entries, the following are growing constantly, eventually causing OOM:
github.com/ethereum/go-ethereum/consensus/istanbul/validator.newDefaultSet
github.com/ethereum/go-ethereum/consensus/istanbul/validator.New (inline)
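For reference, this is roughly what the periodic profiling amounts to; the following is a minimal standalone sketch (file names and the 15-minute interval are illustrative, not part of GoQuorum), and the resulting files can be inspected with go tool pprof:

```go
// Minimal sketch: periodically dump a heap profile, analogous to calling
// debug.writeMemProfile() from the geth console every 15 minutes.
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	for i := 0; ; i++ {
		f, err := os.Create(fmt.Sprintf("heap-%03d.prof", i))
		if err != nil {
			panic(err)
		}
		runtime.GC() // get up-to-date allocation statistics before dumping
		if err := pprof.WriteHeapProfile(f); err != nil {
			panic(err)
		}
		f.Close()
		time.Sleep(15 * time.Minute)
	}
}
```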
validator.newDefaultSet calls policy.RegisterValidatorSet(), which adds a validator set to the validator set list called registry inside ProposerPolicy. It seems that this function is called at least once in the block production process.
It seems that the only way to clear this array is to call ClearRegistry(), but this is only called in Commit(), so non-validator nodes do not call this function and accumulate validator sets indefinitely.
I don't know why validator nodes are also accumulating these validator sets in the same manner, or why the sudden memory usage drops occur, because validator nodes call Commit() every block, so registry is expected to stay relatively small.
Hello, I am also experiencing the same phenomenon. In my case, the validator node's memory usage increased by 1.4GB over 30 days and the node eventually got killed with OOMKilled. I also noticed sudden memory usage drops about once every 1-2 weeks. The OOMKilled seems to have caused an inconsistency in the data, and the event reported in #1718 has occurred. It seems a patch has been created in another repository. Are there any plans to fix this phenomenon in the Quorum repository?
Also, go-ethereum had a memory leak that appears to have been fixed in v1.10.9. https://github.com/ethereum/go-ethereum/issues/23195
GoQuorum has included go-ethereum v1.10.3 in the v22.7.1 release. Are there any plans to include v1.10.9 or later in future releases?
Expected behaviour
When running a QBFT cluster, memory usage should stay within a moderate range as long as the cluster is not busy.
Actual behaviour
Memory usage of QBFT non-validator nodes grows over time at a rate of approximately 50MB/day if, for example, the cluster keeps producing empty blocks every second. Non-validator nodes are eventually killed by OOM as a result. I have experienced this with 2GB and 4GB nodes, and it took about 1 and 2 months respectively for the nodes to be killed by OOM.
OOM causes an unclean shutdown, which means the node loses intermediate state that has not been persisted to disk. The memory usage grows indefinitely even when the cluster is producing only empty blocks and almost nothing happens on the chain. In my case, all 8 non-validator nodes in the cluster showed the same behaviour.
Validator nodes, on the other hand, show a similar tendency, but several sudden memory usage drops have been observed (their frequency is neither regular nor predictable, but roughly once every 1-2 weeks).
As a result, for non-validator nodes, I must watch their memory usage closely, take them out of the load balancer, and restart them when memory usage gets high, to avoid OOM.
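As an illustration of that workaround, a hypothetical watchdog sketch is below. It assumes the node exposes the debug API over HTTP RPC (e.g. the debug namespace enabled); the endpoint URL and the threshold are placeholders, not recommended values:

```go
// Hypothetical watchdog: poll the node's debug_memStats RPC and flag the node
// when heap usage crosses a threshold, so it can be drained from the load
// balancer and restarted before the OOM killer hits it.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

const (
	endpoint  = "http://localhost:8545" // placeholder RPC endpoint
	threshold = 3 << 30                 // ~3 GiB, placeholder for a 4GB node
)

func heapAlloc() (uint64, error) {
	req := []byte(`{"jsonrpc":"2.0","method":"debug_memStats","params":[],"id":1}`)
	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(req))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	// runtime.MemStats is marshalled with its Go field names, e.g. HeapAlloc.
	var out struct {
		Result struct {
			HeapAlloc uint64 `json:"HeapAlloc"`
		} `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return 0, err
	}
	return out.Result.HeapAlloc, nil
}

func main() {
	for range time.Tick(time.Minute) {
		alloc, err := heapAlloc()
		if err != nil {
			fmt.Println("poll failed:", err)
			continue
		}
		if alloc > threshold {
			fmt.Printf("heap at %d bytes: drain node from load balancer and restart\n", alloc)
		}
	}
}
```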
Steps to reproduce the behaviour
Start a QBFT cluster with an arbitrary number of non-validator nodes, and let the cluster produce empty blocks. Memory usage of non-validator nodes grows indefinitely, causing OOM after some months.