Consensys / quorum

A permissioned implementation of Ethereum supporting data privacy
https://www.goquorum.com/
GNU Lesser General Public License v3.0

Sudden shutdown of Quorum container with Killed statement #1580

Open Purbaja opened 1 year ago

Purbaja commented 1 year ago

Our Quorum container, which had been running in production for more than a year, suddenly exited with a `Killed` statement. The VM was not restarted. Both the Tessera and Quorum Docker containers were running on that VM, and only the Quorum container exited, with the error below:

```
2022-12-06 07:02:46.619203 I | rafthttp: peer 1 became inactive (message send to peer failed)
2022-12-06 07:02:46.653081 I | rafthttp: peer 1 became active
2022-12-06 07:02:46.653125 I | rafthttp: established a TCP streaming connection with peer 1 (stream Message reader)
2022-12-06 07:02:47.366548 I | rafthttp: peer d became active
2022-12-06 07:02:47.366636 I | rafthttp: established a TCP streaming connection with peer d (stream Message reader)
2022-12-06 07:02:47.843520 W | rafthttp: closed an existing TCP streaming connection with peer 1 (stream MsgApp v2 writer)
2022-12-06 07:02:47.843549 I | rafthttp: established a TCP streaming connection with peer 1 (stream MsgApp v2 writer)
2022-12-06 07:02:47.908891 W | rafthttp: closed an existing TCP streaming connection with peer e (stream MsgApp v2 writer)
2022-12-06 07:02:47.908918 I | rafthttp: established a TCP streaming connection with peer e (stream MsgApp v2 writer)
2022-12-06 07:02:48.213757 I | rafthttp: established a TCP streaming connection with peer b (stream Message reader)
2022-12-06 07:02:48.774699 I | rafthttp: established a TCP streaming connection with peer 2 (stream MsgApp v2 reader)
2022-12-06 07:02:49.147878 W | rafthttp: lost the TCP streaming connection with peer 2 (stream Message reader)
2022-12-06 07:02:49.147952 E | rafthttp: failed to read 2 on stream Message (read tcp x.x.x.11:46466->x.x.x.58:50401: i/o timeout)
2022-12-06 07:02:49.147962 I | rafthttp: peer 2 became inactive (message send to peer failed)
2022-12-06 07:02:49.637142 I | rafthttp: peer 2 became active
2022-12-06 07:02:49.637181 W | rafthttp: closed an existing TCP streaming connection with peer 2 (stream Message writer)
2022-12-06 07:02:49.637188 I | rafthttp: established a TCP streaming connection with peer 2 (stream Message writer)
2022-12-06 07:02:50.203643 I | rafthttp: established a TCP streaming connection with peer 2 (stream Message reader)
Killed
```

What could be the possible reasons for this? What metrics can we enable to monitor this Quorum node further?
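For context, a trailing `Killed` with no Go panic or stack trace usually means the process was terminated from outside, most often by the kernel OOM killer (an exit code of 137, i.e. 128 + SIGKILL, matches that signature). A minimal sketch of checks to confirm this, assuming the container is simply named `quorum` (a placeholder name):

```sh
# Did Docker record an OOM kill? Prints e.g. "true 137" for an OOM-killed container.
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' quorum

# Kernel log entries left behind by the OOM killer around the crash time
dmesg -T | grep -iE 'killed process|out of memory'
```

For ongoing visibility, geth (and therefore Quorum, which is a geth fork) can expose internal metrics and Go profiling endpoints. The exact flag names vary by version, so verify against `geth --help` on your build, but `--metrics` and `--pprof` exist across recent releases:

```sh
# Enable runtime metrics and the pprof/debug HTTP server
# (flag spellings differ between geth/Quorum versions; check `geth --help`)
geth --metrics --pprof <your-existing-node-flags>
```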

Purbaja commented 1 year ago

@baptiste-b-pegasys The only thing we could find in Azure Log Analytics is that the logical disk throughput (read/write MB/s) exceeds the VM's supported capacity. The supported capacity is 64 MB/s, whereas the actual throughput spiked to 133 MB/s, which is when the Quorum container got restarted. The Azure support team also reported high memory utilization on the VM at the time the Quorum container exited.

This Quorum node has been running for more than a year.
Current block height: 44949795. So what could suddenly cause a memory issue or a spike in logical disk MB/s? In the last month we started multiple contract event listeners on this Quorum network. Could these event listeners cause the spike?
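Event listeners that scan or replay historical logs translate into heavy LevelDB reads, so they could plausibly account for a disk-throughput spike; correlating per-container counters with the Azure metrics would confirm it. A rough monitoring sketch, again assuming the containers are named `quorum` and `tessera` (placeholders):

```sh
# One-shot snapshot of memory and block-I/O per container
docker stats --no-stream --format '{{.Name}} {{.MemUsage}} {{.BlockIO}}' quorum tessera

# Append a timestamped sample every 60 s for later correlation with Azure Log Analytics
while true; do
  date >> /var/log/container-stats.log
  docker stats --no-stream --format '{{.Name}} {{.MemUsage}} {{.BlockIO}}' >> /var/log/container-stats.log
  sleep 60
done
```

Setting an explicit memory limit on the container (e.g. `docker update --memory 8g quorum`, with the value sized to the VM) would also make any future OOM kill show up unambiguously in `docker inspect` rather than only in the kernel log.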

Purbaja commented 1 year ago

@baptiste-b-pegasys We also noticed that the geth.ipc file gets newly created at the time of the Quorum container restart.
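Note that this is expected behavior: geth removes any stale socket and creates a fresh geth.ipc when it starts, so a new geth.ipc timestamp is an effect of the restart rather than a cause. A quick way to verify, assuming the data directory inside the container is /qdata/dd (a placeholder path; substitute your own --datadir):

```sh
# The socket's change time should line up with the container's start time
docker exec quorum stat /qdata/dd/geth.ipc
docker inspect --format '{{.State.StartedAt}}' quorum
```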