Consensys / quorum

A permissioned implementation of Ethereum supporting data privacy
https://www.goquorum.com/
GNU Lesser General Public License v3.0

The memory utilization is increasing in Istanbul BFT #481

Closed · hagishun closed this issue 6 years ago

hagishun commented 6 years ago

System information

Expected behaviour

The memory utilization is periodically released.

Actual behaviour

Memory utilization keeps increasing; only one node's memory is periodically released.

Steps to reproduce the behaviour

Evnr

Backtrace

log.zip

[backtrace]
fixanoid commented 6 years ago

@hagishun could you give me some details of the cluster: what's the configuration like and what hardware was used? Also, are you getting the same issue when building the quorum client from master?

tharun-allu commented 6 years ago

[image] This shows the memory utilization on the nodes I am running. It looks to me like there might be a memory leak. The graph covers 1 week of utilization.

fixanoid commented 6 years ago

@tharun-allu thanks for the metrics. What's the load on the chain, and how far has it advanced blockwise?

tharun-allu commented 6 years ago

@fixanoid the above graphs are from a network of nodes with 16G memory each, and the block height is 4 million.

I restarted the geth process in a different network yesterday, at 7 million blocks; attached is the memory graph of its nodes (there are 4 in the network). [image]

What I noticed is that memory shot up on 2 of the nodes; one node died and another will soon die.

namtruong commented 6 years ago

Hi @tharun-allu, I've been trying to replicate this issue but have been unable to. Did you get this only after it reached 4 million blocks? From the attached graph it seems quite stable for some time before going up - was there any incident observed in the log?

tharun-allu commented 6 years ago

[image] @namtruong The 2nd graph was from my development network. Attached are this week's graphs for the same nodes. Unfortunately I only implemented monitoring for block height last week; the block height for the second graph is 7.9 million now.

My suspicion is that the more transactions go through the network, the faster the growth is. I currently run 3 networks, and the rate of growth seems to correlate with how busy (in transactions) each network is. If you want me to provide any additional data or collect any new metrics, I can do that and post it here.

Also, to reduce confusion, I can only post data from one environment. Let me know your thoughts.
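
(For anyone who wants to collect comparable numbers, the sketch below polls a node's heap usage and block height over JSON-RPC. It assumes the node exposes the eth and debug APIs on its HTTP endpoint, e.g. started with --rpc --rpcapi eth,debug on a geth/Quorum build of this era; the endpoint URL and the polling interval are placeholders, not values from this thread.)

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

const endpoint = "http://localhost:8545" // placeholder: the node's HTTP RPC endpoint

// call performs one JSON-RPC request and unmarshals the "result" field into out.
func call(method string, out interface{}) error {
	payload := []byte(`{"jsonrpc":"2.0","id":1,"method":"` + method + `","params":[]}`)
	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	var envelope struct {
		Result json.RawMessage `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&envelope); err != nil {
		return err
	}
	return json.Unmarshal(envelope.Result, out)
}

func main() {
	for {
		// debug_memStats returns the node's runtime.MemStats as JSON.
		var mem struct {
			HeapAlloc uint64 `json:"HeapAlloc"`
			HeapInuse uint64 `json:"HeapInuse"`
		}
		memErr := call("debug_memStats", &mem)

		// eth_blockNumber returns the current block height as a hex string.
		var blockHex string
		blkErr := call("eth_blockNumber", &blockHex)
		block, _ := strconv.ParseUint(strings.TrimPrefix(blockHex, "0x"), 16, 64)

		fmt.Printf("%s heapAlloc=%d heapInuse=%d block=%d memErr=%v blkErr=%v\n",
			time.Now().Format(time.RFC3339), mem.HeapAlloc, mem.HeapInuse, block, memErr, blkErr)
		time.Sleep(time.Minute) // placeholder polling interval
	}
}

Plotting heapAlloc against block height over a few days should show whether memory growth tracks chain activity, as suspected above.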

namtruong commented 6 years ago

@tharun-allu thank you for the info.

I've put up a change for this: https://github.com/namtruong/quorum/tree/bugfix/istanbul-storeBacklog-memory-leak. Could you please test that branch and let me know if it fixes the issue?

Many thanks!
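
(For context on what the branch name points at: Istanbul's core keeps a backlog of future-round consensus messages per sender, and if such a backlog is never bounded or drained it can grow with traffic. The snippet below only illustrates the general idea of capping a per-sender backlog; the names, cap value, and data structures are hypothetical and are not the actual quorum/istanbul code or the fix in the branch.)

package main

import (
	"fmt"
	"sync"
)

const maxBacklogPerSender = 1024 // hypothetical per-sender cap

// message stands in for a future-round consensus message.
type message struct {
	Round uint64
	Data  []byte
}

// backlog holds not-yet-processable messages keyed by sender address.
type backlog struct {
	mu     sync.Mutex
	queues map[string][]message
}

func newBacklog() *backlog {
	return &backlog{queues: make(map[string][]message)}
}

// store keeps a future-round message for later processing; once the per-sender
// cap is reached, the oldest entry is dropped so memory stays bounded.
func (b *backlog) store(sender string, m message) {
	b.mu.Lock()
	defer b.mu.Unlock()
	q := b.queues[sender]
	if len(q) >= maxBacklogPerSender {
		q[0] = message{} // let the evicted payload be garbage collected
		q = q[1:]
	}
	b.queues[sender] = append(q, m)
}

// drain removes and returns everything stored for a sender, e.g. once the node
// has caught up to the rounds those messages belong to.
func (b *backlog) drain(sender string) []message {
	b.mu.Lock()
	defer b.mu.Unlock()
	msgs := b.queues[sender]
	delete(b.queues, sender)
	return msgs
}

func main() {
	b := newBacklog()
	b.store("0xabc", message{Round: 7})
	fmt.Println(len(b.drain("0xabc"))) // prints 1
}

The point of the cap in this sketch is simply that memory stays proportional to the number of senders rather than to the message volume.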

tharun-allu commented 6 years ago

[image] This is the latest pattern; I have not tested the fix yet. I notice only one node jumping up and then staying roughly stable afterwards. I am going to restart that node to see if the other nodes behave differently. I will keep you posted on my observations.

tharun-allu commented 6 years ago

@namtruong I updated my dev network with the code from the branch and will keep you updated on how it goes today.

# ./geth version
Geth
Version: 1.7.2-stable
Git Commit: 891c6c5e5c2a38c2a2982587bc12b282422929a4
Quorum Version: 2.1.0
Architecture: amd64
Go Version: go1.10.4
Operating System: linux
GOPATH=

namtruong commented 6 years ago

@tharun-allu have you got any update?

tharun-allu commented 6 years ago

[image]

Looks like this has resolved the issue. I will confirm by downloading 2.1.0 from upstream and seeing whether the issue comes back. I upgraded from 2.0.1 to 2.1.0 using @namtruong's branch.

namtruong commented 6 years ago

@tharun-allu thanks for your update. FYI, my colleague and I were also working on a different patch here: https://github.com/jpmorganchase/quorum/compare/master...trung:f-istanbul-backlogs?expand=1. We're in the process of testing the changes, but they ultimately solve the same issue. Please feel free to test either of the solutions and let us know your feedback.

fixanoid commented 6 years ago

@tharun-allu the new pull request that addresses the issue is here: https://github.com/jpmorganchase/quorum/pull/521

tharun-allu commented 6 years ago

[image] The PR seems to have resolved this issue.