0xPolygon / polygon-edge

A Framework for Building Ethereum-compatible Blockchain Networks
https://polygon.technology/solutions/polygon-edge/
Apache License 2.0
1.01k stars 524 forks source link

Memory leak #91

Closed mrwillis closed 3 years ago

mrwillis commented 3 years ago

Memory leak

Description

It seems like there is a memory leak somewhere in the code. Happens after a while of producing blocks leaving us to believe it's something related to block production. CPU usage remains constant.

Your environment

Steps to reproduce

Expected behaviour

No memory leak

Actual behaviour

Memory leak causing an OOM eerror causing a crash

Logs

9053 2021-07-20T16:42:52.019Z [INFO]  polygon.consensus.ibft.acceptState: Accept state: sequence=2210
 9054 2021-07-20T16:42:52.019Z [INFO]  polygon.consensus.ibft: current snapshot: validators=4 votes=0
 9055 2021-07-20T16:42:52.020Z [INFO]  polygon.consensus.ibft: proposer calculated: proposer=0x2d48d27cf3eec87c82f39527934df7bd2124b30a block=2210
 9056 2021-07-20T16:42:52.020Z [DEBUG] polygon.consensus.syncer: enqueue block: peer=16Uiu2HAm1k16mVYWaG1aoG7rZkbuaETqZ6vhV85vB9vtsjXmYbyz number=2209 hash=0x921643448243c439e2db6c1414c7254f2f4e05d78aa7ecb742bec4cd04259a36
 9057 fatal error: runtime: out of memory
 9058
 9059 runtime stack:
 9060 runtime.throw(0x1006f20, 0x16)
 9061         /usr/local/go/src/runtime/panic.go:1117 +0x72
 9062 runtime.sysMap(0xc008000000, 0x4000000, 0x1a90370)
 9063         /usr/local/go/src/runtime/mem_linux.go:169 +0xc6
 9064 runtime.(*mheap).sysAlloc(0x1a76840, 0x400000, 0x42dc97, 0x1a76848)
 9065         /usr/local/go/src/runtime/malloc.go:729 +0x1e5
 9066 runtime.(*mheap).grow(0x1a76840, 0x4, 0x0)
 9067         /usr/local/go/src/runtime/mheap.go:1346 +0x85
 9068 runtime.(*mheap).allocSpan(0x1a76840, 0x4, 0xc007ff0003, 0x7fc0c5fea668)
 9069         /usr/local/go/src/runtime/mheap.go:1173 +0x609
 9070 runtime.(*mheap).allocManual(0x1a76840, 0x4, 0xc007ff6803, 0x4239b2)
 9071         /usr/local/go/src/runtime/mheap.go:943 +0x47
 9072 runtime.getempty.func1()
 9073         /usr/local/go/src/runtime/mgcwork.go:374 +0x45
 9074 runtime.getempty(0xc0055f67c0)
 9075         /usr/local/go/src/runtime/mgcwork.go:373 +0x14e
 9076 runtime.(*stackScanState).addObject(0x7fc0c5feab78, 0xc0055f6758, 0xe1f600)
 9077         /usr/local/go/src/runtime/mgcstack.go:278 +0x113
 9078 runtime.scanframeworker(0x7fc0c5feaa90, 0x7fc0c5feab78, 0xc000045698)
 9079         /usr/local/go/src/runtime/mgcmark.go:945 +0x149
 9080 runtime.scanstack.func1(0x7fc0c5feaa90, 0x0, 0x18ea7c0)
 9081         /usr/local/go/src/runtime/mgcmark.go:746 +0x3d
 9082 runtime.gentraceback(0xffffffffffffffff, 0xffffffffffffffff, 0x0, 0xc0028d4d80, 0x0, 0x0, 0x7fffffff, 0x7fc0c5feacd8, 0x0, 0x0, ...)
 9083         /usr/local/go/src/runtime/traceback.go:322 +0x1135
 9084 runtime.scanstack(0xc0028d4d80, 0xc000045698)
 9085         /usr/local/go/src/runtime/mgcmark.go:749 +0x17c
 9086 runtime.markroot.func1()
 9087         /usr/local/go/src/runtime/mgcmark.go:233 +0xc6
 9088 runtime.markroot(0xc000045698, 0x6fb)
 9089         /usr/local/go/src/runtime/mgcmark.go:206 +0x33e
 9090 runtime.gcDrain(0xc000045698, 0x7)
 9091         /usr/local/go/src/runtime/mgcmark.go:1014 +0x118
 9092 runtime.gcBgMarkWorker.func2()
 9093         /usr/local/go/src/runtime/mgc.go:2003 +0x17e
 9094 runtime.systemstack(0xc004a5d200)
 9095         /usr/local/go/src/runtime/asm_amd64.s:379 +0x66
 9096 runtime.mstart()
 9097         /usr/local/go/src/runtime/proc.go:1246
 9098
 9099 goroutine 6 [GC worker (idle), 2 minutes]:
 9100 runtime.systemstack_switch()
 9101         /usr/local/go/src/runtime/asm_amd64.s:339 fp=0xc000054760 sp=0xc000054758 pc=0x46cd60
 9102 runtime.gcBgMarkWorker()
 9103         /usr/local/go/src/runtime/mgc.go:1967 +0x1c7 fp=0xc0000547e0 sp=0xc000054760 pc=0x41f667
 9104 runtime.goexit()
 9105         /usr/local/go/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc0000547e8 sp=0xc0000547e0 pc=0x46eba1
 9106 created by runtime.gcBgMarkStartWorkers
 9107         /usr/local/go/src/runtime/mgc.go:1835 +0x37
 9108
 9109 goroutine 1 [chan receive, 25 minutes]:
 9110 github.com/0xPolygon/minimal/command/helper.HandleSignals(0xc003179660, 0x1247b98, 0xc00082a8d0, 0xc0007b4bd0)
 9111         /home/ec2-user/polygon-sdk/command/helper/helper.go:181 +0x108
 9112 github.com/0xPolygon/minimal/command/server.(*ServerCommand).Run(0xc00000ecf0, 0xc000032140, 0xf, 0x10, 0xc00061b250)
 9113         /home/ec2-user/polygon-sdk/command/server/server_command.go:201 +0x1ce
 9114 github.com/mitchellh/cli.(*CLI).Run(0xc0000c97c0, 0xc0000c97c0, 0x44299c, 0xc000000300)
 9115         /home/ec2-user/polygon-sdk/vendor/github.com/mitchellh/cli/cli.go:255 +0x41a
 9116 main.Run(0xc000032130, 0x10, 0x11, 0xc00008a058)
 9117         /home/ec2-user/polygon-sdk/main.go:25 +0x96
 9118 main.main()
 9119         /home/ec2-user/polygon-sdk/main.go:12 +0x65
 9120
 9121 goroutine 94 [select, 2 minutes]:
 9122 github.com/libp2p/go-yamux.(*Session).sendLoop(0xc0000ee2a0, 0x0, 0x0)
 9123         /home/ec2-user/polygon-sdk/vendor/github.com/libp2p/go-yamux/session.go:474 +0x1bc
 9124 github.com/libp2p/go-yamux.(*Session).send(0xc0000ee2a0)
 9125         /home/ec2-user/polygon-sdk/vendor/github.com/libp2p/go-yamux/session.go:420 +0x2b
 9126 created by github.com/libp2p/go-yamux.newSession
 9127         /home/ec2-user/polygon-sdk/vendor/github.com/libp2p/go-yamux/session.go:128 +0x3e5
 9128
 9129 goroutine 8 [select, 25 minutes]:
 9130 github.com/ipfs/go-log/writer.(*MirrorWriter).logRoutine(0xc0001250e0)
 9131         /home/ec2-user/polygon-sdk/vendor/github.com/ipfs/go-log/writer/writer.go:71 +0x106
 9132 created by github.com/ipfs/go-log/writer.NewMirrorWriter
 9133         /home/ec2-user/polygon-sdk/vendor/github.com/ipfs/go-log/writer/writer.go:36 +0xb9
 9134
 9135 goroutine 9 [select, 2 minutes]:
 9136 go.opencensus.io/stats/view.(*worker).start(0xc0000f4480)
 9137         /home/ec2-user/polygon-sdk/vendor/go.opencensus.io/stats/view/worker.go:276 +0xcd
 9138 created by go.opencensus.io/stats/view.init.0
 9139         /home/ec2-user/polygon-sdk/vendor/go.opencensus.io/stats/view/worker.go:34 +0x68
 9140
 9141 goroutine 10 [select, 25 minutes]:
 9142 github.com/libp2p/go-libp2p-peerstore/pstoremem.(*memoryAddrBook).background(0xc000159600)
 9143         /home/ec2-user/polygon-sdk/vendor/github.com/libp2p/go-libp2p-peerstore/pstoremem/addr_book.go:93 +0x105
 9144 created by github.com/libp2p/go-libp2p-peerstore/pstoremem.NewAddrBook
 9145         /home/ec2-user/polygon-sdk/vendor/github.com/libp2p/go-libp2p-peerstore/pstoremem/addr_book.go:83 +0x18f
 9146
 9147 goroutine 54 [chan receive, 25 minutes]:
 9148 github.com/0xPolygon/minimal/network.(*Subscription).run(0xc00000fb48)
 9149         /home/ec2-user/polygon-sdk/network/server.go:484 +0x68
 9150 created by github.com/0xPolygon/minimal/network.(*Server).Subscribe
 9151         /home/ec2-user/polygon-sdk/network/server.go:515 +0x145
 9152
 9153 goroutine 12 [select, 2 minutes]:
 9154 github.com/libp2p/go-libp2p-swarm.(*DialBackoff).background(0xc000132a10, 0x12432d8, 0xc00082eb40)
 9155         /home/ec2-user/polygon-sdk/vendor/github.com/libp2p/go-libp2p-swarm/swarm_dial.go:126 +0xee
 9156 created by github.com/libp2p/go-libp2p-swarm.(*DialBackoff).init
 9157         /home/ec2-user/polygon-sdk/vendor/github.com/libp2p/go-libp2p-swarm/swarm_dial.go:119 +0x5d

Last 100:

 google.golang.org/grpc/internal/transport.(*http2Server).HandleStreams(0xc00452e480, 0xc0058a13b0, 0x115d858)
    /home/ec2-user/polygon-sdk/vendor/google.golang.org/grpc/internal/transport/http2_server.go:459 +0x9b
google.golang.org/grpc.(*Server).serveStreams(0xc0007c48c0, 0x124b7b8, 0xc00452e480)
    /home/ec2-user/polygon-sdk/vendor/google.golang.org/grpc/server.go:862 +0x103
google.golang.org/grpc.(*Server).handleRawConn.func1(0xc0007c48c0, 0x124b7b8, 0xc00452e480)
    /home/ec2-user/polygon-sdk/vendor/google.golang.org/grpc/server.go:821 +0x3f
created by google.golang.org/grpc.(*Server).handleRawConn
    /home/ec2-user/polygon-sdk/vendor/google.golang.org/grpc/server.go:820 +0x525
goroutine 7988 [chan send, 10 minutes]:
github.com/0xPolygon/minimal/consensus/ibft.(*Ibft).randomTimeout.func1(0x2540be400, 0xc005271c20)
    /home/ec2-user/polygon-sdk/consensus/ibft/ibft.go:848 +0x45
created by github.com/0xPolygon/minimal/consensus/ibft.(*Ibft).randomTimeout
    /home/ec2-user/polygon-sdk/consensus/ibft/ibft.go:846 +0xd6
goroutine 7984 [select, 10 minutes]:
google.golang.org/grpc/internal/transport.(*controlBuffer).get(0xc005877450, 0x1, 0x0, 0x0, 0x0, 0x0)
    /home/ec2-user/polygon-sdk/vendor/google.golang.org/grpc/internal/transport/controlbuf.go:395 +0xff
google.golang.org/grpc/internal/transport.(*loopyWriter).run(0xc0043ede00, 0x0, 0x0)
    /home/ec2-user/polygon-sdk/vendor/google.golang.org/grpc/internal/transport/controlbuf.go:515 +0x1dd
google.golang.org/grpc/internal/transport.newHTTP2Server.func2(0xc00452e480)
    /home/ec2-user/polygon-sdk/vendor/google.golang.org/grpc/internal/transport/http2_server.go:291 +0xd7
created by google.golang.org/grpc/internal/transport.newHTTP2Server
    /home/ec2-user/polygon-sdk/vendor/google.golang.org/grpc/internal/transport/http2_server.go:288 +0x1110
goroutine 8006 [select, 10 minutes]:
google.golang.org/grpc/internal/transport.(*controlBuffer).get(0xc005dcfcc0, 0x1, 0x0, 0x0, 0x0, 0x0)
    /home/ec2-user/polygon-sdk/vendor/google.golang.org/grpc/internal/transport/controlbuf.go:395 +0xff
google.golang.org/grpc/internal/transport.(*loopyWriter).run(0xc004ae9020, 0x0, 0x0)
    /home/ec2-user/polygon-sdk/vendor/google.golang.org/grpc/internal/transport/controlbuf.go:515 +0x1dd
google.golang.org/grpc/internal/transport.newHTTP2Client.func3(0xc004b7da40)
    /home/ec2-user/polygon-sdk/vendor/google.golang.org/grpc/internal/transport/http2_client.go:391 +0x7b
created by google.golang.org/grpc/internal/transport.newHTTP2Client
    /home/ec2-user/polygon-sdk/vendor/google.golang.org/grpc/internal/transport/http2_client.go:389 +0x11e7
goroutine 8020 [select, 10 minutes]:
google.golang.org/grpc/internal/transport.(*http2Server).keepalive(0xc00452f500)
    /home/ec2-user/polygon-sdk/vendor/google.golang.org/grpc/internal/transport/http2_server.go:979 +0x265
created by google.golang.org/grpc/internal/transport.newHTTP2Server
    /home/ec2-user/polygon-sdk/vendor/google.golang.org/grpc/internal/transport/http2_server.go:299 +0x1135
goroutine 8004 [select, 10 minutes]:
github.com/libp2p/go-ya(0xc0019f38c0, 0xc006850000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
    /home/ec2-user/polygon-sdk/vendor/github.com/libp2p/go-yamux/stream.go:117 +0x197
github.com/libp2p/go-libp2p-yamux.(*stream).Read(0xc0019f38c0, 0xc006850000, 0x8000, 0x8000, 0x11, 0x0, 0x0)
    /home/ec2-user/polygon-sdk/vendor/github.com/libp2p/go-libp2p-yamux/stream.go:14 +0x4d
github.com/libp2p/go-libp2p-swarm.(*Stream).Read(0xc0057f7e30, 0xc006850000, 0x8000, 0x8000, 0x11, 0x0, 0x0)
    /home/ec2-user/polygon-sdk/vendor/github.com/libp2p/go-libp2p-swarm/swarm_stream.go:57 +0x62
github.com/multiformats/go-multistream.(*lazyClientConn).Read(0xc0057f7ea0, 0xc006850000, 0x8000, 0x8000, 0x11, 0x0, 0x0)
    /home/ec2-user/polygon-sdk/vendor/github.com/multiformats/go-multistream/lazyClient.go:68 +0xac
github.com/libp2p/go-libp2p/p2p/host/basic.(*streamWrapper).Read(0xc004ac4540, 0xc006850000, 0x8000, 0x8000, 0x0, 0x800010601, 0x1060100000000)
    /home/ec2-user/polygon-sdk/vendor/github.com/libp2p/go-libp2p/p2p/host/basic/basic_host.go:1008 +0x52
bufio.(*Reader).Read(0xc004ae8e40, 0xc0019c7ed8, 0x9, 0x9, 0x14, 0x7fc0f4bba108, 0x18)
    /usr/local/go/src/bufio/bufio.go:227 +0x222
io.ReadAtLeast(0x122e3c0, 0xc004ae8e40, 0xc0019c7ed8, 0x9, 0x9, 0x9, 0xc006759dd0, 0x45a513, 0x1924c408a7582)
    /usr/local/go/src/io/io.go:328 +0x87
io.ReadFull(...)
    /usr/local/go/src/io/io.go:347
golang.org/x/net/http2.readFrameHeader(0xc0019c7ed8, 0x9, 0x9, 0x122e3c0, 0xc004ae8e40, 0x0, 0x0, 0xc035dca7898b8edd, 0xed3cbafc16)
    /home/ec2-user/polygon-sdk/vendor/golang.org/x/net/http2/frame.go:237 +0x89
golang.org/x/net/http2.(*Framer).ReadFrame(0xc0019c7ea0, 0xc003685d40, 0xc003685d40, 0x0, 0x0)
    /home/ec2-user/polygon-sdk/vendor/golang.org/x/net/http2/frame.go:492 +0xa5
google.golang.org/grpc/internal/transport.(*http2Client).reader(0xc004b7da40)
    /home/ec2-user/polygon-sdk/vendor/google.golang.org/grpc/internal/transport/http2_client.go:1330 +0x185
created by google.golang.org/grpc/internal/transport.newHTTP2Client
    /home/ec2-user/polygon-sdk/vendor/google.golang.org/grpc/internal/transport/http2_client.go:345 +0x1011
goroutine 8003 [chan receive, 10 minutes]:
google.golang.org/grpc.(*addrConn).resetTransport(0xc002a09b80)
    /home/ec2-user/polygon-sdk/vendor/google.golang.org/grpc/clientconn.go:1213 +0x465
created by google.golang.org/grpc.(*addrConn).connect
    /home/ec2-user/polygon-sdk/vendor/google.golang.org/grpc/clientconn.go:843 +0x12a
goroutine 8002 [select, 10 minutes]:
google.golang.org/grpc.(*ccBalancerWrapper).watcher(0xc004529180)
    /home/ec2-user/polygon-sdk/vendor/google.golang.org/grpc/balancer_conn_wrappers.go:69 +0xac
created by google.golang.org/grpc.newCCBalancerWrapper
    /home/ec2-user/polygon-sdk/vendor/google.golang.org/grpc/balancer_conn_wrappers.go:60 +0x172
goroutine 8027 [chan send, 10 minutes]:
github.com/0xPolygon/minimal/consensus/ibft.(*Ibft).randomTimeout.func1(0x2540be400, 0xc00528c7e0)
    /home/ec2-user/polygon-sdk/consensus/ibft/ibft.go:848 +0x45
created by github.com/0xPolygon/minimal/consensus/ibft.(*Ibft).randomTimeout
    /home/ec2-user/polygon-sdk/consensus/ibft/ibft.go:846 +0xd6
goroutine 8025 [chan send, 10 minutes]:
github.com/0xPolygon/minimal/consensus/ibft.(*Ibft).randomTimeout.func1(0x2540be400, 0xc0062bbc20)
    /home/ec2-user/polygon-sdk/consensus/ibft/ibft.go:848 +0x45
created by github.com/0xPolygon/minimal/consensus/ibft.(*Ibft).randomTimeout
    /home/ec2-user/polygon-sdk/consensus/ibft/ibft.go:846 +0xd6
goroutine 8021 [select, 10 minutes]:
github.com/libp2p/go-yamux.(*Stream).Read(0xc0019f3970, 0xc006870000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
    /home/ec2-user/polygon-sdk/vendor/github.com/libp2p/go-yamux/stream.go:117 +0x197
github.com/libp2p/go-libp2p-yamux.(*stream).Read(0xc0019f3970, 0xc006870000, 0x8000, 0x8000, 0x2f, 0x0, 0x0)
    /home/ec2-user/polygon-sdk/vendor/github.com/libp2p/go-libp2p-yamux/stream.go:14 +0x4d
github.com/libp2p/go-libp2p-swarm.(*Stream).Read(0xc0058a42a0, 0xc006870000, 0x8000, 0x8000, 0x2f, 0x0, 0x0)
    /home/ec2-user/polygon-sdk/vendor/github.com/libp2p/go-libp2p-swarm/swarm_stream.go:57 +0x62

Proposed solution

If you have an idea of how to fix this issue, please write it down here, so we can begin discussing it

Kourin1996 commented 3 years ago

Hello, @mrwillis. Thank you for reporting this issue. I investigated this problem in the code of latest develop branch (https://github.com/0xPolygon/polygon-sdk/commit/2f50a194f21d972b542ccd1e1571ccba40c5c810) briefly and found a memory leak issue around gRPC.

This issue prevented GC from clearing buffer in gRPC and buffer was accumulated. (In my check, the heap size was increasing about 50 MB every hour) I think this is the main cause of memory leak currently. I've opened new PR (https://github.com/0xPolygon/polygon-sdk/pull/100) to fix it, so the issue will be fixed shortly.

I saw continuous increase of heap in IBFT package. But I'm currently not sure it will cause memory leak. I will test for longer term to check long-term impact of my code change and other memory leak issue.

Thank you again for your time to have raised the issue.

Kourin1996 commented 3 years ago

Hi, I've opened another PR. I've merged this change and run IBFT nodes in AWS EC2 instance (t2.small). As of now, all 4 nodes have been running without any crashes over 3 days. I'm collecting profiles of the nodes and summarized it.

First chart is the result when I run nodes with the change of PR #100, second one is the result when I run nodes with the change of PR #100 and #105. The x-axis shows time and y-axis shows heap size of each function in kB. Please note the ranges of y-axis are same but the ranges of x-axis are different because of different measurement period. Each line shows the change of heap size in each function.(Please check legends) I picked some the functions which allocate biggest amount of heap.

polygon-sdk-profiles_01_heap_out

polygon-sdk-profiles-2_01_heap_out

The result shows PR #100 is not enough to resolve memory leak and there is no memory leak after I merged both #100 and #105. I'm worried about that leveldb uses much heap and increases it from the beginning. But the change is moving periodically. (As of now, I can't say it's memory leak)

lazartravica commented 3 years ago

Excellent job @Kourin1996 for solving the issue in multiple places and building a handy tool along the way: https://github.com/0xPolygon/go-profile-chart-generator/

I am closing the issue.