lahabana opened 4 years ago
I've observed this for all full nodes, but the proxy leaks more rapidly than others.
Since there have been no new comments since 2020, I'm not sure if I am experiencing the exact same issue.
I am also experiencing memory leak issues with proxies, validators and full nodes (so the Celo Geth client). 12GB of available memory fills up within 5 to 6 days. Memory usage is at about 6GB right after starting and then rises by roughly 1GB per day.
EDIT: Using Docker image us.gcr.io/celo-org/geth:1.7.4
Which release version are you using? @erNail
Version 1.7.4 (Docker image us.gcr.io/celo-org/geth:1.7.4)
I might have some new information on this. I'm not sure if it is related to the problem described above, though.
In my case I assumed a memory leak, but the issue might be something different. I am running the Celo Geth clients in a Kubernetes environment with containerd as the container runtime, and a memory limit is set for the pod/container.
Here is the problem: the running application is not aware of the memory limit I have set for the pod/container. From my understanding, Go tries to use all of the machine's available memory and processors, so the Geth client assumes it has all the memory of the machine it's running on, without considering the memory limit I have set.
This is also supported by multiple comments in the Celo Discord, which describe that no matter how much memory the machine has, the Geth client always tries to use almost all of it. One user described a memory usage of around 42GB on a 48GB machine.
A possible solution, if Go 1.19 were used, would be to set a memory limit at the application level via the Go environment variable GOMEMLIMIT.
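For illustration, here is a minimal sketch of what an application-level limit could look like on Go 1.19+. The `POD_MEMORY_LIMIT_BYTES` variable and the wiring to the container limit are assumptions made up for this example, not something the Celo client does today; the zero-code alternative is simply setting `GOMEMLIMIT` in the pod spec:

```go
package main

import (
	"os"
	"runtime/debug"
	"strconv"
)

// applyMemoryLimit sets a soft memory limit for the Go runtime based on a
// hypothetical POD_MEMORY_LIMIT_BYTES env var (e.g. injected from the pod's
// resources.limits.memory via the Kubernetes downward API).
func applyMemoryLimit() {
	raw := os.Getenv("POD_MEMORY_LIMIT_BYTES")
	if raw == "" {
		return // no limit configured; keep the runtime default
	}
	limit, err := strconv.ParseInt(raw, 10, 64)
	if err != nil || limit <= 0 {
		return
	}
	// Leave ~10% headroom below the container limit so the GC reacts before
	// the kernel OOM killer does. With Go 1.19+ the same effect can be had
	// without code changes by setting GOMEMLIMIT (e.g. GOMEMLIMIT=10GiB).
	debug.SetMemoryLimit(limit - limit/10)
}

func main() {
	applyMemoryLimit()
	// ... start the node as usual ...
}
```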
Expected Behavior
Proxies shouldn't be leaking memory
Current Behavior
Over a long period of time the proxy grows in memory utilization until we reach an OOM and Geth restarts.
We've seen manifestations of this on both Baklava and Mainnet multiple times. The historical pattern is very clear.
I've taken memory dumps of a node:
- after restart: [heap profile screenshot]
- after ~12h: [heap profile screenshot]
We can see slow growth in allocations from `p2p/discv5`.

Look at a graph of the first heap: [heap graph screenshot]

It seems something might be keeping a lot of `reqQueryFindnode` around. Digging into the code, here's my theory:

- In `unknown`, the call to `start` fails and we call `deferQuery`.
- `deferQuery` adds the query to the list of `deferredQueries`, which are retried whenever possible by calling `start()` again; if `start` returns true we remove the query from the list of `deferredQueries`.
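To make that mechanism concrete, here is a heavily simplified sketch of how I read it. The types and fields are illustrative stand-ins, not the actual definitions from `net.go`:

```go
package discv5sketch

// Simplified stand-ins for the real p2p/discv5 structures; this only
// illustrates the deferred-query retry mechanism described above.

type nodeState struct {
	name     string
	canQuery bool // only a few states set this to true
}

type Node struct {
	state *nodeState
}

type findnodeQuery struct {
	remote *Node
	// target, reply channel, etc. elided
}

type Network struct {
	deferredQueries []*findnodeQuery
}

// start only succeeds when the remote node is in a canQuery state.
func (q *findnodeQuery) start(net *Network) bool {
	if !q.remote.state.canQuery {
		return false
	}
	// ... send the findnode packet and register the pending reply ...
	return true
}

// deferQuery parks a query that could not be started yet.
func (net *Network) deferQuery(q *findnodeQuery) {
	net.deferredQueries = append(net.deferredQueries, q)
}

// retryDeferredQueries is run "whenever possible". Queries whose start()
// keeps returning false stay in the slice, so a node that never reaches a
// canQuery state keeps its queries (and everything they reference) alive
// indefinitely.
func (net *Network) retryDeferredQueries() {
	kept := net.deferredQueries[:0]
	for _, q := range net.deferredQueries {
		if !q.start(net) {
			kept = append(kept, q)
		}
	}
	net.deferredQueries = kept
}
```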
I believe the problem is that some nodes never reach a state where `canQuery` is true, which is necessary for `start()` to return true, as shown here: https://github.com/celo-org/celo-blockchain/blob/master/p2p/discv5/net.go#L798

The states where `canQuery` is true are: `known`, `unresponsive`, `contested`.

We can clearly see from `start()` that `unknown` and `verifyinit` are reachable. It seems possible to stay locked in a loop of `unknown`, `verifyinit`, `verifywait` and `remoteverifywait` if the node succeeds in doing a full `ping`/`pong` flow.

The result is that we keep a deferred query that we never get rid of. However, whenever a node enters the `unknown` state it aborts all its `deferredQueries`, so it seems more likely that we get stuck in a loop of `verifyinit`, `verifywait`, `remoteverifywait`.
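Writing that state picture out as a sketch (only the properties discussed here; the real state machine in `net.go` has more states and transitions):

```go
package discv5sketch

// Which states allow queries to start, per the analysis above.
var canQueryByState = map[string]bool{
	"unknown":          false,
	"verifyinit":       false,
	"verifywait":       false,
	"remoteverifywait": false,
	"known":            true,
	"unresponsive":     true,
	"contested":        true,
}

// Suspected cycle: a node that keeps completing ping/pong handshakes without
// ever settling into known can bounce between these states. While it does,
// canQuery stays false, so its deferred findnode queries are never started;
// and because it never re-enters unknown, they are never aborted either.
var suspectedLoop = []string{"verifyinit", "verifywait", "remoteverifywait"}
```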
There are a few odd things in the state machine:

- in `remoteverifywait`, when a ping timeout happens we transition to `known`
- in `verifywait`, when the `pongPacket` received is invalid we transition to `known`

Neither of these explains what happens, but it really seems that the state machine gets stuck in a state that prevents `deferredQueries` from being removed.

System information
Seen on both 1.0.1 and 1.1.0