Closed fanbsb closed 8 months ago
Thanks for the detailed report. Could I ask you to run just the node (without the sidecar) to make sure it works fine on its own?
I think it's the sidecar that's spamming the node with RPC queries, which fills the node's cache. I think it's possible to improve the situation by changing some of the flags. But for starters it would be good if you could answer the above question.
Thanks, Damian. At the beginning the node ran normally without RPC queries; it just synced blocks quietly. But as you know, the node itself doesn't provide many useful RPCs, like querying a balance by address, which is why we need the sidecar. I can also check a Polkadot node to see if it has the same issue.
Yes, I understand, this makes sense. We need to test it to see whether memory consumption can be decreased somehow when running alongside the sidecar.
When you experienced this large memory consumption, did you just run the sidecar, or did you also send some queries? This info would be useful for me to reproduce.
We run the sidecar container together with the azero container on one AWS instance. I just enabled the http log level in the sidecar; the queries sent from the application are listed below:
```
2023-07-04 03:50:43 http: GET /blocks/51786977? 200 9ms
2023-07-04 03:50:43 http: GET /blocks/51786978? 200 8ms
2023-07-04 03:50:43 http: GET /blocks/51786979? 200 8ms
2023-07-04 03:50:44 http: GET /blocks/51786980? 200 7ms
2023-07-04 03:50:44 http: GET /blocks/51786981? 200 9ms
```
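For anyone trying to reproduce this load pattern, the sequential block queries above can be replayed with a small loop. This is only a sketch: `SIDECAR_URL` and port 8080 (the sidecar's default) are assumptions, not taken from this setup.

```shell
# Replay the sequential /blocks queries seen in the sidecar log above.
# SIDECAR_URL and the port are assumptions (8080 is the sidecar default).
SIDECAR_URL="${SIDECAR_URL:-http://127.0.0.1:8080}"
for height in $(seq 51786977 51786981); do
  echo "GET $SIDECAR_URL/blocks/$height"
  # curl -s "$SIDECAR_URL/blocks/$height" > /dev/null  # uncomment to send real requests
done
```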
I don't think those simple block queries should be the root cause, but I will keep monitoring the query log.
Hi!
I'm still trying to reproduce. Could you please provide the command you use to run the sidecar docker? Do you have a .env.docker file? If so, could you also share its contents?
I attached the k8s StatefulSet yaml file below:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: azero
  labels:
    app.kubernetes.io/name: azero
spec:
  serviceName: azero-node
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: azero
  template:
    metadata:
      labels:
        app.kubernetes.io/name: azero
    spec:
      containers:
```
Updates from yesterday: I changed the sidecar log level to http so I can see the query log. After restarting the Pod, memory usage looks normal, without increasing dramatically.
Here is the output of the free command; you can see only around 6-7G used after one day of running:
```
root@azero-0:/node# free -h
               total        used        free      shared  buff/cache   available
Mem:            61Gi       6.9Gi        51Gi       3.0Mi       3.5Gi        54Gi
Swap:             0B          0B          0B
```
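A one-off `free` snapshot only shows the current state; to catch the growth pattern it may help to sample memory over time. A minimal sampler sketch (my addition, not part of the original setup) could look like this; in practice you'd use a longer interval (e.g. `sleep 60`) and redirect the output to a file:

```shell
# Print the "used" column of `free -m` a few times with a timestamp,
# so memory growth can later be correlated with the sidecar query log.
for i in 1 2 3; do
  echo "$(date -u +%FT%TZ) $(free -m | awk '/^Mem:/ {print $3}') MiB used"
  sleep 1
done
```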
Hi, thanks for the details. I experimented quite a lot and generally I cannot reproduce the high memory usage.
I used c5.2xlarge which has 16GB RAM, and could run your setup without issue (I didn't use EKS though).
My observations: when I'm syncing from a snapshot, the node indeed has increased RAM consumption -- in my case it was ~12GB at peak. After syncing, I ran the sidecar, ran a number of queries, and left it running for a long time, and it stayed at ~12GB.
Then I restarted aleph-node, after which it started using much less memory (note that it was already fully synced); in fact, after >24h it stabilized at around 6GB RAM. The sidecar has been running all the time, and I have sent lots of queries to it as well.
So what I can say is that the 16GB RAM on c5.2xlarge is definitely enough, and I have no idea what issues you are facing.
I saw that the situation improved for you. Let us know if the issues repeat... In such a case, try whether restarting the aleph-node docker does the job.
Thanks, Damian. Today I checked the memory usage: it's 7.1G used, so it looks good. I will keep monitoring for a few more days, then downgrade the instance.
Actually, during the last memory shortage, k8s killed the container automatically and restarted it. The new container started to resync blocks while RPC queries kept coming in, so it quickly ate up memory and the shortage came back, making the container restart again and again.
I guess this is a rare case where a certain RPC query combined with block syncing causes the memory shortage and restart loop. Since I don't know which RPC query causes the problem, I will keep monitoring the status, and if it happens again I can find the query in the sidecar log.
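To narrow down which query is the culprit, one quick way is to summarize the sidecar http log by query shape. This is only a sketch: the log format is taken from the snippet earlier in this thread, and `/tmp/sidecar.log` is a placeholder path with sample data.

```shell
# Sample data in the sidecar's http log format (placeholder file).
cat > /tmp/sidecar.log <<'EOF'
2023-07-04 03:50:43 http: GET /blocks/51786977? 200 9ms
2023-07-04 03:50:43 http: GET /blocks/51786978? 200 8ms
2023-07-04 03:50:44 http: GET /blocks/51786980? 200 7ms
EOF
# Extract the request path, collapse numeric ids into ":id", and count
# occurrences, so the most frequent endpoint shapes float to the top.
grep -o 'GET [^ ]*' /tmp/sidecar.log \
  | sed -E 's#/[0-9]+#/:id#g' \
  | sort | uniq -c | sort -rn
```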
The issue appeared again in the last 3 days: memory usage quickly rises until the container is killed.
Shall I upgrade to the latest release?
Thanks for the report; we'll look into the issue shortly and try to reproduce it.
Could you please give us more context, specifically what the application in question is doing and what kind of queries it is making to the sidecar? Any detail that helps in reproducing the problem would be priceless.
Also, please upgrade to version 11.4. We definitely recommend keeping the node at the latest version (as seen on the releases page).
Thanks, we upgraded to 11.3 one week ago; it looks stable now. We will keep monitoring the node.
@fanbsb, please let us know if the problem reappears. We'll keep the issue open until 22nd October, and then close it if we see no further updates.
Did you read the documentation and guides?
Is there an existing issue?
Description of the problem
Hi
This is Philip from the Chainge team. We set up an Azero RPC node for cross-chain purposes on AWS EKS, but we keep experiencing out-of-memory issues after the node has been running, even though we are using a 32G memory instance. I will describe the steps and spec details below.
Information on your setup.
Steps to reproduce
```
MiB Mem :  63275.5 total,    425.0 free,  53201.3 used,   9649.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   9482.5 avail Mem

  PID USER  PR NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
    9 root  20  0  248.1g  51.1g  37128 S  3.7 82.6 213:40.56 aleph-node --validator --execution Native --name chainge_node --base-path /data/azero --pool-limit 1+
    1 root  20  0    4360   3384   3100 S  0.0  0.0   0:00.00 bash ./docker_entrypoint.sh
```
Flags of the RPC node:

```
--validator --execution Native --name chainge_node --base-path /data/azero --pool-limit 100 --chain mainnet --node-key-file /data/azero/p2p_secret --backup-path /data/azero/backup-stash --rpc-port 9934 --ws-port 9943 --port 30334 --validator-port 30343 --rpc-cors all --no-mdns --ws-max-connections 100 --unsafe-ws-external --unsafe-rpc-external --enable-log-reloading --db-cache 1024 --runtime-cache-size 2 --max-runtime-instances 8 --prometheus-external --telemetry-url wss://telemetry.polkadot.io/submit/ 0 --rpc-methods Safe --public-validator-addresses 192.0.2.1:30343 --unit-creation-delay=300
```
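Several of those flags bound caches and therefore matter when hunting memory growth. A quick sketch for pulling the memory-related knobs out of a flag string like the one above (the selection of flags is mine, not exhaustive, and the `FLAGS` variable here is a shortened stand-in for the full command line):

```shell
# Stand-in for the node's flag string; substitute the real command line.
FLAGS="--pool-limit 100 --ws-max-connections 100 --db-cache 1024 --runtime-cache-size 2 --max-runtime-instances 8"
# Print each memory-related flag with its value.
for f in db-cache runtime-cache-size max-runtime-instances pool-limit ws-max-connections; do
  echo "$FLAGS" | grep -o -- "--$f [0-9]*"
done
```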
Log from node:

```
2023-07-03 11:33:47 ✨ Imported #51728417 (0x2b61…f303)
2023-07-03 11:33:48 💤 Idle (37 peers), best: #51728417 (0x2b61…f303), finalized #51728413 (0xad0c…4b12), ⬇ 39.0kiB/s ⬆ 32.6kiB/s
2023-07-03 11:33:48 ✨ Imported #51728418 (0x02da…4067)
2023-07-03 11:33:49 ✨ Imported #51728419 (0x855d…ca92)
2023-07-03 11:33:50 ✨ Imported #51728420 (0xaf5e…e375)
2023-07-03 11:33:51 ✨ Imported #51728421 (0x6a9f…8227)
2023-07-03 11:33:52 ✨ Imported #51728422 (0xb10f…31f2)
2023-07-03 11:33:53 💤 Idle (37 peers), best: #51728422 (0xb10f…31f2), finalized #51728414 (0x8f2a…6e16), ⬇ 29.7kiB/s ⬆ 20.3kiB/s
2023-07-03 11:33:53 ✨ Imported #51728423 (0x1f85…a951)
2023-07-03 11:33:54 ✨ Imported #51728424 (0xdc84…6892)
```
Did you attach relevant logs?