Cardinal-Cryptography / aleph-node-issues

Issue tracker for aleph-node related problems.

Extremely high memory usage on RPC Node r-10.1-mainnet with parity/substrate-api-sidecar v17.1.0 #7

Status: Closed (closed by fanbsb 8 months ago)

fanbsb commented 1 year ago

Did you read the documentation and guides?

Is there an existing issue?

Description of the problem

Hi

This is Philip from the Chainge team. We set up an Azero RPC node for cross-chain purposes on AWS EKS, but we keep running into out-of-memory issues once the node is up, even on a 32 GB memory instance. I will describe the steps and specs in detail below.

Information on your setup.

  1. Running on Mainnet
  2. Version:
    • Aleph-node: public.ecr.aws/p6e8q1z1/aleph-node:r-10.1-mainnet
    • Sidecar: parity/substrate-api-sidecar:v17.1.0
  3. Running on AWS EKS, one node with 2 containers: aleph-node and sidecar
  4. RPC node
  5. ENV flags for the aleph-node container:
    • NAME: "chainge_node"
    • CHAIN: "mainnet"
    • BASE_PATH: "/data/azero"
    • NODE_KEY_PATH: "/data/azero/p2p_secret"
    • PORT: "30334"
    • RPC_PORT: "9934"
    • POOL_LIMIT: "100"
    • VALIDATOR: "false"
    • TELEMETRY_ENABLED: "true"
    • RUST_LOG: "info"

    ENV flags for the sidecar container:
    • SAS_EXPRESS_PORT: "9956"
    • SAS_SUBSTRATE_URL: "ws://localhost:9943"
  6. Operating System: Docker container, running on Amazon Linux 2
  7. AWS r6i.xlarge (32G memory) and r6i.2xlarge (64G memory)

Steps to reproduce

  1. I downloaded the snapshot data manually from the official website, then unpacked everything into the folder /data/azero/chains/mainnet (see the sketch after this list).
  2. Started the node with the env flags above on an r6i.xlarge server; everything looked good and the node was syncing blocks from the p2p network.
  3. Added the sidecar to the YAML file and restarted the node; with 2 containers running on the same instance, everything still looked good.
  4. Tested some sidecar RPCs to make sure the node returned correct data.
  5. Our system started invoking the sidecar RPC for cross-chain purposes; the memory usage of aleph-node then began to climb, eating at least 30 GB within 10 minutes, after which EKS killed the container for running out of memory.
  6. We replaced the instance with an r6i.2xlarge and the same thing happened again: it ate at least 40 GB within 10 minutes, then stabilized at the 45-50 GB level.
  7. The node kept syncing blocks until the OOM occurred.
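For anyone reproducing step 1, a minimal shell sketch; the archive name azero-mainnet-snapshot.tar.gz is hypothetical and only stands in for whatever file the official snapshot page currently provides, while the target directory is the one given above:

```bash
# Hypothetical archive name; substitute the file actually downloaded from the snapshot page.
SNAPSHOT=azero-mainnet-snapshot.tar.gz

mkdir -p /data/azero/chains/mainnet
tar -xzf "$SNAPSHOT" -C /data/azero/chains/mainnet
```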

Output of top:

MiB Mem :  63275.5 total,    425.0 free,  53201.3 used,   9649.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   9482.5 avail Mem

  PID USER  PR NI    VIRT    RES   SHR S %CPU %MEM     TIME+ COMMAND
    9 root  20  0  248.1g  51.1g 37128 S  3.7 82.6 213:40.56 aleph-node --validator --execution Native --name chainge_node --base-path /data/azero --pool-limit 1+
    1 root  20  0    4360   3384  3100 S  0.0  0.0   0:00.00 bash ./docker_entrypoint.sh

Flags of the RPC node:

--validator --execution Native --name chainge_node --base-path /data/azero \
  --pool-limit 100 --chain mainnet --node-key-file /data/azero/p2p_secret \
  --backup-path /data/azero/backup-stash --rpc-port 9934 --ws-port 9943 \
  --port 30334 --validator-port 30343 --rpc-cors all --no-mdns \
  --ws-max-connections 100 --unsafe-ws-external --unsafe-rpc-external \
  --enable-log-reloading --db-cache 1024 --runtime-cache-size 2 \
  --max-runtime-instances 8 --prometheus-external \
  --telemetry-url wss://telemetry.polkadot.io/submit/ 0 \
  --rpc-methods Safe --public-validator-addresses 192.0.2.1:30343 \
  --unit-creation-delay=300

Log from the node:

2023-07-03 11:33:47 ✨ Imported #51728417 (0x2b61…f303)
2023-07-03 11:33:48 💤 Idle (37 peers), best: #51728417 (0x2b61…f303), finalized #51728413 (0xad0c…4b12), ⬇ 39.0kiB/s ⬆ 32.6kiB/s
2023-07-03 11:33:48 ✨ Imported #51728418 (0x02da…4067)
2023-07-03 11:33:49 ✨ Imported #51728419 (0x855d…ca92)
2023-07-03 11:33:50 ✨ Imported #51728420 (0xaf5e…e375)
2023-07-03 11:33:51 ✨ Imported #51728421 (0x6a9f…8227)
2023-07-03 11:33:52 ✨ Imported #51728422 (0xb10f…31f2)
2023-07-03 11:33:53 💤 Idle (37 peers), best: #51728422 (0xb10f…31f2), finalized #51728414 (0x8f2a…6e16), ⬇ 29.7kiB/s ⬆ 20.3kiB/s
2023-07-03 11:33:53 ✨ Imported #51728423 (0x1f85…a951)
2023-07-03 11:33:54 ✨ Imported #51728424 (0xdc84…6892)

Did you attach relevant logs?

DamianStraszak commented 1 year ago

Thanks for the detailed report. Can I ask you to run just the node (without the sidecar) to make sure it works fine on its own?

My suspicion is that the sidecar is spamming the node with RPC queries, which fills the node's cache. It's probably possible to improve the situation by changing some of the flags. But for starters, it would be good if you could answer the question above.
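For illustration, this is the direction such flag changes could take; a hedged sketch only, reusing flags that already appear in the node's command line above with smaller, untested values:

```bash
# Sketch: lower cache and connection limits than the current setup.
# Every flag below already appears in the node's command line above; only the values
# differ, and these particular numbers are illustrative, not recommendations.
EXTRA_FLAGS="--db-cache 512 \
  --runtime-cache-size 1 \
  --max-runtime-instances 4 \
  --ws-max-connections 50 \
  --pool-limit 50"
```

These flags would replace the corresponding values in the existing aleph-node invocation.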

fanbsb commented 1 year ago

Thanks, Damian. At the beginning the node ran normally without RPC queries; it just synced blocks quietly. But as you know, the node itself doesn't provide many useful RPCs, such as querying a balance by address, which is why we need the sidecar. I can also check a Polkadot node to see whether it shows the same issue.

DamianStraszak commented 1 year ago

Yes, I understand, this makes sense. We need to test it to see whether the memory consumption can somehow be decreased when running alongside the sidecar.

To trigger this large memory consumption, did you just run the sidecar on its own, or did you also send some queries? This information would help me reproduce the problem.

fanbsb commented 1 year ago

We run the sidecar container together with the azero container, on one AWS instance. I just enabled the HTTP log level in the sidecar; the queries sent from our application are listed below.

2023-07-04 03:50:43 http: GET /blocks/51786977? 200 9ms
2023-07-04 03:50:43 http: GET /blocks/51786978? 200 8ms
2023-07-04 03:50:43 http: GET /blocks/51786979? 200 8ms
2023-07-04 03:50:44 http: GET /blocks/51786980? 200 7ms
2023-07-04 03:50:44 http: GET /blocks/51786981? 200 9ms

I don't think these simple block queries should be the root cause; I will keep monitoring the query log.
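For reproduction purposes, the logged requests above can be replayed directly against the sidecar's REST endpoint; a minimal sketch, assuming the sidecar is reachable locally on the SAS_EXPRESS_PORT (9956) from the setup description:

```bash
# Replay one of the logged sidecar queries (block number taken from the log above).
curl -s http://localhost:9956/blocks/51786977 | head -c 200
```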

DamianStraszak commented 1 year ago

Hi! I'm still trying to reproduce. Could you please provide the command you use to run the sidecar Docker container? Do you have a .env.docker file? If so, could you share its contents as well?

fanbsb commented 1 year ago

I attached the k8s StatefulSet YAML file below.

Attachment: statefulSet.txt

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: azero
  labels:
    app.kubernetes.io/name: azero
spec:
  serviceName: azero-node
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: azero
  template:
    metadata:
      labels:
        app.kubernetes.io/name: azero
    spec:
      containers:

fanbsb commented 1 year ago

Update from yesterday: I changed the sidecar log level to http so I can see the query log. After restarting the Pod, the memory usage looks normal and is not increasing dramatically.

Here is the output of the free command; you can see only around 6-7 GB used after one day of running.

root@azero-0:/node# free -h
               total        used        free      shared  buff/cache   available
Mem:            61Gi       6.9Gi        51Gi       3.0Mi       3.5Gi        54Gi
Swap:             0B          0B          0B
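If cluster metrics are available, the same numbers can also be checked from outside the container; a minimal sketch, assuming the pod is named azero-0 as in the prompt above and that metrics-server is installed in the EKS cluster:

```bash
# Per-container CPU and memory usage for the pod (requires metrics-server).
kubectl top pod azero-0 --containers
```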

DamianStraszak commented 1 year ago

Hi, thanks for the details. I experimented quite a lot and generally I cannot reproduce the high memory usage.

I used c5.2xlarge which has 16GB RAM, and could run your setup without issue (I didn't use EKS though).

My observations: when syncing from a snapshot, the node does indeed have increased RAM consumption -- in my case it peaked at ~12 GB. After syncing, I ran the sidecar, sent a number of queries, and left it running for a long time, and it stayed at ~12 GB.

Then I restarted aleph-node, after which it started using much less memory (note that it was already fully synced); in fact, after >24h it stabilized at around 6 GB RAM. The sidecar has been running the whole time, and I have sent lots of queries to it as well.

So what I can say is that 16 GB RAM on a c5.2xlarge is definitely enough, and I'm not sure what is causing the issues you are facing.

I saw that the situation improved for you. Let us know if the issue recurs; in that case, try restarting the aleph-node container and see whether that does the job.
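Since this deployment runs as a StatefulSet on EKS rather than plain Docker, the restart suggested above would roughly translate to the following; a minimal sketch, assuming the StatefulSet and pod names (azero / azero-0) from the attached manifest and earlier output:

```bash
# Restart the pod managed by the StatefulSet.
kubectl rollout restart statefulset/azero

# Or delete just the pod; the StatefulSet controller recreates it automatically.
kubectl delete pod azero-0
```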

fanbsb commented 1 year ago

Thanks, Damian. Today I checked the memory usage: 7.1 GB used, so it looks good. I will keep monitoring for a few more days and then downgrade the instance.

Actually, during the last memory shortage, k8s killed the container automatically and restarted it; the new container then started to resync blocks while RPC queries kept arriving at the same time, so it quickly ate up the memory and the shortage hit again, which made the container restart over and over.

I guess this is a rare case where a certain RPC query combined with block syncing causes the memory shortage and the restart loop. Since I don't know which RPC query triggers the problem, I will keep monitoring the status, and if it happens again I can find the query in the sidecar log.
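When it does happen again, one quick way to confirm that the restarts are really OOM kills rather than some other failure; a minimal sketch, assuming the pod name azero-0 from the earlier output:

```bash
# Show the last termination state of the pod's containers; an OOM kill appears as "OOMKilled".
kubectl describe pod azero-0 | grep -A 5 "Last State"
```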

fanbsb commented 11 months ago
(Attached screenshot: 2023-07-29 at 14:06)

The issue has appeared again in the last 3 days: memory usage quickly rises until it is full, and then the container is killed.

Shall I upgrade to the latest release?

fbielejec commented 11 months ago

Thanks for the report. We'll look into the issue shortly and try to reproduce it.

fbielejec commented 11 months ago

Could you please give us more context, specifically what the application in question is doing and what kind of queries it is making to the sidecar? Any detail that helps in reproducing the problem would be invaluable.

piotrMocz commented 11 months ago

Also, please upgrade to version 11.4. We definitely recommend keeping the node at the latest version (as seen on the releases page).

fanbsb commented 11 months ago

Thanks, we upgraded to 11.3 one week ago; it looks stable now. We will keep monitoring the node.

woocash2 commented 8 months ago

@fanbsb, please let us know if the problem reappears. We'll keep the issue open until 22nd October and then close it if we see no further updates.