anza-xyz / agave

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://www.anza.xyz/
Apache License 2.0

RPC Node Memory Leak Frequently #1908

Open KaiQiu9527 opened 1 week ago

KaiQiu9527 commented 1 week ago

Problem

Proposed Solution

Hi Dev Team: I'm running a Solana RPC node in the Hong Kong region on Huawei Cloud. It runs well for the first few hours, then it falls behind and memory usage grows rapidly until the process is killed for OOM. I have tried different specifications, such as 32U256G and 64U512G, and both crashed with OOM. Is there any way to find out the reason?

Specification:
Machine: m7n.16xlarge.8 (64U512G, 3rd Generation Intel® Xeon® Scalable Processor)
Disk: 2T ESSD (I/O throughput up to 1000 MB/s)
Network: 10 Gbps

validator.sh is:

#!/bin/bash

exec solana-validator \
    --identity /home/sol/validator-keypair.json \
    --known-validator 7Np41oeYqPefeNQEHSv1UDhYrehxin3NStELsSKCT4K2 \
    --known-validator GdnSyH3YtwcxFvQrVVJMm1JhTS4QVX7MFsX56uJLUfiZ \
    --known-validator DE1bawNcRJB9rVm3buyMVfr8mBEoyyu73NBovf2oXJsJ \
    --known-validator CakcnaRDHka2gXyfbEd2d3xsvkJkqsLw2akB3zsN1D2S \
    --full-rpc-api \
    --no-voting \
    --ledger /mnt/solana/ledger \
    --accounts /mnt/solana/accounts \
    --log /home/sol/solana-rpc.log \
    --rpc-port 8899 \
    --rpc-bind-address 0.0.0.0 \
    --private-rpc \
    --dynamic-port-range 8000-8020 \
    --entrypoint entrypoint.mainnet-beta.solana.com:8001 \
    --entrypoint entrypoint2.mainnet-beta.solana.com:8001 \
    --entrypoint entrypoint3.mainnet-beta.solana.com:8001 \
    --entrypoint entrypoint4.mainnet-beta.solana.com:8001 \
    --entrypoint entrypoint5.mainnet-beta.solana.com:8001 \
    --expected-genesis-hash 5eykt4UsFv8P8NJdTREpY1vzqKqZKvdpKuc147dw2N9d \
    --wal-recovery-mode skip_any_corrupted_record \
    --limit-ledger-size \
    --maximum-local-snapshot-age 20000

KaiQiu9527 commented 1 week ago

The last minute log before OOM is attached: solana-rpc-last-1-minute.log
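Since the log only covers the last minute, a record of how memory grows over the preceding hours might help narrow down the cause. The following is a minimal sketch, assuming a Linux host with /proc mounted; the function names `rss_kb` and `log_rss` and the output path are hypothetical and not part of any Solana tooling.

```shell
#!/bin/bash
# Print the resident set size (VmRSS, in kB) of a process, read from /proc.
rss_kb() {
    local pid="$1"
    awk '/^VmRSS:/ {print $2}' "/proc/${pid}/status"
}

# Append "epoch_seconds rss_kb" lines to a file until the process exits.
# Arguments: PID, output file, sampling interval in seconds (default 10).
log_rss() {
    local pid="$1" out="$2" interval="${3:-10}"
    while kill -0 "$pid" 2>/dev/null; do
        echo "$(date +%s) $(rss_kb "$pid")" >> "$out"
        sleep "$interval"
    done
}
```

With the validator's PID in hand, something like `log_rss <PID> /home/sol/rss.log 10 &` would leave a time series that can be lined up against the timestamps in solana-rpc.log to see when the growth starts.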

jie35752321 commented 1 week ago

Get a machine with more memory

KaiQiu9527 commented 1 week ago

> Get a machine with more memory

The recommended specification is 512 GB of memory, yet it was running well on a 64U256G machine on GCP. It's weird.

jie35752321 commented 1 week ago

Don't trust the recommendation. The first time, I used 512 GB of memory and then couldn't catch up with the block height, and nothing I tried fixed it. Changing to 768 GB worked.

KaiQiu9527 commented 1 week ago

> Don't trust the recommendation. The first time, I used 512 GB of memory and then couldn't catch up with the block height, and nothing I tried fixed it. Changing to 768 GB worked.

What is the memory usage now that you have changed to 768 GB?

jie35752321 commented 1 week ago

In the case of swap, it is now 65%

KaiQiu9527 commented 1 week ago

> In the case of swap, it is now 65%

Hi bro, where are you from? Can we DM? I'd like to learn from your experience with Solana deployment 😊

rafaelsantana-mb commented 3 days ago

@KaiQiu9527 I'm sharing with you an experience I had recently...

Running the validator with the --no-skip-initial-accounts-db-clean parameter may help you avoid OOM on hardware with less memory.

It forces the validator to process accounts-db before it starts syncing. The downside is that after any restart your node will be many slots behind and will take longer to catch up to the latest slots. So it's a trade-off.
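As a rough sketch of where the flag would go in the validator.sh from the original post: the version below keeps the arguments in a bash array so a flag can carry its own comment. Only a subset of the original flags is shown; the rest (known validators, entrypoints, RPC options and so on) would need to be added back. The names `validator_args` and `start_validator` are hypothetical.

```shell
#!/bin/bash
# Subset of the flags from the original validator.sh, held in an array.
# The remaining flags from that script are omitted here for brevity.
validator_args=(
    --identity /home/sol/validator-keypair.json
    --full-rpc-api
    --no-voting
    --ledger /mnt/solana/ledger
    --accounts /mnt/solana/accounts
    --log /home/sol/solana-rpc.log
    --limit-ledger-size
    # Suggested in this thread: process accounts-db before syncing starts.
    --no-skip-initial-accounts-db-clean
)

# Replace the shell with the validator process, passing the array as arguments.
start_validator() {
    exec solana-validator "${validator_args[@]}"
}
```

Calling `start_validator` as the last line of the script would launch the node. Expect a longer startup and a longer catch-up after each restart, as described above.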