anza-xyz / agave

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://www.anza.xyz/
Apache License 2.0

Validator not cleaning, memory grows until OOM #2126

Open mxval opened 2 months ago

mxval commented 2 months ago

Problem

I have observed this on versions v1.18.15 and v1.18.18 on mainnet. I am running Jito, but based on the observations below I think this is an issue in solana/agave. Memory grows rapidly after validator restart:

(image: memory usage plot showing rapid growth after the restart)

The validator usually runs at a steady state of around 100 GB, but after the restart (in the plot above) memory grows very rapidly until the process crashes. It appears that the normal account cleaning operation is not happening for some reason. The validator ran for just under an hour before running out of memory, without any clean_accounts metrics being reported.

This has been happening fairly consistently over the last few restarts when restarting from a local snapshot (not an archive). As a workaround I have forced the validator to use a snapshot archive on restart and added the --no-skip-initial-accounts-db-clean argument, which seems to work at the cost of much longer restart times. Unfortunately this argument seems to have no effect when restarting from a non-archive snapshot.

EDIT:

I should have stated that I have 256 GB of RAM, which has been plenty when account cleaning runs.

Proposed Solution

Investigate why accounts are not being cleaned and fix the memory growth issue at startup.

mxval commented 2 months ago

It seems others have seen this issue as well: https://discord.com/channels/428295358100013066/689412830075551748/1249691656400146454

gritsly commented 1 month ago

Came here to report the same issue; I have been testing various configs recently and ran into this. Here are my findings:

I have saved some logs if they are needed, though they are probably only at WARN level. I can easily reproduce this, though.

naviat commented 1 month ago

@gritsly I encountered an OOM error for a node that has 950GB of memory

jstarry commented 3 weeks ago

I ran into this issue as well and pulled some logs that will hopefully be helpful. Before running into this issue my validator was restarted from a fetched snapshot.

Run number 1 (OOM)

[2024-08-18T02:30:37.271249816Z INFO  solana_validator] Starting validator with: ArgsOs {

[2024-08-18T02:35:59.433139724Z INFO  solana_runtime::snapshot_bank_utils] Rebuilding bank from full snapshot /mnt/incremental-snapshots/snapshot/tmp-snapshot-archive-4bdbjV/snapshots/284238595/284238595 and incremental snapshot Some("/mnt/incremental-snapshots/snapshot/tmp-snapshot-archive-FUcxyJ/snapshots/284262123/284262123")

[2024-08-18T02:40:37.688985990Z INFO  solana_runtime::bank] Verifying bank... Done.
[2024-08-18T02:40:37.689001461Z INFO  solana_metrics::metrics] datapoint: verify_snapshot_bank clean_us=3i shrink_us=2i verify_accounts_us=315i verify_bank_us=1921i
[2024-08-18T02:40:37.707719115Z INFO  solana_metrics::metrics] datapoint: bank_from_snapshot_archives untar_full_snapshot_archive_us=285952460i untar_incremental_snapshot_archive_us=19203001i rebuild_bank_us=278255653i verify_bank_us=2253i
[2024-08-18T02:40:37.781550303Z INFO  solana_metrics::metrics] datapoint: accounts_db_active hash=1i
[2024-08-18T02:40:37.783356194Z INFO  solana_metrics::metrics] datapoint: accounts_db_active hash_scan=1i
[2024-08-18T02:40:37.885030123Z INFO  solana_runtime::accounts_background_service] AccountsBackgroundService has started
[2024-08-18T02:40:37.885002733Z INFO  solana_core::accounts_hash_verifier] AccountsHashVerifier has started
[2024-08-18T02:40:37.885202946Z INFO  solana_metrics::metrics] datapoint: accounts_db_active flush=1i
[2024-08-18T02:40:37.885229757Z INFO  solana_metrics::metrics] datapoint: accounts_db_active flush=0i
[2024-08-18T02:40:37.885235467Z INFO  solana_metrics::metrics] datapoint: accounts_db_active clean=1i

[2024-08-18T02:41:05.596198370Z INFO  solana_runtime::bank_forks] Not sending snapshot request for bank: 284262232, startup verification is incomplete
[2024-08-18T02:41:43.981387128Z INFO  solana_runtime::bank_forks] Not sending snapshot request for bank: 284262335, startup verification is incomplete
[2024-08-18T02:42:23.054685018Z INFO  solana_runtime::bank_forks] Not sending snapshot request for bank: 284262443, startup verification is incomplete
[2024-08-18T02:42:43.944145391Z INFO  solana_runtime::bank] Initial background accounts hash verification has stopped

... accounts cache slowly builds up until OOM

[2024-08-18T03:04:09.622194316Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=6735i num_slots=6804i total_unique_writes_size=131617872543i total_size=131616572421i
Aug 18 03:04:15 kernel: Out of memory: Killed process 1605974 (solana-validato) total-vm:845464588kB, anon-rss:255710144kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:1569680kB oom_score_adj:0

Run number 2 (Recovered) (using slightly newer full snapshot):

[2024-08-18T03:04:30.520681393Z INFO  solana_validator] Starting validator with: ArgsOs {

[2024-08-18T03:09:41.202484698Z INFO  solana_runtime::snapshot_bank_utils] Rebuilding bank from full snapshot /mnt/incremental-snapshots/snapshot/tmp-snapshot-archive-KXomWC/snapshots/284264164/284264164 and incremental snapshot Some("/mnt/incremental-snapshots/snapshot/tmp-snapshot-archive-0tvC0r/snapshots/284272351/284272351")

[2024-08-18T03:14:13.452804047Z INFO  solana_runtime::snapshot_bank_utils] Rebuilding status cache from /mnt/incremental-snapshots/snapshot/tmp-snapshot-archive-0tvC0r/snapshots/status_cache
[2024-08-18T03:14:13.930103479Z INFO  solana_runtime::bank] Verifying bank... Done.
[2024-08-18T03:14:13.930140289Z INFO  solana_metrics::metrics] datapoint: verify_snapshot_bank clean_us=7i shrink_us=2i verify_accounts_us=414i verify_bank_us=2363i
[2024-08-18T03:14:13.948393652Z INFO  solana_metrics::metrics] datapoint: bank_from_snapshot_archives untar_full_snapshot_archive_us=274833173i untar_incremental_snapshot_archive_us=14786261i rebuild_bank_us=272724736i verify_bank_us=2804i
[2024-08-18T03:14:14.018073775Z INFO  solana_metrics::metrics] datapoint: accounts_db_active hash=1i
[2024-08-18T03:14:14.019859396Z INFO  solana_metrics::metrics] datapoint: accounts_db_active hash_scan=1i
[2024-08-18T03:14:14.937976544Z INFO  solana_core::snapshot_packager_service] SnapshotPackagerService has started
[2024-08-18T03:14:14.937992614Z INFO  solana_core::accounts_hash_verifier] AccountsHashVerifier has started
[2024-08-18T03:14:14.938119746Z INFO  solana_runtime::accounts_background_service] AccountsBackgroundService has started
[2024-08-18T03:15:09.944779823Z INFO  solana_metrics::metrics] datapoint: accounts_db_active flush=1i
[2024-08-18T03:15:09.946072246Z INFO  solana_metrics::metrics] datapoint: accounts_db_active flush=0i
[2024-08-18T03:15:09.946084506Z INFO  solana_metrics::metrics] datapoint: accounts_db_active clean=1i

[2024-08-18T03:16:01.976060600Z INFO  solana_runtime::bank_forks] Not sending snapshot request for bank: 284272551, startup verification is incomplete
[2024-08-18T03:16:12.310976267Z INFO  solana_runtime::bank] Initial background accounts hash verification has stopped

... accounts cache grows like crazy

[2024-08-18T03:29:19.659891699Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=3888i num_slots=3946i total_unique_writes_size=75418807353i total_size=75418109922i
[2024-08-18T03:29:19.988766576Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=3889i num_slots=3947i total_unique_writes_size=75437196143i total_size=75436498712i
[2024-08-18T03:29:20.186174140Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=3890i num_slots=3948i total_unique_writes_size=75454632853i total_size=75453935422i
[2024-08-18T03:29:20.391397612Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=3891i num_slots=3949i total_unique_writes_size=75471005148i total_size=75470307717i
[2024-08-18T03:29:20.596713607Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=3892i num_slots=3950i total_unique_writes_size=75484497524i total_size=75483800093i
[2024-08-18T03:29:20.819914108Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=3893i num_slots=3951i total_unique_writes_size=75503711819i total_size=75503014388i
[2024-08-18T03:29:21.197054743Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=3894i num_slots=3952i total_unique_writes_size=75514456740i total_size=75513759309i
[2024-08-18T03:29:21.422851350Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=3895i num_slots=3953i total_unique_writes_size=75539096520i total_size=75538398759i
[2024-08-18T03:29:21.604091216Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=3896i num_slots=3954i total_unique_writes_size=75549448873i total_size=75548751112i
[2024-08-18T03:29:21.820010869Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=3897i num_slots=3955i total_unique_writes_size=75568727710i total_size=75568029949i
[2024-08-18T03:29:22.089812417Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=3898i num_slots=3956i total_unique_writes_size=75578721687i total_size=75578023926i
[2024-08-18T03:29:22.273088440Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=3899i num_slots=3957i total_unique_writes_size=75598063811i total_size=75597365052i
[2024-08-18T03:29:22.469498686Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=3900i num_slots=3958i total_unique_writes_size=75611102208i total_size=75610403449i
[2024-08-18T03:29:22.660216121Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=3901i num_slots=3959i total_unique_writes_size=75636945364i total_size=75636246605i
[2024-08-18T03:29:22.924053044Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=3902i num_slots=3960i total_unique_writes_size=75654685847i total_size=75653987088i
[2024-08-18T03:29:23.147595972Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=3903i num_slots=3961i total_unique_writes_size=75680395902i total_size=75679697143i
[2024-08-18T03:29:23.346844447Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=3904i num_slots=3962i total_unique_writes_size=75684512172i total_size=75683813413i
[2024-08-18T03:29:23.657677435Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=3905i num_slots=3963i total_unique_writes_size=75703629613i total_size=75702930854i
[2024-08-18T03:29:23.850947264Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=3906i num_slots=3964i total_unique_writes_size=75721570991i total_size=75720870904i
[2024-08-18T03:29:24.098776463Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=3907i num_slots=3941i total_unique_writes_size=75366780696i total_size=75366082605i
[2024-08-18T03:29:24.317991264Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=3908i num_slots=3940i total_unique_writes_size=75353971172i total_size=75353272083i

[2024-08-18T03:29:18.239257454Z INFO  solana_metrics::metrics] datapoint: accounts_db_active clean=0i
[2024-08-18T03:29:18.239441817Z INFO  solana_metrics::metrics] datapoint: accounts_db_active squash_ancient=1i
[2024-08-18T03:29:23.212064226Z INFO  solana_metrics::metrics] datapoint: accounts_db_active squash_ancient=0i
[2024-08-18T03:29:23.847223588Z INFO  solana_metrics::metrics] datapoint: accounts_background_service duration_since_previous_submit_us=908909067i num_iterations=1i cumulative_runtime_us=908909067i mean_runtime_us=908909067i min_runtime_us=908909067i max_runtime_us=908909067i
[2024-08-18T03:29:24.317991264Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=3908i num_slots=3940i total_unique_writes_size=75353971172i total_size=75353272083i
[2024-08-18T03:29:24.541575391Z INFO  solana_metrics::metrics] datapoint: handle_snapshot_requests num_outstanding_requests=37i num_re_enqueued_requests=0i enqueued_time_us=2257651i

... accounts cache size then drops dramatically

[2024-08-18T03:29:24.558306729Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=9i num_slots=3941i total_unique_writes_size=75373470282i total_size=75372767201i
[2024-08-18T03:29:24.829340939Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=10i num_slots=3929i total_unique_writes_size=75134872352i total_size=75149962239i
[2024-08-18T03:29:25.092436658Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=11i num_slots=3917i total_unique_writes_size=74905491373i total_size=74904788284i
[2024-08-18T03:29:25.344323838Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=12i num_slots=3907i total_unique_writes_size=74701544727i total_size=74700845630i
[2024-08-18T03:29:25.642067673Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=13i num_slots=3892i total_unique_writes_size=74403490578i total_size=74402791481i
[2024-08-18T03:29:25.804533246Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=14i num_slots=3886i total_unique_writes_size=74300242146i total_size=74299543049i
[2024-08-18T03:29:25.973037426Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=15i num_slots=3880i total_unique_writes_size=74162444149i total_size=74161746380i
[2024-08-18T03:29:26.152239087Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=16i num_slots=3873i total_unique_writes_size=74041525386i total_size=74040827452i
[2024-08-18T03:29:26.532439805Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=17i num_slots=3858i total_unique_writes_size=73728458383i total_size=73727763938i
[2024-08-18T03:29:26.723768210Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=18i num_slots=3851i total_unique_writes_size=73575647587i total_size=73574961126i
[2024-08-18T03:29:26.919120487Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=19i num_slots=3843i total_unique_writes_size=73435963882i total_size=73435278254i
[2024-08-18T03:29:27.159797198Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=20i num_slots=3831i total_unique_writes_size=73216497350i total_size=73215812720i
[2024-08-18T03:29:27.421965211Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=21i num_slots=3820i total_unique_writes_size=73018707056i total_size=73018023919i
[2024-08-18T03:29:27.638219959Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=22i num_slots=3812i total_unique_writes_size=72840005981i total_size=72839322176i
[2024-08-18T03:29:27.891561425Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=23i num_slots=3800i total_unique_writes_size=72602197688i total_size=72601515211i
[2024-08-18T03:29:28.121210130Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=24i num_slots=3789i total_unique_writes_size=72404523890i total_size=72403844407i
[2024-08-18T03:29:28.381992569Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=25i num_slots=3777i total_unique_writes_size=72166315876i total_size=72165637556i
[2024-08-18T03:29:28.603866856Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=26i num_slots=3768i total_unique_writes_size=72003070699i total_size=72002392709i
[2024-08-18T03:29:28.783105337Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=27i num_slots=3760i total_unique_writes_size=71874873725i total_size=71874196733i
[2024-08-18T03:29:28.949846846Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=28i num_slots=3753i total_unique_writes_size=71761167161i total_size=71760489839i

[2024-08-18T03:31:13.866343983Z INFO  solana_runtime::snapshot_bank_utils] Creating bank snapshot for slot 284276319, path: /mnt/incremental-snapshots/snapshot/284276319/284276319.pre

jstarry commented 3 weeks ago

@brooksprumo is this a known issue? It looks like on the latest v1.18, after startup we clean the accounts db for a long time, which blocks any snapshot handling and causes the accounts cache memory usage to climb steadily until the cleaning is finished. I've never looked into accounts cleaning before and didn't see any logs about it, so I'm not sure what's happening internally. Can you shed some light on what may be happening?

I determined the call site to be https://github.com/anza-xyz/agave/blob/34409e70a69eecc30dd15977d7d9dd1456c5b5b3/runtime/src/accounts_background_service.rs#L708 based on the long duration of the accounts_background_service datapoint in my comment above (duration_since_previous_submit_us=908909067, i.e. roughly 15 minutes for a single iteration).

brooksprumo commented 3 weeks ago

The snapshot handling isn't the issue; it's that we aren't flushing the accounts write cache. The AccountsBackgroundService loop does flush, clean, then shrink. The first clean has to do a lot more work than in normal/steady-state operation, which blocks flush.

That first clean can take a long time -- long enough that the number of slots in the accounts write cache can trigger an OOM. I've mostly only seen reports of this on under-spec'd machines (CPU/RAM/disk).
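
To make the ordering concrete, here is a minimal Rust sketch of the flush → clean → shrink sequence described above. The `AccountsDb` type and method names are illustrative stand-ins, not the actual agave implementation; the point is the sequencing: while the long initial clean runs, no flush happens, so rooted slots pile up in the write cache.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for the accounts db / write cache; the type and
// method names are illustrative only, not the actual agave types.
struct AccountsDb;

impl AccountsDb {
    // Flush rooted slots out of the in-memory write cache into account storage.
    fn flush_accounts_cache(&self) { /* ... */ }
    // Drop dead/old account versions; the first run after a snapshot restart
    // has far more work to do than in steady state.
    fn clean_accounts(&self) { /* ... */ }
    // Compact storages that are mostly dead.
    fn shrink_candidate_slots(&self) { /* ... */ }
}

// One pass per iteration: flush, then clean, then shrink, strictly in
// sequence. While the initial clean is still running, no flush happens, so
// every newly rooted slot stays resident in the write cache -- the
// accounts_cache_size growth visible in the logs above.
fn accounts_background_loop(db: &AccountsDb, exit: &AtomicBool) {
    while !exit.load(Ordering::Relaxed) {
        db.flush_accounts_cache();
        db.clean_accounts();
        db.shrink_candidate_slots();
        thread::sleep(Duration::from_millis(100));
    }
}
```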

Changes are in master (as of a few days ago) to help RAM usage in clean, but they don't affect RAM usage of the accounts write cache while flush is blocked. If clean runs faster, it will help, though. More work is going into improving clean too.

Back in v1.16 we added a hidden CLI arg to work around this issue: --no-skip-initial-accounts-db-clean. This runs clean before we begin normal validator operations. Catchup will take longer since the node will start replay later.
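
For contrast, a sketch of what that workaround changes, reusing the hypothetical `AccountsDb` from the sketch above (with `skip_initial_clean` standing in for the flag's effect, not a real parameter name): the expensive clean runs once during startup, before replay begins rooting new slots into the write cache.

```rust
// Hypothetical startup flow; `AccountsDb` is the illustrative type from the
// sketch above, and `skip_initial_clean` stands in for the effect of the flag.
fn startup(db: &AccountsDb, skip_initial_clean: bool) {
    // ... load bank from snapshot, verify accounts, etc. ...
    if !skip_initial_clean {
        // --no-skip-initial-accounts-db-clean: pay the clean cost up front,
        // before replay starts rooting new slots into the write cache.
        db.clean_accounts();
    }
    // ... then spawn AccountsBackgroundService and begin replay / catchup ...
}
```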

Can you send me (DM or here) the pubkey of the node so I can look at the metrics?

diman-io commented 3 weeks ago

I think the original poster started from the bank (local snapshot state), not the archive. I might be wrong, but it seems the no-skip flag doesn't apply in that case.

Update: sorry for the noise, I'm on my phone. After carefully rereading the first message, I see it's already mentioned there.

diman-io commented 3 weeks ago

What I want to say is that this is exactly the kind of case where several circumstances coincide:

  1. Nobody usually removes a flag from the config that once helped solve a problem, so the flag is treated as "good". When using a setup with validator switching, the extra 30 minutes don't matter. So --no-skip-... will be in the config.
  2. Nobody usually adds new parameters to the launch configuration, relying on the developers to make sure the defaults are good. So --use-snapshot-archives-... won't be in the config.
  3. solana-validator exit is used for the restart...

jstarry commented 3 weeks ago

> The snapshot handling isn't the issue; it's that we aren't flushing the accounts write cache. The AccountsBackgroundService loop does flush, clean, then shrink. The first clean has to do a lot more work than in normal/steady-state operation, which blocks flush.

Got it, makes sense. My mistake for conflating snapshot creation with the flush.

> That first clean can take a long time -- long enough that the number of slots in the accounts write cache can trigger an OOM. I've mostly only seen reports of this on under-spec'd machines (CPU/RAM/disk).

My node is running with an EPYC 7443P, 256 GB of RAM, and separate NVMe disks for the ledger and accounts/snapshots.

> Changes are in master (as of a few days ago) to help RAM usage in clean, but they don't affect RAM usage of the accounts write cache while flush is blocked. If clean runs faster, it will help, though. More work is going into improving clean too.

Yeah improving clean speed is one way to fix the issue. Being able to run flushes during a clean would be nice but I have no idea how accounts db works :)

> Back in v1.16 we added a hidden CLI arg to work around this issue: --no-skip-initial-accounts-db-clean. This runs clean before we begin normal validator operations. Catchup will take longer since the node will start replay later.

I'd actually prefer faster startup with the risk of OOM. But it's good to know this arg can help when the OOM issue arises.

> Can you send me (DM or here) the pubkey of the node so I can look at the metrics?

Sorry, I don't report metrics. Anything in particular you want to look into?

brooksprumo commented 3 weeks ago

> Being able to run flushes during a clean would be nice but I have no idea how accounts db works :)

Currently that's not possible, but I imagine we could make it work.
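
Purely as a hypothetical illustration of that idea (not how agave is structured today), one shape would be to drive flush from its own thread so the write cache keeps draining even while a long clean runs elsewhere, reusing the illustrative `AccountsDb` from the earlier sketch. The hard part this glosses over is exactly the synchronization between flush and clean that makes it non-trivial now.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

// Hypothetical only: flush on a dedicated thread, independent of the
// clean/shrink loop, so the write cache keeps draining during a long clean.
// `AccountsDb` is the illustrative type from the earlier sketch; real code
// would need to reconcile flush with an in-progress clean.
fn spawn_flush_thread(db: Arc<AccountsDb>, exit: Arc<AtomicBool>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        while !exit.load(Ordering::Relaxed) {
            db.flush_accounts_cache();
            thread::sleep(Duration::from_millis(100));
        }
    })
}
```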

> Sorry, I don't report metrics. Anything in particular you want to look into?

I was going to see if anything else jumped out as "off". I have a dashboard with about 100 cells and wanted to compare your node with another mnb node. Picking out particular metrics a few at a time likely won't be all that useful.