anza-xyz / agave

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://www.anza.xyz/
Apache License 2.0

Network crash at 4 billion active stake milestone #3691

Open joseluis-varo-wsg opened 1 day ago


Problem

Hello,

We deployed a network from scratch with 4 validator nodes. Each node started out with 5 SOL, and we later delegated a billion SOL to each. Recently we reached 4 billion SOL of active stake (for comparison, Solana mainnet has yet to reach 400 million as of writing), and the network crashed. We can't boot it back up either: our last snapshot is from shortly before the epoch change and the subsequent 4-billion-stake milestone.
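For scale, a quick sanity check on those numbers (a standalone sketch; the only assumption is the standard 1 SOL = 10^9 lamports conversion, this is not code from the validator):

```rust
fn main() {
    // 4 billion SOL of active stake, expressed in lamports.
    let total_stake_sol: u64 = 4_000_000_000;
    let lamports_per_sol: u64 = 1_000_000_000;
    let total_lamports = total_stake_sol.checked_mul(lamports_per_sol).unwrap();

    // Matches the magnitude the bank logs below: total_stake: 4000000019999973760
    assert_eq!(total_lamports, 4_000_000_000_000_000_000);

    // u64::MAX is ~1.8e19, so the raw stake itself fits comfortably in u64;
    // it is only intermediate products that can grow past machine widths.
    assert!(total_lamports < u64::MAX / 4);
    println!("total stake: {total_lamports} lamports");
}
```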

This is our error log:

[2024-11-14T01:26:25.974747961Z INFO  solana_runtime::bank] new epoch stakes, epoch: 119, total_stake: 4000000019999973760
thread '<unnamed>' panicked at /opt/bins/solana/1.17.27/solana-1.17.27/programs/stake/src/stake_state.rs:436:10:
called `Option::unwrap()` on a `None` value

stack backtrace:
   0: rust_begin_unwind
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
   1: core::panicking::panic_fmt
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:67:14
   2: core::panicking::panic
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:117:5
   3: solana_stake_program::stake_state::redeem_rewards
   4: solana_runtime::bank::Bank::redeem_rewards::{{closure}}::{{closure}}
   5: rayon::iter::plumbing::Folder::consume_iter
   6: rayon::iter::plumbing::bridge_producer_consumer::helper
   7: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
   8: rayon_core::registry::WorkerThread::wait_until_cold
   9: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
stack backtrace:
   0: rust_begin_unwind
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
   1: core::panicking::panic_fmt
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:67:14
   2: core::panicking::panic
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:117:5
   3: solana_stake_program::stake_state::redeem_rewards
   4: solana_runtime::bank::Bank::redeem_rewards::{{closure}}::{{closure}}
   5: rayon::iter::plumbing::Folder::consume_iter
   6: rayon::iter::plumbing::bridge_producer_consumer::helper
   7: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
   8: rayon_core::registry::WorkerThread::wait_until_cold
[2024-11-14T01:26:26.396545175Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="?" one=1i message="panicked at /opt/bins/solana/1.17.27/solana-1.17.27/programs/stake/src/stake_state.rs:436:10:
    called `Option::unwrap()` on a `None` value" location="/opt/bins/solana/1.17.27/solana-1.17.27/programs/stake/src/stake_state.rs:436:10" version="1.17.27 (src:00000000; feat:3746964731, client:SolanaLabs)"
   9: rayon_core::join::join_context::{{closure}}
  10: rayon_core::registry::in_worker
  11: rayon::iter::plumbing::bridge_producer_consumer::helper
  12: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  13: rayon_core::registry::WorkerThread::wait_until_cold
  14: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
[2024-11-14T01:26:26.452182033Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="?" one=1i message="panicked at /opt/bins/solana/1.17.27/solana-1.17.27/programs/stake/src/stake_state.rs:436:10:
    called `Option::unwrap()` on a `None` value" location="/opt/bins/solana/1.17.27/solana-1.17.27/programs/stake/src/stake_state.rs:436:10" version="1.17.27 (src:00000000; feat:3746964731, client:SolanaLabs)"
[2024-11-14T01:26:26.452475536Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="?" one=1i message="panicked at /opt/bins/solana/1.17.27/solana-1.17.27/programs/stake/src/stake_state.rs:436:10:
    called `Option::unwrap()` on a `None` value" location="/opt/bins/solana/1.17.27/solana-1.17.27/programs/stake/src/stake_state.rs:436:10" version="1.17.27 (src:00000000; feat:3746964731, client:SolanaLabs)"

Since we're seeing a None right after a checked_mul(), and we are dealing with ...slightly oversized numbers, we suspect an integer overflow. Then again, u128 should leave ample room for that multiplication, so we're unsure how it could possibly overflow.
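To illustrate how a u128 checked_mul can still come up empty at this scale, here is a back-of-envelope sketch. The variable names (stake_lamports, credits, epoch_rewards) and the magnitudes chosen are illustrative assumptions, not the exact variables or values in the Solana rewards code; the point is only that a product of stake, credits, and a lamport-denominated reward pot can exceed u128::MAX (~3.4e38):

```rust
fn main() {
    // ~4 billion SOL of active stake, in lamports (~4e18, fits in u64/u128).
    let stake_lamports: u128 = 4_000_000_000_000_000_000;

    // Vote credits accumulate over epochs; assume a large-ish total here.
    let credits: u128 = 100_000_000;

    // points ~ stake * credits ~ 4e26: still well within u128.
    let points = stake_lamports.checked_mul(credits).unwrap();

    // A hypothetical epoch reward pot, also in lamports.
    let epoch_rewards: u128 = 2_000_000_000_000_000_000;

    // points * rewards ~ 8e44 > u128::MAX ~ 3.4e38, so checked_mul
    // returns None -- and an .unwrap() on that None would panic just
    // like the one in the log above.
    assert!(points.checked_mul(epoch_rewards).is_none());
    println!("overflow reproduced: checked_mul returned None");
}
```

If the real calculation multiplies quantities of comparable magnitude before dividing, that would explain a None despite the u128 headroom.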

Software version is v1.17.34 (the paths and version string in the log above are from a v1.17.27 run).

We had previously tracked down the affected line, expecting a fix in the latest versions, so for convenience's sake, here it is: it was moved to rewards.rs in a refactor, albeit with little change to the underlying logic, and we haven't seen any indication that this behavior varies across versions.

(Note: reopening this here after realizing the old repo is no longer in use, so the links above won't quite match)

Proposed Solution

None, I'm afraid.
