casper-network / casper-node

Reference client for CASPER protocol
https://casper.network
Apache License 2.0

node panicked: Failed to download linear chain. #693

Closed. matsuro-hadouken closed this issue 3 years ago.

matsuro-hadouken commented 3 years ago

After a recent crash, and no success rejoining the network, I am trying to run the validator from scratch.

Service exit error:

Main process exited, code=killed, status=6/ABRT

The validator log is overloaded: it has produced about 4 GB in the last 30 minutes, mostly:

message received","msg":"payload: AddressGossiper::(gossip-response message received","msg":"payload: AddressGossiper::(gossip(gossiped-address

Warnings present in a log:

"WARN","fields":{"message":"Finality signatures not handled in joiner reactor"},"target":"casper_node::reactor::joiner","span":{"ev":7206551,"name":"dispatch events"},"spans":[{"ev":7206551,"name":"crank"},{"ev":7206551,"name":"dispatch events"}]}

"WARN","fields":{"message":"network announcement ignored.","other":"Consensus(Protocol { era_id.0: 214, .. })"},"target":"casper_node::reactor::joiner","span":{"ev":7239826,"name":"dispatch events"},"spans":[{"ev":7239826,"name":"crank"},{"ev":7239826,"name":"dispatch events"}]}

Most of the warning spam is of this form:

"WARN","fields":{"message":"NodeId::Tls(7e9d..fa72): outgoing connection failed","peer_address":

There are just two warnings about event size:

{"timestamp":"Dec 26 14:23:37.920","level":"WARN","fields":{"message":"large event size, consider reducing it or boxing","event_size":"160"},"target":"casper_node::reactor"}

{"timestamp":"Dec 26 14:23:38.245","level":"WARN","fields":{"message":"large event size, consider reducing it or boxing","event_size":"544"},"target":"casper_node::reactor"}

   0: casper_node::panic_hook
   1: core::ops::function::Fn::call
   2: std::panicking::rust_panic_with_hook
             at rustc/25f6938da459a57b43bdf16ed6bdad3225b2a3ce/library/std/src/panicking.rs:597:17
   3: std::panicking::begin_panic::{{closure}}
   4: std::sys_common::backtrace::__rust_end_short_backtrace
   5: std::panicking::begin_panic
   6: <casper_node::components::linear_chain_sync::LinearChainSync<I> as casper_node::components::Component<REv>>::handle_event
   7: <casper_node::reactor::joiner::Reactor as casper_node::reactor::Reactor>::dispatch_event
   8: casper_node::reactor::Runner<R>::crank::{{closure}}
   9: casper_node::cli::Cli::run::{{closure}}
  10: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  11: std::thread::local::LocalKey<T>::with
  12: tokio::runtime::enter::Enter::block_on
  13: tokio::runtime::thread_pool::ThreadPool::block_on
  14: tokio::runtime::context::enter
  15: tokio::runtime::handle::Handle::enter
  16: casper_node::main
  17: std::sys_common::backtrace::__rust_begin_short_backtrace
  18: std::rt::lang_start::{{closure}}
  19: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
             at rustc/25f6938da459a57b43bdf16ed6bdad3225b2a3ce/library/core/src/ops/function.rs:259:13
      std::panicking::try::do_call
             at rustc/25f6938da459a57b43bdf16ed6bdad3225b2a3ce/library/std/src/panicking.rs:381:40
      std::panicking::try
             at rustc/25f6938da459a57b43bdf16ed6bdad3225b2a3ce/library/std/src/panicking.rs:345:19
      std::panic::catch_unwind
             at rustc/25f6938da459a57b43bdf16ed6bdad3225b2a3ce/library/std/src/panic.rs:396:14
      std::rt::lang_start_internal
             at rustc/25f6938da459a57b43bdf16ed6bdad3225b2a3ce/library/std/src/rt.rs:51:25
  20: main
  21: __libc_start_main
  22: _start

node panicked: Failed to download linear chain.
goral09 commented 3 years ago

Thank you for submitting the issue.

I had a quick glance at the logs you provided and at the code, and I have an initial theory:

It looks like your node runs out of peers before downloading a block with a specific hash.

tl;dr description of the algorithm:

The node first gets hold of a block's hash (either the trusted hash you configured it with, or the hash of some other block it learned about while fetching the chain of headers), and some time later it tries to download that block. It asks all of its peers one by one, and if it runs out of peers before it manages to download the block, it panics.

Given that there's a lot of

"WARN","fields":{"message":"NodeId::Tls(7e9d..fa72): outgoing connection failed","peer_address":

these two issues might be related, i.e. the panic may be caused by the failed outgoing connections leaving the node with no peers to download from.

MParlikar commented 3 years ago

Fixed. Closing