datafuselabs / openraft

rust raft with improvements
Apache License 2.0
1.36k stars 153 forks source link

panicked at 'assertion failed: `(left == right)` when adding learner #471

Closed ppamorim closed 2 years ago

ppamorim commented 2 years ago

When calling app.raft.add_learner(node_id, Some(node), true) the application crashes and a error happens.

Payload:

POST - /add-learner [8117459876464397197, "0.0.0.0:8101"]

Error:

thread 'actix-server worker 1' panicked at 'assertion failed: `(left == right)`
  left: `2`,
 right: `0`: zero-th log entry must be (0,0,0), but 0-0 2', /Users/ppamorim/.cargo/git/checkouts/openraft-0677bbb78cc57964/db047b9/openraft/src/raft_types.rs:66:13

0: rust_begin_unwind
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/panicking.rs:143:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
   4: openraft::raft_types::LogId<NID>::new
   5: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
   6: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
   7: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
   8: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
   9: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  10: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  11: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  12: sentry_core::hub::Hub::run
  13: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  14: <actix_web::middleware::logger::LoggerResponse<S,B> as core::future::future::Future>::poll
  15: <actix_web::middleware::compress::CompressResponse<S,B> as core::future::future::Future>::poll
  16: actix_http::h1::dispatcher::InnerDispatcher<T,S,B,X,U>::poll_response
  17: <actix_http::h1::dispatcher::Dispatcher<T,S,B,X,U> as core::future::future::Future>::poll
  18: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  19: tokio::runtime::task::harness::Harness<T,S>::poll
  20: std::thread::local::LocalKey<T>::with
  21: tokio::task::local::LocalSet::tick
  22: tokio::macros::scoped_tls::ScopedKey<T>::set
  23: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  24: std::thread::local::LocalKey<T>::with
  25: tokio::runtime::basic_scheduler::Context::enter
  26: tokio::macros::scoped_tls::ScopedKey<T>::set
  27: tokio::runtime::basic_scheduler::BasicScheduler::block_on
  28: tokio::runtime::Runtime::block_on

Full log: https://gist.github.com/ppamorim/520c04892db587790e07cadf2b30c6e4

Openraft is in sync with the main branch.

EDIT 1:

It seems to not be deterministic, if I call this a couple of seconds after the nodes start and attempt multiple times, the error doesn't happen.

This bug is visible on this branch: https://shorturl.at/cgOU0 (shorted to prevent indexing by search engines)

To reproduce: run ./script/start-cluster.sh and check n0.log.

EDIT 2:

By adding a loop inside the request to add the learner, it seems to sort the issue but it's a workaround. Feels like the version of openraft on the main branch is not ready after its initial setup. Interesting.

As:

pub async fn add_learner(
  opt: &Opt,
  client: &reqwest::Client,
  id: &CognitoNodeId,
) -> Result<Response, Error> {
  let url = format!(
    "http://{}/add-learner",
    opt
      .leader_addr
      .clone()
      .unwrap()
  );
  let ip = if let Some(ip_adddr) = opt.ip_addr.clone() {
    ip_adddr
  } else {
    opt.http_addr.clone()
  };
  let json: (&CognitoNodeId, String) = (id, ip);
  let mut attempts = opt.expected_nodes_count;
  loop {
    match client
      .post(&url)
      .json(&json)
      .send()
      .await
    {
      Ok(response) => break Ok(response),
      Err(e) => {
        if {
          println!("Error: {}", e);
          attempts -= 1;
          attempts
        } < 100 {
          break Err(e);
        }
      }
    }
  }
}
github-actions[bot] commented 2 years ago

👋 Thanks for opening this issue!

Get help or engage by:

drmingdrmer commented 2 years ago

I believe the panic happens here: https://github.com/datafuselabs/openraft/blob/d62183517be4b6e38360cc107d10f2488123641b/openraft/src/raft.rs#L439

This is to check if replication to the newly added learner becomes up to date. But here it tries to use matched.leader_id to build a LogId, while matched is an uninitialized value (0,0). In such a case, it should avoid creating a LogId but just compare the index instead.

I will remove this LogId::new() to fix this issue.

Thank you man! 😆