eBay / nuraft_mesg

A session management and gRPC service layer for NuRaft.
Apache License 2.0
6 stars 11 forks source link

TIMEOUT error reported by data_service_request_bidirectional call (no flip, no restarts, pure I/O) #87

Open yamingk opened 1 month ago

yamingk commented 1 month ago

After running I/O for awhile, TIMEOUT error can be reported by data_service_request_bidirectional while fetching data, this is not expected as there is no flip, no restarts involved in any replica, just pure I/O:

This is seen with real drives, not file.

[05/15/24 13:38:29-07:00] [E] [test_raft_repl_dev_replica_2] [1574] [raft_repl_dev.cpp:515] [rd=rdev1] Not able to fetching data from originator=1251922335, error=-2, probably originator is down. Will retry when new leader start appending log entries
[05/15/24 13:38:35-07:00] [E] [test_raft_repl_dev_replica_2] [1626] [raft_repl_dev.cpp:692] [rd=rdev1] Raft Channel: Error in processing rreq=[repl_key=[server=1251922335, term=1, dsn=64118], lsn=-1 state=[BLK_ALLOCATED | ] m_headersize=0 m_keysize=0 is_proposer=false local_blkid=[{blk#=917505 count=2 chunk=181},] remote_blkid=[{blk#=737284 count=2 chunk=619},]] error=TIMEOUT
[05/15/24 13:38:35-07:00] [E] [test_raft_repl_dev_replica_2] [1626] [raft_repl_dev.cpp:692] [rd=rdev1] Raft Channel: Error in processing rreq=[repl_key=[server=1251922335, term=1, dsn=64119], lsn=-1 state=[BLK_ALLOCATED | ] m_headersize=0 m_keysize=0 is_proposer=false local_blkid=[{blk#=425986 count=2 chunk=182},] remote_blkid=[{blk#=1032193 count=2 chunk=620},]] error=TIMEOUT
[05/15/24 13:38:35-07:00] [E] [test_raft_repl_dev_replica_2] [1626] [raft_repl_dev.cpp:692] [rd=rdev1] Raft Channel: Error in processing rreq=[repl_key=[server=1251922335, term=1, dsn=64120], lsn=-1 state=[BLK_ALLOCATED | ] m_headersize=0 m_keysize=0 is_proposer=false local_blkid=[{blk#=671751 count=4 chunk=183},] remote_blkid=[{blk#=737281 count=4 chunk=621},]] error=TIMEOUT
[05/15/24 13:38:35-07:00] [E] [test_raft_repl_dev_replica_2] [1626] [raft_repl_dev.cpp:692] [rd=rdev1] Raft Channel: Error in processing rreq=[repl_key=[server=1251922335, term=1, dsn=64123], lsn=-1 state=[BLK_ALLOCATED | ] m_headersize=0 m_keysize=0 is_proposer=false local_blkid=[{blk#=327683 count=1 chunk=184},] remote_blkid=[{blk#=753666 count=1 chunk=624},]] error=TIMEOUT
[05/15/24 13:38:35-07:00] [E] [test_raft_repl_dev_replica_2] [1626] [raft_repl_dev.cpp:692] [rd=rdev1] Raft Channel: Error in processing rreq=[repl_key=[server=1251922335, term=1, dsn=64121], lsn=-1 state=[BLK_ALLOCATED | ] m_headersize=0 m_keysize=0 is_proposer=false local_blkid=[{blk#=704515 count=5 chunk=185},] remote_blkid=[{blk#=376838 count=5 chunk=622},]] error=TIMEOUT
szmyd commented 1 month ago

Why is this opened in the nuraft_mesg project and not homestore which is where the issue seems to be according to the logs? If there's some raft issue; then we need raft logs...not homestore logs which would indicate a bug on the library user, not the library itself. Not saying it is, but that's the only thing left here to look at...

yamingk commented 1 month ago

It is return immediately at line:506. And we need to look into why mesg_factory::data_service_request_bidirectional is returning -2 when there is no reboot in between. We don't know where the root cause is yet without deeper investigation. It could be in upper layer or lower layer caused this problem.

 505     group_msg_service()
 506         ->data_service_request_bidirectional(
 507             originator, FETCH_DATA,
 508             sisl::io_blob_list_t{
 509                 sisl::io_blob{builder->GetBufferPointer(), builder->GetSize(), false /* is_aligned */}})
 510         .via(&folly::InlineExecutor::instance())
 511         .thenValue([this, builder, rreqs = std::move(rreqs)](auto response) {
 512             if (!response) {
 513                 // if we are here, it means the original who sent the log entries are down.
 514                 // we need to handle error and when the other member becomes leader, it will resend the log entries;
 515                 RD_LOG(ERROR,
 516                        "Not able to fetching data from originator={}, error={}, probably originator is down. Will "