dusk-network / rusk

The reference Dusk platform implementation and tools
Mozilla Public License 2.0
163 stars 60 forks source link

node: Handle a race condition between `request for a block` and `block acceptance` procedure. #1722

Closed goshawk-3 closed 5 months ago

goshawk-3 commented 6 months ago

Describe the bug It has been discovered that there could be a timing conflict between the flood_request handling and block acceptance processes. Essentially, while a node is handling a GetResource(a.k.a GetData) request for a block A, it might simultaneously be in the process of accepting the same block A, resulting in the failure of the GetResource request.

To Reproduce The scenario has been reproduced multiple times in localnet where network latency is zero however this is supposed to be a valid scenario in testnet network especially if EST(accept/finalize) call takes longer.

Expected behaviour If a node knows the requested block (e.g it may be stored in Candidates CF ), it should be provided. The sooner it's provided, the more synchronized the network will become.

Logs/Screenshot

/// NODE_4 broadcasts the request:

{"timestamp":"2024-05-09T09:06:52.456742Z","level":"INFO","event":"flood_request block","req":"BlockRequest::ByHash(196b889f51ceca54...d78c685493de136c)","target":"node::chain::fsm"}

/// NODE_3 could not find the block

{"timestamp":"2024-05-09T09:06:52.458299Z","level":"INFO","message":"handle_get_resource called max_entries =100, fetch_block GetResource { inventory: Inv { inv_list: [InvVect { inv_type: BlockFromHash, param: Hash([25, 107, 136, 159, 81, 206, 202, 84, 95, 144, 86, 146, 87, 90, 131, 254, 19, 219, 6, 10, 141, 169, 241, 118, 215, 140, 104, 84, 147, 222, 19, 108]) }] }, requester_addr: 127.0.0.1:7004, ttl_as_sec: 1715245622 }","target":"node::databroker"}
{"timestamp":"2024-05-09T09:06:52.458331Z","level":"INFO","message":"handle_get_resource couldn't found","target":"node::databroker"}

/// NODE_2 could not find the block

{"timestamp":"2024-05-09T09:06:52.458724Z","level":"INFO","message":"handle_get_resource called max_entries =100, fetch_block GetResource { inventory: Inv { inv_list: [InvVect { inv_type: BlockFromHash, param: Hash([25, 107, 136, 159, 81, 206, 202, 84, 95, 144, 86, 146, 87, 90, 131, 254, 19, 219, 6, 10, 141, 169, 241, 118, 215, 140, 104, 84, 147, 222, 19, 108]) }] }, requester_addr: 127.0.0.1:7004, ttl_as_sec: 1715245622 }","target":"node::databroker"}
{"timestamp":"2024-05-09T09:06:52.458747Z","level":"INFO","message":"handle_get_resource couldn't found","target":"node::databroker"}

/// NODE_1 could not find the block

{"timestamp":"2024-05-09T09:06:52.459559Z","level":"INFO","message":"handle_get_resource called max_entries =100, fetch_block GetResource { inventory: Inv { inv_list: [InvVect { inv_type: BlockFromHash, param: Hash([25, 107, 136, 159, 81, 206, 202, 84, 95, 144, 86, 146, 87, 90, 131, 254, 19, 219, 6, 10, 141, 169, 241, 118, 215, 140, 104, 84, 147, 222, 19, 108]) }] }, requester_addr: 127.0.0.1:7004, ttl_as_sec: 1715245622 }","target":"node::databroker"}
{"timestamp":"2024-05-09T09:06:52.459600Z","level":"INFO","message":"handle_get_resource couldn't found","target":"node::databroker"}

/// NODES do accept the block a few milliseconds later

/tmp/rust-harness.lTtg/node_0.log:{"timestamp":"2024-05-09T09:06:52.513942Z","level":"INFO","event":"block accepted","height":590,"iter":0,"hash":"196b889f51ceca54...d78c685493de136c","txs":0,"state_hash":"9854b9aeee6a3745...2373f8e05976690c","fsv_bitset":3,"ssv_bitset":3,"block_time":1,"generator":"rdHa4H5p1rVtQyqD","dur_ms":"12","label":"Final","ffr":false,"target":"node::chain::acceptor"}
/tmp/rust-harness.lTtg/node_1.log:{"timestamp":"2024-05-09T09:06:52.530574Z","level":"INFO","event":"block accepted","height":590,"iter":0,"hash":"196b889f51ceca54...d78c685493de136c","txs":0,"state_hash":"9854b9aeee6a3745...2373f8e05976690c","fsv_bitset":3,"ssv_bitset":3,"block_time":1,"generator":"rdHa4H5p1rVtQyqD","dur_ms":"9","label":"Final","ffr":false,"target":"node::chain::acceptor"}
/tmp/rust-harness.lTtg/node_2.log:{"timestamp":"2024-05-09T09:06:52.530574Z","level":"INFO","event":"block accepted","height":590,"iter":0,"hash":"196b889f51ceca54...d78c685493de136c","txs":0,"state_hash":"9854b9aeee6a3745...2373f8e05976690c","fsv_bitset":3,"ssv_bitset":3,"block_time":1,"generator":"rdHa4H5p1rVtQyqD","dur_ms":"9","label":"Final","ffr":false,"target":"node::chain::acceptor"}
/tmp/rust-harness.lTtg/node_3.log:{"timestamp":"2024-05-09T09:06:52.509718Z","level":"INFO","event":"block accepted","height":590,"iter":0,"hash":"196b889f51ceca54...d78c685493de136c","txs":0,"state_hash":"9854b9aeee6a3745...2373f8e05976690c","fsv_bitset":3,"ssv_bitset":3,"block_time":1,"generator":"rdHa4H5p1rVtQyqD","dur_ms":"7","label":"Final","ffr":false,"target":"node::chain::acceptor"}

Additional context An identical problematic situation has been occurring in the ITN/testnet, where there is race between GetData request and block acceptance too.

goshawk-3 commented 6 months ago

Option 1: Assumes that GetResource msg brings the block certificate.

function handleGetResource(hash):
begin
    block = searchInLedgerCF(hash)
    if block is not found:
        block = searchInCandidatesCF(hash)
        if block is found:
            attachCertificate(block)
            sendBlockToRequester(block, requester_addr)
end

*CF - RocksDB column Family

Option 2 Assumes that the Sender maintains a cache of received Quorum messages.

// in Receiver:
function handleGetResource(hash):
begin
    block = searchInLedgerCF(hash)
    if block is not found:
        block = searchInCandidatesCF(hash)
        if block is found:
           // sends a candidate without certificate
            sendCandidateToRequester(block, requester_addr)
end

// in Requester:
function handleCandidate(candidate_block)
begin
    fetchCertificateFromCache(Hash)
    block = attachCertificate(candidate_block)

    on_block_event(block)
end
goshawk-3 commented 6 months ago

WRT Option 2:

Store/Persist Quorum Msg in RocksDB ColumnFamily. It will be deleted if it is fetched or it expires. On receiving a block without certificate, the requester looks for the corresponding Certificate in Quorum CF. Thus we can store many Quorums (on-disk) temporarily.