[Fix] Perform block advancement after quorum check

raychu86 commented 3 months ago

Motivation

This PR updates the final stage of validator syncing by adding an additional quorum check. Instead of blindly adding the block to ledger, we now check that each block's leader certificate reached quorum based on the certificates in the subsequent block.

This solves 2 problems:

Naive validator sync

A validator synced up to tip and added the last block (X) without performing any quorum checks.
The validator would start processing consensus messages and process incoming certificates.
It would reach a commit state and try to construct X through the standard non-sync procedure.
However the ledger already stored X.
An error would repeat, halting block advancement on this node until the node was rebooted.

`Unable to advance to the next block - Failed to speculate on transactions - Failed to post-ratify - Next round {} must be greater than current round {}`

The change now skips adding the latest block because we can't guarantee the block meets the quorum. So we will process the block through consensus instead of the sync process.

Malicious validator sync

The malicious sync case is defined in a hackerOne finding: https://github.com/AleoHQ/snarkOS/issues/3226

The new quorum checks should ensure that we are not processing blocks blindly.

raychu86 commented 3 months ago

@ghostant-1017 Could you confirm that this addresses your finding - https://github.com/AleoHQ/snarkOS/issues/3226?

ghostant-1017 commented 3 months ago

@ghostant-1017 Could you confirm that this addresses your finding - #3226?

Of course! I will test it out.

ghostant-1017 commented 3 months ago

Hey @raychu86 , This PR indeed addresses the issue #3226.

However, during my review process, I identified another potential DOS attack vector. It only requires a minor modification on top of the existing changes, I haven't conducted practical tests yet.

The main logic is as follows:

Upon receiving an invalid Block, we invoke self.storage.sync_certificate_with_block. Regardless of its success, we always call bft_sender.send_sync_bft(certificate).
Upon receiving rx_sync_bft.recv(), self.update_dag is called.
In the update_dag function, we insert the certificate into the subdag.
The BatchCertificate can contain a maximum of 200 signatures. This means that attackers can continuously send invalid Blocks(different round far in the future), causing honest validator nodes to insert meaningless BatchCertificates into memory. Eventually, this could lead to a memory overflow.

I can probably confirm the effectiveness of this attack. Please recheck whether this attack is valid.I won't submit another report for this, I hope you consider this potential DOS attack when assessing the severity of this report.

howardwu commented 3 months ago

Thanks for flagging this @ghostant-1017!

I think the attack described may be possible, there's one thing to consider: the sync logic only takes blocks from BlockResponse messages/events. I believe we check (when receiving a BlockResponse) that we requested the specific block, and then clear this request from the cache (meaning a future BlockResponse for the same block will not be allowed in).

If my understanding is correct, this suggests we could have an attacker send this block once, but then it no longer works.

Perhaps one thing we could do is have sync_storage_with_block check at the beginning that:

// Return early if this block has already been processed.
if self.ledger.contains_block_height(block.height()) {
    return Ok(());
}

ghostant-1017 commented 3 months ago

Hey, @howardwu
Maybe the protection is not enough. Assume the latest_height = 10000, in this situation, the attacker only generate block at height=10001 and broadcast. I think it's possible to generate blocks at height=10002, height=10003 and even higher blocks. And the invalid blocks will always pass the height-contain check.

howardwu commented 3 months ago

Thanks for the feedback. Wouldn't it require at least 2f+1 private keys to be compromised in order to create a consecutive exploit? We reference the committee_lookback to ensure the next subdag is well-formed. I suppose the attacker would need to have compromised a substantial set of the existing validators in order to pull this off. Let me know if you see another angle.

ghostant-1017 commented 3 months ago

It seems that the self.update_dag will be called before quorum check. self.update_dag will be called at: https://github.com/AleoHQ/snarkOS/pull/3232/files#diff-f82e0c9ecc0d4ff1e73050db6da1dc5909f4f8c5afc98d3effd254a440f079e3L355 quorum_check will be called at: https://github.com/AleoHQ/snarkOS/pull/3232/files#diff-f82e0c9ecc0d4ff1e73050db6da1dc5909f4f8c5afc98d3effd254a440f079e3R425

@howardwu Correct me if I'm wrong.

howardwu commented 3 months ago

Yes, the idea is that the check in update_dag and subsequently here both enforce availability or quorum threshold. So wouldn't the attacker need to compromise at least f private keys from the committee to allow for the node to sync the block?

(Sorry if I missed the point)

raychu86 commented 3 months ago

I think Ghostant is saying we are calling update_dag before performing any quorum/availability checks. So if malicious nodes send a faulty block with garbage certificates, we add it to storage before performing any quorum checks.

I believe our mitigation for this is in the BlockSync::construct_request where we handle the dishonest case.

@ghostant-1017 let us know if our understanding of your concern is correct.

ghostant-1017 commented 3 months ago

@raychu86 Yes, you get me, that's what I mean. So you mean the BlockSync::construct_request will handle this case? I will take a look into it.

ghostant-1017 commented 3 months ago

@raychu86 @howardwu I don't think BlockRequest::construct_request is secure in this scenario. In the honest check within construct_request, the attacker is the only peer with a height greater than all other nodes, it will pass all checks.

Do you agree? I'm not entirely certain about my idea without a PoC. I'd be willing to attempt a PoC to validate it if necessary.

raychu86 commented 3 months ago

@ghostant-1017 I agree that this could be an issue, and we should have some more mitigating checks to prevent this. However this attack existed prior to this PR, so I'm of the opinion we try to mitigate it in a separate PR after this is merged.

I don't believe it is quite as straightforward to address in this PR, because we need to perform update_dag and sync_certificate_with_block in order to even perform the availability threshold checks.

In fact, I believe that this is a more fundamental issue with receiving blocks, as this attack using malicious blocks could be done to clients and validators performing the non-BFT sync. Bitcoin has this issue as well, but i believe they get around this by doing chain-reorgs and rollbacks.

Couple ideas for mitigation:

Have nodes declare their trusted sync nodes.
Enforce at least X other sync nodes has the block before requesting.
Initialize community trusted sync nodes.

Let us know if you agree or have other insights.

ghostant-1017 commented 3 months ago

@raychu86 Yes! Our views have aligned now. One more suggestion: Maybe we can check the certificate's round is not too far when update_dag?

And I'm trying finding a way to simplify the syncing process. I'm not quite sure, but I will share with you if I'm ready.

AleoNet / snarkOS