V0.1.1: Final outgoing signing broadcast does not always finish sending

tbraun96 commented 4 months ago

To fix this issue, see this diff here: https://github.com/dfns/cggmp21/compare/v0.1.1...webb-tools:cggmp21:v0.1.1-fix

By yielding to the async runtime, we allow the outgoing messages to completely get flushed to the outgoing stream. Without this yield, I have recorded peers sometimes not sending this final broadcast message, despite other peer(s) finishing the protocol.

Note: this was all tested while using futures channels for the Delivery implementation.

maurges commented 4 months ago

This is a strange problem, especially seeing how another yield happens right after this one in receiving messages. Does it also happen in keygen? Are you sure this is not a quirk of your runtime? Can you try and create a reproduction?

tbraun96 commented 4 months ago

We are using the tokio runtime. I have not noticed this issue in the keygen.

This is a strange problem, especially seeing how another yield happens right after this one in the receiving messages

I do not see this yield?

    outgoings
        .send(Outgoing::broadcast(Msg::Round4(MsgRound4 {
            sigma: partial_sig.sigma,
        })))
        .await
        .map_err(IoError::send_message)?;
    tracer.msg_sent();
    // runtime.yield_now().await; < --- proposed new code

    // Output
    tracer.named_round_begins("Signature reconstruction");

    tracer.receive_msgs();
    let partial_sigs = rounds
        .complete(round4)
        .await
        .map_err(IoError::receive_message)?;
    tracer.msgs_received();
    let sig = {
        let r = NonZero::from_scalar(partial_sig.r);
        let s = NonZero::from_scalar(
            partial_sig.sigma + partial_sigs.iter().map(|m| m.sigma).sum::<Scalar<E>>(),
        );
        Option::zip(r, s).map(|(r, s)| Signature { r, s }.normalize_s())
    };
    let sig_invalid = match &sig {
        Some(sig) => sig.verify(&pk, &message_to_sign).is_err(),
        None => true,
    };
    if sig_invalid {
        // Following the protocol, party should broadcast additional proofs
        // to convince others it didn't cheat. However, since identifiable
        // abort is not implemented yet, this part of the protocol is missing
        return Err(SigningAborted::SignatureInvalid.into());
    }
    let sig = sig.ok_or(SigningAborted::SignatureInvalid)?;

    tracer.protocol_ends();
    Ok(ProtocolOutput::Signature(sig))

There is only one async call following the broadcast where a yield may potentially occur, however, this is runtime-dependent behavior, therefore, we need to explicitly yield to ensure runtime-independence.

tbraun96 commented 4 months ago

To reproduce this error, checkout this branch:

https://github.com/webb-tools/gadget/pull/90

Then alter the root-level Cargo.toml to use cggmp version 0.1.1.

Then, run the test via cargo test --package dfns-cggmp21-protocol test_externalities_keyrotation

You will notice towards the key rotation phase, that only one peer finishes, and on a rare occasion, both peers finish. (note: there are 2 signers).

To prove that this PR works, use the current code in the aforementioned branch without the modification to the cggmp dependency, and observe how both peers always finish the protocol with this yield.

Note: you will need logging too, so export RUST_LOG=gadget=trace

survived commented 4 months ago

By yielding to the async runtime, we allow the outgoing messages to completely get flushed to the outgoing stream

That says that Delivery (specifically, outgoing sink) implementation does not do its job right. When we send a message, we do sink.send() which puts message in a sink and then flushes it, i.e. makes sure all messages are sent. Therefore, no additional yield point should be required to completely get flushed to the outgoing sink.

I see that you're getting this problem when doing simulation of the protocol when all signers co-exist in the same process. I believe that the root of the problem is that signers are competing for the limited resources of the tokio runtime. Note that, by default, #[tokio::test] is single-threaded, so only one computation can be done at the same time, this could be the reason the signers get stuck. I'm not sure exactly why tho — in theory, signers should be able to live with that, but it could be a combination of limited resources + specifics of futures channels implementation. You could try assigning more threads to the test env to see if that helps (see tokio::test macro docs)

survived commented 4 months ago

On other hand, we're doing tests for signing on a single-threaded runtime as well and they work fine. We're using round_based::simulation to simulate I/O between the signers which's based on tokio broadcast channels. Maybe there could be something in the futures-based implementation that causes this behavior or the way you use it? Where in the code you define Delivery implementation? How many signers do you simulate in the tests?

survived commented 4 months ago

I do not see this yield?

Below, under the line where you proposed to do a yield point, there's

rounds
    .complete(round4)
    .await

which does yield if there's not enough messages received. Given that you're saying that one of the signers doesn't send the last broadcast message, it's guaranteed that this code yields.

survived commented 2 months ago

So as I said above, I believe the problem lies within delivery implementation. Flushing the sink of outgoing messages should force all messages to be sent. Apparently, it doesn't happen, so that could be a place where something goes wrong.

dfns / cggmp21

V0.1.1: Final outgoing signing broadcast does not always finish sending #83