bnb-chain / tss-lib

Threshold Signature Scheme, for ECDSA and EDDSA
MIT License
790 stars 271 forks source link

Why rerun BaseUpdate #162

Closed kraymond37 closed 2 years ago

kraymond37 commented 2 years ago

Hi, I have a problem about BaseUpdate, when a round can proceed, why rerun BaseUpdate?

yycen commented 2 years ago

Hello Raymond.K, do you mean the definition of BaseUpdate(), it is recursively called in some branch? This is mainly for the case in reshare, when new parties only run round e.g. 1 3 5, while old parties run round 2, 4, 5. For a new party runs round 1, there could be no message for him in round 2, so the "Canprocceed()" check would never be OK for him and the party stuck. The recursive call bypass this. There is also another solution, you are redefine the nextround() for round 1, to jump to round 3 directly.

kraymond37 commented 2 years ago

I find signing.TestE2EConcurrent would also stuck without reruning BaseUpdate. Because the message consuming process is concurrent, sometimes next round's message is processed a bit early. And from my understanding this branch need not to rerun BaseUpdate?

yycen commented 2 years ago

Yes, the branch you pointed at is for when p.advance() is nil, for when nextround() is nil, so the protocol is just finished for the party. Not sure about the stuck in signing.TestE2EConcurrent you mentioned. Could you change the BaseUpdate() to below and try again, it runs from my computer. The BaseUpdate() is only called from an incoming message, and there is an indicator array round.ok[] to check if received from other party. If all message are received, the round is canprocceed(), and will call advance() in BaseUpdate(). So for a party to advance(), there are two requirements: 1) must have message, otherwise the BaseUpdate() will have no chance to be called, 2) all message received, and the round.ok[] is all true.

func BaseUpdate(p Party, msg ParsedMessage, task string) (ok bool, err *Error) {
    // fast-fail on an invalid message; do not lock the mutex yet
    if _, err := p.ValidateMessage(msg); err != nil {
        return false, err
    }
    p.lock() // data is written to P state below
    defer p.unlock()
    common.Logger.Debugf("party %s received message: %s", p.PartyID(), msg.String())
    if p.round() != nil {
        common.Logger.Debugf("party %s round %d update: %s", p.PartyID(), p.round().RoundNumber(), msg.String())
    }
    if ok, err := p.StoreMessage(msg); err != nil || !ok {
        return false, err
    }
    if p.round() != nil {
        common.Logger.Debugf("party %s: %s round %d update", p.round().Params().PartyID(), task, p.round().RoundNumber())
        if _, err := p.round().Update(); err != nil {
            return false, err
        }
        if p.round().CanProceed() {
            if p.advance(); p.round() != nil {
                if err := p.round().Start(); err != nil {
                    return false, err
                }
                rndNum := p.round().RoundNumber()
                common.Logger.Infof("party %s: %s round %d started", p.round().Params().PartyID(), task, rndNum)
            } else {
                // finished! the round implementation will have sent the data through the `end` channel.
                common.Logger.Infof("party %s: %s finished!", p.PartyID(), task)
            }
        }
    }
    return true, nil
}
kraymond37 commented 2 years ago

Yes, just as you mentioned, signing.TestE2EConcurrent would stuck accidently. It depends which goroutine is faster, next round's message consumer or this round's message consumer. Say sometimes party2 sends round8-msg and round9-msg in a very short time to party1, and party 1 almost consume the two messages simultaneously, but round9-msg is consumed faster. So I thought the reruning BaseUpdate was for concurrent until you say it is for resharing.

kraymond37 commented 2 years ago

By the way, I delete the old testfixtures and set participants to 2 or 3.

yycen commented 2 years ago

You are right. I reproduce the stuck in sigining.E2E. What I find is that although the round.out is FIFO, and r8msg is before r9msg in the channel. The updater in test file call go-routine go updater(party, msg) could possible swap the order. So in order to use a non-recursive BaseUpdate, it would also need to fit updater to ensure that msg are updated by order.