axelarnetwork / tofn

A threshold cryptography library in Rust
Apache License 2.0

Nondeterministic test failure on ubuntu in release mode #102

Closed by ggutoski 3 years ago

ggutoski commented 3 years ago

The test keeps failing in the GitHub workflow [log entry] despite passing on everyone's local machine. Commit: 558950dd1a6678903adee3149ebdbdcd5fc4e34a

Earlier versions of this test, prior to the above commit (PR #100), never failed.

The GitHub workflow runs cargo test --release --all-features on ubuntu-18.04.

After extensive experimentation, it seems that the test fails only in --release mode and only on Ubuntu, and even then it does not always fail. I have an old Mint 18 laptop (ubuntu-16.04) on which I can reliably reproduce the failure via cargo test --release --all-features. On that machine, the failure occurs only rarely when the test is run by itself. For example:

cargo test --release --all-features --test integration -- multi_thread::basic_correctness

I do not know whether --all-features is necessary to reproduce the failure. (I think I observed a failure once without --all-features, but I can't remember for certain.)

milapsheth commented 3 years ago

This issue was caused by messages from the next round being received early by a party that was still waiting on a message from the current round. For example, all parties have sent their Round 1 messages, but Bob is still waiting to receive Charlie's Round 1 message (due to network latency). Meanwhile, Alice has received all Round 1 messages, proceeded to Round 2, and started broadcasting her Round 2 message, which Bob then receives. As a result, Bob thinks Alice incorrectly sent a duplicate message.
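To make the failure mode concrete, here is a minimal sketch of the race. It is not tofn's actual code: `Msg`, `Outcome`, and `NaiveParty` are hypothetical names, and the sketch assumes a party that tracks only which senders it has heard from in its current round, without checking the round number carried by the message.

```rust
use std::collections::HashSet;

/// Hypothetical message: the sender's name and the round it belongs to.
#[derive(Debug, Clone, Copy)]
struct Msg {
    from: &'static str,
    round: u8,
}

#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    Accepted,
    Duplicate,
}

/// Hypothetical party that remembers only *who* has sent a message in the
/// current round and never looks at `Msg::round`.
struct NaiveParty {
    heard_from: HashSet<&'static str>,
}

impl NaiveParty {
    fn new() -> Self {
        Self { heard_from: HashSet::new() }
    }

    /// A second message from the same sender is treated as a duplicate,
    /// even if it actually belongs to the next round.
    fn receive(&mut self, msg: Msg) -> Outcome {
        if self.heard_from.insert(msg.from) {
            Outcome::Accepted
        } else {
            Outcome::Duplicate
        }
    }
}

fn main() {
    let mut bob = NaiveParty::new();

    // Charlie's Round 1 message is delayed, so Bob is still in Round 1
    // when Alice's Round 2 message arrives.
    let arrivals = [
        Msg { from: "alice", round: 1 },
        Msg { from: "alice", round: 2 },
    ];

    for msg in arrivals {
        let outcome = bob.receive(msg);
        println!("round {} message from {}: {:?}", msg.round, msg.from, outcome);
    }
}
```

In this sketch the second arrival is reported as `Duplicate` even though Alice behaved correctly, which matches the symptom described above.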

Solving this problem properly requires some effort (see #116). Instead, we can rely on axelar-core and assume that a Round 2 message is received only after all Round 1 messages. This is because if Alice has advanced to Round 2, there must already be a block that contains the final Round 1 message. Since axelar-core processes blocks in order, Bob will also see all Round 1 messages before any Round 2 message.
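The ordering argument can be illustrated with a small sketch. It rests on assumptions that are not taken from tofn or axelar-core: `Msg`, `Party`, and `process_chain` are hypothetical names, blocks are modeled as plain vectors of messages, and each party is assumed to see every party's broadcast (including its own) on the chain.

```rust
use std::collections::HashSet;

/// Hypothetical message: the sender's name and the round it belongs to.
struct Msg {
    from: &'static str,
    round: u8,
}

/// Hypothetical party that advances to the next round once it has seen
/// one message from each of the `n` parties.
struct Party {
    round: u8,
    n: usize,
    seen: HashSet<&'static str>,
}

impl Party {
    fn new(n: usize) -> Self {
        Self { round: 1, n, seen: HashSet::new() }
    }

    fn receive(&mut self, msg: &Msg) -> Result<(), String> {
        if msg.round != self.round {
            return Err(format!(
                "round {} message from {} while still in round {}",
                msg.round, msg.from, self.round
            ));
        }
        if !self.seen.insert(msg.from) {
            return Err(format!("duplicate message from {}", msg.from));
        }
        if self.seen.len() == self.n {
            // All messages for this round have arrived: move on.
            self.round += 1;
            self.seen.clear();
        }
        Ok(())
    }

    /// Process blocks strictly in order, as the argument above assumes.
    fn process_chain(&mut self, blocks: &[Vec<Msg>]) -> Result<(), String> {
        for block in blocks {
            for msg in block {
                self.receive(msg)?;
            }
        }
        Ok(())
    }
}

fn main() {
    // Alice's Round 2 message can only land in a block after the one
    // holding the final Round 1 message (Charlie's, in this example).
    let chain = vec![
        vec![Msg { from: "alice", round: 1 }, Msg { from: "bob", round: 1 }],
        vec![Msg { from: "charlie", round: 1 }],
        vec![Msg { from: "alice", round: 2 }],
    ];

    let mut bob = Party::new(3);
    println!("in-order result: {:?}", bob.process_chain(&chain));
}
```

Under these assumptions Bob finishes Round 1 before he ever sees Alice's Round 2 message, so neither error path is reached. If the messages were instead delivered directly and out of order, the same `receive` would reject Alice's Round 2 message.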