Network fork detected in multi-region simulations

helins commented 1 year ago

This seems to happen especially when 2 regions are involved, for reasons I don't understand yet:

["2023-07-17T10:37:01.58" :warn [convex.peer.AThreadedComponent 34] ["Unexpected exception in ComponentTask" {:trace [[convex.core.Peer updateState "Peer.java" 474] [convex.peer.CVMExecutor loop "CVMExecutor.java" 53] [convex.peer.AThreadedComponent$ComponentTask run "AThreadedComponent.java" 29] [java.lang.Thread run "Thread.java" 833]],:message "Network Fork detected but fork recovery diabled!",:exception java.lang.IllegalStateException}]]

Example distribution for a 2-region, 24-peer setup over a 10-minute run (number of such exceptions per peer log):

R2-24/log/peer/21.cvx:21273
R2-24/log/peer/1.cvx:27
R2-24/log/peer/4.cvx:20399
R2-24/log/peer/5.cvx:34
R2-24/log/peer/0.cvx:31929
R2-24/log/peer/12.cvx:30259
R2-24/log/peer/13.cvx:27032
R2-24/log/peer/17.cvx:19059
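Per-peer counts like these can be reproduced by counting the log lines that contain the exception message. The following is a minimal sketch, not the tooling actually used for the runs. It assumes one log entry per line and peer logs named `*.cvx` in a single directory, as the paths above suggest. The function name `count_fork_exceptions` and the match string are illustrative.

```python
from pathlib import Path

# Assumption: matching on the start of the message seen in the log entry above.
FORK_MESSAGE = "Network Fork detected"


def count_fork_exceptions(log_dir, pattern="*.cvx", message=FORK_MESSAGE):
    """Map each log file name to the number of lines containing `message`.

    Files without any matching line are left out of the result.
    """
    counts = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        # errors="replace" keeps the scan going if a log has stray bytes.
        with path.open(encoding="utf-8", errors="replace") as f:
            n = sum(1 for line in f if message in line)
        if n:
            counts[path.name] = n
    return counts
```

Calling `count_fork_exceptions("R2-24/log/peer")` would return a dict keyed by file name, which makes it easy to compare runs with different region counts.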

It is odd that it happens so systematically. A real fork should be a rare event, especially when running in reliable data centers.

mikera commented 1 year ago

Hmmm, 2 regions may be a slightly weird edge case... Both regions presumably have fast internal connections, so you may get two "blocks" of peers disagreeing with each other before they have a chance to get aligned across regions. Will take a look at this case.

We should be enabling fork recovery soon anyway (in which case this becomes a less serious issue).

helins commented 1 year ago

But why are 2 regions radically different from 3?

In the few 3-region runs I did, only 1 run had 1 peer with only 28 such exceptions.

Curiously, it also showed up in a single-region run with 36 peers and only 1 user, albeit to a much lesser extent, and the run did complete:

R1-36-LATENCY/log/peer/27.cvx:4
R1-36-LATENCY/log/peer/32.cvx:2
R1-36-LATENCY/log/peer/30.cvx:6
R1-36-LATENCY/log/peer/25.cvx:2
R1-36-LATENCY/log/peer/3.cvx:2
R1-36-LATENCY/log/peer/28.cvx:2
R1-36-LATENCY/log/peer/16.cvx:2

Convex-Dev / convex

Network fork detected in multi-region simulations #496