WARN net::peer: Peer 0 has 10001 pending messages. Dropping new message.
This does not look good. It suggests that for some reason this machine can't send things to peer 0.
But in such a case, how did the DKG reach the final step?
If `peer0` is the home computer, that would explain:
Received old message expected=MessageId(90) received=MessageId(1)
Which means the other peer is re-sending messages that were already received. This all smells like some logical error in the peer outgoing queue handling, possibly triggered by some connectivity issue.
Right. One peer is sending messages, but for some reason it thinks they are never being delivered, so it tries to resend them. The other peer is actually receiving them and responding, and the responses are being delivered just fine, so the DKG finished just fine.
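To make that mechanism concrete, here is a rough sketch of a sender-side resend queue plus a receiver tracking the next expected message id. It is illustrative only (made-up names, not fedimint's actual code), but it shows how lost ACKs combined with a replay-on-reconnect policy produce exactly the "Received old message" log above:

```rust
// Illustrative sketch only -- NOT fedimint's actual code. It models the
// behaviour described above: the sender keeps every unacknowledged message in
// a resend queue and replays it after a reconnect; if the ACKs never arrive,
// the receiver then sees message ids it has already processed.
use std::collections::VecDeque;

type MessageId = u64;

struct Sender {
    next_id: MessageId,
    resend_queue: VecDeque<(MessageId, String)>, // sent but not yet ACKed
}

impl Sender {
    fn send(&mut self, payload: String) -> (MessageId, String) {
        let id = self.next_id;
        self.next_id += 1;
        self.resend_queue.push_back((id, payload.clone()));
        (id, payload)
    }

    /// Drop everything up to and including the acknowledged id.
    #[allow(dead_code)] // never reached in this scenario: the ACKs are lost
    fn ack(&mut self, acked: MessageId) {
        while matches!(self.resend_queue.front(), Some((id, _)) if *id <= acked) {
            self.resend_queue.pop_front();
        }
    }
}

struct Receiver {
    next_expected: MessageId,
}

impl Receiver {
    fn receive(&mut self, id: MessageId) {
        if id < self.next_expected {
            println!(
                "Received old message expected=MessageId({}) received=MessageId({id})",
                self.next_expected
            );
            return;
        }
        self.next_expected = id + 1;
        // ...process the message and send an ACK for `id` back...
    }
}

fn main() {
    let mut tx = Sender { next_id: 1, resend_queue: VecDeque::new() };
    let mut rx = Receiver { next_expected: 1 };

    // 89 messages are delivered and processed, but the ACKs never make it
    // back to the sender, so its resend queue keeps growing.
    for _ in 0..89 {
        let (id, _payload) = tx.send("dkg step".into());
        rx.receive(id);
    }

    // After a reconnect the sender replays its queue from the beginning:
    let (first_id, _) = tx.resend_queue.front().cloned().unwrap();
    rx.receive(first_id); // prints: expected=MessageId(90) received=MessageId(1)
}
```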
I've tested `iperf3` on the same ports used by fedimint and I get an average of 120 Mbit/s of data transfer in both directions, so it's not a bandwidth issue. Also `mtr` shows no dropped packets.
But when running in debug mode I see that `unexpected end of file` and `Connection reset by peer (os error 104)` errors are causing reconnections.
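For what it's worth, those two messages correspond to `std::io::ErrorKind::UnexpectedEof` and `ErrorKind::ConnectionReset` (os error 104 is ECONNRESET on Linux). A hedged sketch of a "treat these as transient and reconnect" check, purely illustrative and not fedimint's actual reconnect logic:

```rust
// Hedged sketch -- not fedimint's actual reconnect logic. It only shows how
// the two errors from the debug logs map onto std::io::ErrorKind and what a
// naive "reconnect on transient I/O errors" predicate could look like.
use std::io::{Error, ErrorKind};

fn is_transient(err: &Error) -> bool {
    matches!(
        err.kind(),
        ErrorKind::UnexpectedEof        // "unexpected end of file"
            | ErrorKind::ConnectionReset // "Connection reset by peer (os error 104)"
            | ErrorKind::ConnectionAborted
            | ErrorKind::BrokenPipe
    )
}

fn main() {
    let reset = Error::from_raw_os_error(104); // ECONNRESET on Linux
    let eof = Error::from(ErrorKind::UnexpectedEof);
    println!("{reset} -> reconnect: {}", is_transient(&reset));
    println!("{eof} -> reconnect: {}", is_transient(&eof));
}
```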
Increasing `MAX_PEER_OUT_OF_ORDER_MESSAGES` and the ack timeout makes DKG finish successfully and moves the server to `VerifyingConfigs`.
Tried again with increased `MAX_PEER_OUT_OF_ORDER_MESSAGES` and got:
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.163600Z DEBUG run_io_thread: fedimint_server::net::queue: Received ACK for MessageId(670)
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.178141Z DEBUG run_io_thread: net::peer: Peer disconnected peer=PeerId(1) err=Connection reset by peer (os error 104) disconnect_count=0
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.178154Z DEBUG run_io_thread: net::peer: Scheduling reopening of connection disconnect_count=1 peer=PeerId(1) delay_secs=0.01
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.189240Z DEBUG run_io_thread: net::peer: Trying to reconnect
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.266317Z INFO connection{remote_addr=<REDACTED>:56674 conn_id=3}: timing: Operation timing name=distributed-gen duration_ms=60000
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.388358Z DEBUG rustls::server::hs: decided upon suite TLS13_AES_256_GCM_SHA384
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.392349Z DEBUG run_io_thread: rustls::client::hs: No cached session for DnsName(DnsName(DnsName("peerr")))
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.392393Z DEBUG run_io_thread: rustls::client::hs: Not resuming any session
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.607509Z DEBUG run_io_thread: rustls::client::hs: Using ciphersuite TLS13_AES_256_GCM_SHA384
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.607535Z DEBUG run_io_thread: rustls::client::tls13: Not resuming
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.607653Z DEBUG run_io_thread: rustls::client::tls13: TLS1.3 encrypted extensions: [ServerNameAck]
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.607664Z DEBUG run_io_thread: rustls::client::hs: ALPN protocol is None
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.607678Z DEBUG run_io_thread: rustls::client::tls13: Got CertificateRequest CertificateRequestPayloadTLS13 { context: , extensions: [SignatureAlgorithms([ECDSA_NISTP384_SHA384, ECDSA_NISTP256_SHA256, ED25519, RSA_PSS_SHA512, RSA_PSS_SHA384, RSA_PSS_SHA256, RSA_PKCS1_SHA512, RSA_PKCS1_SHA384, RSA_PKCS1_SHA256]), AuthorityNames([3010310e300c06035504030c057065657267, 3010310e300c06035504030c057065657272])] }
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.607694Z DEBUG run_io_thread: rustls::client::common: Attempting client auth
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.608246Z DEBUG run_io_thread: net::peer: Received incoming connection peer=PeerId(1) disconnect_count=1 resend_queue_len=125
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.620089Z DEBUG run_io_thread: net::peer: Exiting peer connection IO task - parent disconnected
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.620189Z INFO run_io_thread: net::peer: Shutting down peer connection state machine peer=PeerId(1)
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:49.474863Z DEBUG rustls::server::hs: decided upon suite TLS13_AES_256_GCM_SHA384
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:49.672663Z WARN net::peer: Could not send incoming connection to peer io task (possibly banned) peer=PeerId(1)
This has got to be some logical error somewhere, not a performance issue. And if it is about performance, it would rather be latency (some timeout somewhere).
I've created a minimal Rust client-server program using tokio's TcpStream/TcpListener and rustls, the same way fedimint does for its communication, and I get no dropped connections or any other kind of error at the network layer.
So the problem must be on a higher layer inside fedimint. There are no issues at the hardware or rustls/tokio level.
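For reference, a minimal sketch of that kind of echo test over tokio's TcpListener/TcpStream (plain TCP only, on an arbitrary localhost port; the rustls layer and the exact framing used in the real test are left out here):

```rust
// Minimal sketch of a tokio TCP echo test in the spirit of the one described
// above (plain TCP only; the real test also wrapped the streams in rustls).
// Requires tokio with the "full" feature set.
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::{TcpListener, TcpStream};

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // Arbitrary localhost port, chosen just for this sketch.
    let listener = TcpListener::bind("127.0.0.1:18173").await?;

    // Server: echo everything back until the client closes the connection.
    let server = tokio::spawn(async move {
        let (mut socket, _addr) = listener.accept().await?;
        let mut buf = vec![0u8; 64 * 1024];
        loop {
            let n = socket.read(&mut buf).await?;
            if n == 0 {
                break; // client closed the connection
            }
            socket.write_all(&buf[..n]).await?;
        }
        Ok::<_, std::io::Error>(())
    });

    // Client: send many messages and check each one comes back intact.
    let mut stream = TcpStream::connect("127.0.0.1:18173").await?;
    let msg = vec![0xABu8; 4096];
    let mut echo = vec![0u8; msg.len()];
    for i in 0..10_000 {
        stream.write_all(&msg).await?;
        stream.read_exact(&mut echo).await?;
        assert_eq!(echo, msg, "echo mismatch at iteration {i}");
    }
    drop(stream); // closing the client lets the server task finish

    server.await.expect("server task panicked")?;
    println!("no dropped connections or I/O errors");
    Ok(())
}
```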
@douglaz, @EthnTuttle, @m1sterc001guy, @okjodom and I were just attempting a federation setup and saw this. In the logs shared below, IIRC @EthnTuttle and @m1sterc001guy had DigitalOcean servers in New York, I had one in San Francisco, and @okjodom had one in Vancouver. The config API said DKG succeeded for @EthnTuttle and @m1sterc001guy, but failed for me and @okjodom. We had a previous run that failed similarly with @douglaz running a server in India.
My logs: https://gist.github.com/justinmoon/60e205466051d21d9da250294db3ca1d
Debug logging level. My horse for logs with debug level. :D
OK. Initial theory:
The root of all evil is state. Especially mutable and shared. It seems to me like peer communication is using a connection where both sides think they are in control of the connection. So with higher latencies it is possible to get into a sort of reconnect ping-pong, where one side reconnects and starts re-sending all messages, and during that time the other side reconnects and starts sending messages itself... just to get a new connection from the other side...
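To spell out the suspected ping-pong, here is a purely illustrative sketch (made-up types, not fedimint's connection state machine) of the policy each peer would be running; when both sides apply it at the same time over a high-latency link, each fresh connection keeps displacing the other side's replay:

```rust
// Purely illustrative -- made-up types, NOT fedimint's connection code.
// Both peers run this same policy. Over a high-latency link, A's reconnect and
// B's reconnect can overlap: A is still replaying its resend queue when B's
// fresh connection arrives, A drops its in-flight connection in favour of the
// incoming one, B interprets that drop as a disconnect and reconnects again,
// and the two sides ping-pong instead of making progress.

enum Event {
    Disconnected,
    IncomingConnection,
}

struct PeerConnection {
    resend_queue: Vec<String>, // everything the peer has not yet ACKed
}

impl PeerConnection {
    fn handle(&mut self, event: Event) {
        match event {
            Event::Disconnected => {
                // Policy: we "own" the connection, so reconnect ourselves and
                // replay every unacknowledged message from the start.
                println!("reconnect and replay {} messages", self.resend_queue.len());
            }
            Event::IncomingConnection => {
                // Policy: an incoming connection from the peer replaces the one
                // we are currently using -- even if we were mid-replay on it.
                println!("drop current connection, adopt the incoming one");
            }
        }
    }
}

fn main() {
    // resend_queue_len=125 as in the debug log above.
    let mut a = PeerConnection { resend_queue: vec!["dkg msg".into(); 125] };
    a.handle(Event::Disconnected);       // A reconnects and starts replaying
    a.handle(Event::IncomingConnection); // B reconnected too; A abandons its replay
    a.handle(Event::Disconnected);       // B sees the drop, and the cycle repeats
}
```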
Would anyone with the affected setup like to try https://github.com/fedimint/fedimint/pull/2841?
Works for me!
If I set up a federation using two or more machines in the same datacenter, everything works fine. But if one of the machines is a home computer on another continent, then weird stuff starts to happen. In this example, DKG actually failed.
Note 1: all ports are open and properly forwarded. Communication seems to be okay in both directions. Home internet is 1 Gbit/s fiber, so there are no bandwidth issues besides international link constraints.
Note 2: logs are abbreviated; repeated or sequential messages are replaced with (...)
home computer logs:
Remote datacenter computer: