WARN net::peer: Peer 0 has 10001 pending messages. Dropping new message.
This does not look good. It suggests that for some reason this machine can't send things to peer 0.
But in such a case, how did the DKG reach the final step?
If `peer0` is the home computer, that would explain:
Received old message expected=MessageId(90) received=MessageId(1)
Which means the other peer is re-sending messages that were already received. This all smells like some logical error in the peer outgoing queue handling, possibly triggered by some connectivity issue.
Right. One peer is sending messages, but for some reason it thinks they are never being delivered, so it tries to resend them. The other peer is actually receiving them and responding, and the responses are being delivered just fine, so the DKG finished just fine.
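To make that mechanism concrete, here is a rough sketch of a sender-side resend queue plus a receiver tracking the next expected message id. It is illustrative only (made-up names, not fedimint's actual code), but it shows how lost ACKs combined with a replay-on-reconnect policy produce exactly the "Received old message" log above:

```rust
// Illustrative sketch only -- NOT fedimint's actual code. It models the
// behaviour described above: the sender keeps every unacknowledged message in
// a resend queue and replays it after a reconnect; if the ACKs never arrive,
// the receiver then sees message ids it has already processed.
use std::collections::VecDeque;

type MessageId = u64;

struct Sender {
    next_id: MessageId,
    resend_queue: VecDeque<(MessageId, String)>, // sent but not yet ACKed
}

impl Sender {
    fn send(&mut self, payload: String) -> (MessageId, String) {
        let id = self.next_id;
        self.next_id += 1;
        self.resend_queue.push_back((id, payload.clone()));
        (id, payload)
    }

    /// Drop everything up to and including the acknowledged id.
    #[allow(dead_code)] // never reached in this scenario: the ACKs are lost
    fn ack(&mut self, acked: MessageId) {
        while matches!(self.resend_queue.front(), Some((id, _)) if *id <= acked) {
            self.resend_queue.pop_front();
        }
    }
}

struct Receiver {
    next_expected: MessageId,
}

impl Receiver {
    fn receive(&mut self, id: MessageId) {
        if id < self.next_expected {
            println!(
                "Received old message expected=MessageId({}) received=MessageId({id})",
                self.next_expected
            );
            return;
        }
        self.next_expected = id + 1;
        // ...process the message and send an ACK for `id` back...
    }
}

fn main() {
    let mut tx = Sender { next_id: 1, resend_queue: VecDeque::new() };
    let mut rx = Receiver { next_expected: 1 };

    // 89 messages are delivered and processed, but the ACKs never make it
    // back to the sender, so its resend queue keeps growing.
    for _ in 0..89 {
        let (id, _payload) = tx.send("dkg step".into());
        rx.receive(id);
    }

    // After a reconnect the sender replays its queue from the beginning:
    let (first_id, _) = tx.resend_queue.front().cloned().unwrap();
    rx.receive(first_id); // prints: expected=MessageId(90) received=MessageId(1)
}
```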
I've tested `iperf3` on the same ports used by fedimint and I get an average of 120 Mbit/s of data transfer in both directions, so it's not a bandwidth issue. Also `mtr` shows no dropped packets.
But when running in debug mode I see that `unexpected end of file` and `Connection reset by peer (os error 104)` errors are causing reconnections.
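For what it's worth, those two messages correspond to `std::io::ErrorKind::UnexpectedEof` and `ErrorKind::ConnectionReset` (os error 104 is ECONNRESET on Linux). A hedged sketch of a "treat these as transient and reconnect" check, purely illustrative and not fedimint's actual reconnect logic:

```rust
// Hedged sketch -- not fedimint's actual reconnect logic. It only shows how
// the two errors from the debug logs map onto std::io::ErrorKind and what a
// naive "reconnect on transient I/O errors" predicate could look like.
use std::io::{Error, ErrorKind};

fn is_transient(err: &Error) -> bool {
    matches!(
        err.kind(),
        ErrorKind::UnexpectedEof        // "unexpected end of file"
            | ErrorKind::ConnectionReset // "Connection reset by peer (os error 104)"
            | ErrorKind::ConnectionAborted
            | ErrorKind::BrokenPipe
    )
}

fn main() {
    let reset = Error::from_raw_os_error(104); // ECONNRESET on Linux
    let eof = Error::from(ErrorKind::UnexpectedEof);
    println!("{reset} -> reconnect: {}", is_transient(&reset));
    println!("{eof} -> reconnect: {}", is_transient(&eof));
}
```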
Increasing `MAX_PEER_OUT_OF_ORDER_MESSAGES` and the ack timeout makes DKG finish successfully and moves the server to `VerifyingConfigs`.
Tried again with increased `MAX_PEER_OUT_OF_ORDER_MESSAGES` and got:
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.163600Z DEBUG run_io_thread: fedimint_server::net::queue: Received ACK for MessageId(670)
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.178141Z DEBUG run_io_thread: net::peer: Peer disconnected peer=PeerId(1) err=Connection reset by peer (os error 104) disconnect_count=0
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.178154Z DEBUG run_io_thread: net::peer: Scheduling reopening of connection disconnect_count=1 peer=PeerId(1) delay_secs=0.01
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.189240Z DEBUG run_io_thread: net::peer: Trying to reconnect
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.266317Z INFO connection{remote_addr=<REDACTED>:56674 conn_id=3}: timing: Operation timing name=distributed-gen duration_ms=60000
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.388358Z DEBUG rustls::server::hs: decided upon suite TLS13_AES_256_GCM_SHA384
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.392349Z DEBUG run_io_thread: rustls::client::hs: No cached session for DnsName(DnsName(DnsName("peerr")))
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.392393Z DEBUG run_io_thread: rustls::client::hs: Not resuming any session
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.607509Z DEBUG run_io_thread: rustls::client::hs: Using ciphersuite TLS13_AES_256_GCM_SHA384
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.607535Z DEBUG run_io_thread: rustls::client::tls13: Not resuming
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.607653Z DEBUG run_io_thread: rustls::client::tls13: TLS1.3 encrypted extensions: [ServerNameAck]
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.607664Z DEBUG run_io_thread: rustls::client::hs: ALPN protocol is None
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.607678Z DEBUG run_io_thread: rustls::client::tls13: Got CertificateRequest CertificateRequestPayloadTLS13 { context: , extensions: [SignatureAlgorithms([ECDSA_NISTP384_SHA384, ECDSA_NISTP256_SHA256, ED25519, RSA_PSS_SHA512, RSA_PSS_SHA384, RSA_PSS_SHA256, RSA_PKCS1_SHA512, RSA_PKCS1_SHA384, RSA_PKCS1_SHA256]), AuthorityNames([3010310e300c06035504030c057065657267, 3010310e300c06035504030c057065657272])] }
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.607694Z DEBUG run_io_thread: rustls::client::common: Attempting client auth
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.608246Z DEBUG run_io_thread: net::peer: Received incoming connection peer=PeerId(1) disconnect_count=1 resend_queue_len=125
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.620089Z DEBUG run_io_thread: net::peer: Exiting peer connection IO task - parent disconnected
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:48.620189Z INFO run_io_thread: net::peer: Shutting down peer connection state machine peer=PeerId(1)
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:49.474863Z DEBUG rustls::server::hs: decided upon suite TLS13_AES_256_GCM_SHA384
fedimintd-mutinynet-fedimintd-1 | 2023-07-20T22:03:49.672663Z WARN net::peer: Could not send incoming connection to peer io task (possibly banned) peer=PeerId(1)
This has got to be some logical error somewhere, not a performance issue. And if it is about performance, it would rather be latency (some timeout somewhere).
I've created a minimal Rust client-server program using tokio's TcpStream/TcpListener and rustls, the same way fedimint does for its communication, and I get no dropped connections or any other kind of error at the network layer.
So the problem must be on a higher layer inside fedimint. There are no issues at the hardware or rustls/tokio level.
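For reference, a minimal sketch of that kind of echo test over tokio's TcpListener/TcpStream (plain TCP only, on an arbitrary localhost port; the rustls layer and the exact framing used in the real test are left out here):

```rust
// Minimal sketch of a tokio TCP echo test in the spirit of the one described
// above (plain TCP only; the real test also wrapped the streams in rustls).
// Requires tokio with the "full" feature set.
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::{TcpListener, TcpStream};

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // Arbitrary localhost port, chosen just for this sketch.
    let listener = TcpListener::bind("127.0.0.1:18173").await?;

    // Server: echo everything back until the client closes the connection.
    let server = tokio::spawn(async move {
        let (mut socket, _addr) = listener.accept().await?;
        let mut buf = vec![0u8; 64 * 1024];
        loop {
            let n = socket.read(&mut buf).await?;
            if n == 0 {
                break; // client closed the connection
            }
            socket.write_all(&buf[..n]).await?;
        }
        Ok::<_, std::io::Error>(())
    });

    // Client: send many messages and check each one comes back intact.
    let mut stream = TcpStream::connect("127.0.0.1:18173").await?;
    let msg = vec![0xABu8; 4096];
    let mut echo = vec![0u8; msg.len()];
    for i in 0..10_000 {
        stream.write_all(&msg).await?;
        stream.read_exact(&mut echo).await?;
        assert_eq!(echo, msg, "echo mismatch at iteration {i}");
    }
    drop(stream); // closing the client lets the server task finish

    server.await.expect("server task panicked")?;
    println!("no dropped connections or I/O errors");
    Ok(())
}
```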
@douglaz, @EthnTuttle, @m1sterc001guy, @okjodom and I were just attempting a federation setup and saw this. In the logs shared below, IIRC @EthnTuttle and @m1sterc001guy had DigitalOcean servers in New York, I had one in San Francisco, and @okjodom had one in Vancouver. The config API said DKG succeeded for @EthnTuttle and @m1sterc001guy, but failed for me and @okjodom. We had a previous run that failed similarly with @douglaz running a server in India.
My logs: https://gist.github.com/justinmoon/60e205466051d21d9da250294db3ca1d
Debug logging level. My horse for logs with debug level. :D
OK. Initial theory:
The root of all evil is state. Especially mutable and shared. It seems to me like peer communication is using a connection where both sides think they are in control of the connection. So with higher latencies it is possible to get into a sort of reconnect ping-pong, where one side reconnects and starts re-sending all messages, and during that time the other side reconnects and starts sending messages itself... just to get a new connection from the other side...
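To spell out the suspected ping-pong, here is a purely illustrative sketch (made-up types, not fedimint's connection state machine) of the policy each peer would be running; when both sides apply it at the same time over a high-latency link, each fresh connection keeps displacing the other side's replay:

```rust
// Purely illustrative -- made-up types, NOT fedimint's connection code.
// Both peers run this same policy. Over a high-latency link, A's reconnect and
// B's reconnect can overlap: A is still replaying its resend queue when B's
// fresh connection arrives, A drops its in-flight connection in favour of the
// incoming one, B interprets that drop as a disconnect and reconnects again,
// and the two sides ping-pong instead of making progress.

enum Event {
    Disconnected,
    IncomingConnection,
}

struct PeerConnection {
    resend_queue: Vec<String>, // everything the peer has not yet ACKed
}

impl PeerConnection {
    fn handle(&mut self, event: Event) {
        match event {
            Event::Disconnected => {
                // Policy: we "own" the connection, so reconnect ourselves and
                // replay every unacknowledged message from the start.
                println!("reconnect and replay {} messages", self.resend_queue.len());
            }
            Event::IncomingConnection => {
                // Policy: an incoming connection from the peer replaces the one
                // we are currently using -- even if we were mid-replay on it.
                println!("drop current connection, adopt the incoming one");
            }
        }
    }
}

fn main() {
    // resend_queue_len=125 as in the debug log above.
    let mut a = PeerConnection { resend_queue: vec!["dkg msg".into(); 125] };
    a.handle(Event::Disconnected);       // A reconnects and starts replaying
    a.handle(Event::IncomingConnection); // B reconnected too; A abandons its replay
    a.handle(Event::Disconnected);       // B sees the drop, and the cycle repeats
}
```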
Would anyone with the affected setup like to try https://github.com/fedimint/fedimint/pull/2841?
Works for me!
If I set up a federation using two or more machines in the same datacenter, everything works fine. But if one of the machines is a home computer on another continent, then weird stuff starts to happen. In this example, DKG actually failed.
Note 1: all ports are open and properly forwarded. Communication seems to be okay in both directions. Home internet is 1 Gbit/s fiber, so there are no bandwidth issues besides international link constraints.
Note 2: logs are abbreviated; repeated or sequential messages are replaced with (...)
home computer logs:
Remote datacenter computer: