Connection issues analysis

dpad85 commented 1 year ago

I've been able to reproduce a connection issue on my mainnet Android device where

The device suddenly gets very bad Internet connectivity. All apps are affected, including Phoenix.
connection to the Peer is lost
connection to Electrum is not. We still receive the new blocks (and even receive them right on time, I was checking in real time), however pings do time-out.

I've edit the logs so they are very easy to read: https://bin.idrix.fr/?365fbd76444eadc4#wcppPRpmo4YaVRze82nU4bCuYURoPeqvo6scuxbccSo

There are several issues:

The Peer connection should not stay in ESTABLISHING state so long. Note that when that the internet connection is fixed, the peer only gets disconnected 20s later (probably by the remote peer) and only then we try to reconnect. This makes for a frustrating UX.
Electrum connection soldiers on and remains ESTABLISHED, even when pings fail. Pings have no effect.
The two points above means that the user will think that there's an issue with the ACINQ node.

Another issue that will have to be fixed in the Phoenix repo:

The exponential backoff in the AppConnectionDaemon connection loop is not working as expected (the pause should be reset at some point, instead of incrementing mindlessly: in the end when all the issues above were fixed, this loop still waited for 2s which was not useful).

t-bast commented 1 year ago

It's very surprising that pings to the electrum server timeout but we still receive block, it looks like TCP sockets are only inactive on the sending side (probably some OS-related weirdness).

1, The Peer connection should not stay in ESTABLISHING state so long.

It does reflect reality though, as we are trying to connect but not succeeding yet? If we added a timeout, we would just end up DISCONNECTED instead of ESTABLISHING. Would this be a better UX for the user? Do you want that because you can then trigger the reconnection from the Phoenix side once you get a notification from the OS that connectivity is back?

It would fix the 20 seconds timeout though, and it's trivial to implement, so it's probably worth adding anyway.

Electrum connection soldiers on and remains ESTABLISHED, even when pings fail. Pings have no effect.

I'm not sure what we should do here. Pings are only here to help the TCP connection stay alive (it's just a TCP keep-alive): if they fail but the connection is still alive, we have achieved our goal, so there's nothing to do? If we decided to disconnect when pings timeout, then we would most likely fail to reconnect: the result is that electrum would also appear disconnected. Is that what we want? I see why we could want that for UX reasons (even though it's wasteful to disconnect and reconnect when we have a viable TCP connection).

Another issue with automatically disconnecting on pings timeout is that I don't know if electrum servers guarantee that they will answer to ping messages. If we start disconnecting valid connections that don't respond to ping but correctly respond to everything else, that would really be bad...

dpad85 commented 1 year ago

it looks like TCP sockets are only inactive on the sending side (probably some OS-related weirdness).

Could be, the connectivity issue started when I uploaded a large log file from my device to Gmail.

If we added a timeout, we would just end up DISCONNECTED instead of ESTABLISHING. Would this be a better UX for the user?

Yes I think so. I don't know exactly why, but connection attempts sometimes hang for a while. During that time, the UI is stuck, it looks like a bug. Forcing reconnection (by restarting the app) usually fixes the issue. A good UX would be:

I try to connect for 5s
it fails
I pause for a few seconds, and then try again (with exp. backoff)
the UI keeps track of this, so I'm not worried. There's also a button to reset the pause and try immediately if I'm in a hurry.

^ this would need work on Phoenix's UI as well

I'm not sure what we should do here. Pings are only here to help the TCP connection stay alive (it's just a TCP keep-alive): if they fail but the connection is still alive, we have achieved our goal, so there's nothing to do?

I agree, I'm 100% fine with keeping the Electrum connection alive as long as possible. However, pinging Electrum seems unreliable, so we could:

simply remove the pings altogether, but it's risky, there could be side-effects;
decrease the log level of ping errors to info or debug.

t-bast commented 1 year ago

Yes I think so. I don't know exactly why, but connection attempts sometimes hang for a while. During that time, the UI is stuck, it looks like a bug. Forcing reconnection (by restarting the app) usually fixes the issue. A good UX would be:

Sounds good to me. The timeout (5s) should be provided by Phoenix, so that Phoenix can check whether the user has activated Tor or not and adapt the timeout value accordingly. Ideally if the OS gives you details about the connection quality, that should also be taken into account when choosing what timeout value to use.

decrease the log level of ping errors to info or debug.

Definitely, we should do that!

ACINQ / lightning-kmp

Connection issues analysis #531