ExchangeUnion / xud

Exchange Union Daemon 🔁 ⚡️
https://exchangeunion.com
GNU Affero General Public License v3.0
115 stars 44 forks source link

not connected to peer after disconnect #616

Closed offerm closed 5 years ago

offerm commented 5 years ago

My node was up connected to 3 peers and have peer orders.

wifi down for 3 minutes, wifi back up.

peers are missing (no reconnect).

kilrau commented 5 years ago

Maybe add logs @offerm

offerm commented 5 years ago

Not really needed.

This is not a bug - we don't [YET] have the code that suppose to do it (IMHO).

Log only shows the disconnects and that is it.

On Tue, Oct 30, 2018 at 10:34 AM Kilian Rausch ⚡️ notifications@github.com wrote:

Maybe add logs @offerm https://github.com/offerm

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ExchangeUnion/xud/issues/616#issuecomment-434214211, or mute the thread https://github.com/notifications/unsubscribe-auth/AJQ0cnA7YiaRCDqaYa5k0tXVFLwoWuaAks5uqA8jgaJpZM4YBZTX .

kilrau commented 5 years ago

Hmm shouldn't https://github.com/ExchangeUnion/xud/pull/392 (https://github.com/ExchangeUnion/xud/issues/311) catch it where we defined a connection retry logic for a momentarily unavailable peer?

moshababo commented 5 years ago

@kilrau #392 is for the initial connection retries, not after disconnection. We talked before about re-connecting after a disconnection, but decided not to do so.

This specific disconnection happens because peer is stalling ping/pong response. It doesn't make sense to define re-connection from the side which closed the connection. Instead we can check network connectivity before closing peers for stalling responses.

But the other side might close the connection as well, and we won't know about it until we're back online. In this case re-connecting make sense. To implement this, we need to define specific cases for reconnecting after peer closed the connection, because in some cases it can be considered as DoS.

To do this, we first need to implement #152. I expect that we'll get this packet when we'll be back online, but we need to verify it.

sangaman commented 5 years ago

Would the idea then be to attempt a reconnection after receiving a DISCONNECTING packet? And I'm guessing the packet should specify that the reason for the disconnection is because our node became unresponsive?

offerm commented 5 years ago

Kind of inventing the wheel here, no?

See how this started, I had a wifi issue, not a disconnect by a remote peer.

IMHO, if I connected to a peer and disconnected, there should be a simple retry mechanism that try to connect in a similar way we are doing when LND disconnected.

Now, if the connection was rejected (reject package) by a peer there is no need to try an reconnect.

IMHO, peer rejection and peer banning have some time... wifi problems and even closing my my laptop going to sleep are here already.

On Wed, Oct 31, 2018 at 4:14 PM Daniel McNally notifications@github.com wrote:

Would the idea then be to attempt a reconnection after receiving a DISCONNECTING packet? And I'm guessing the packet should specify that the reason for the disconnection is because our node became unresponsive?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ExchangeUnion/xud/issues/616#issuecomment-434703054, or mute the thread https://github.com/notifications/unsubscribe-auth/AJQ0cleSG_WnjE2KoTVZT6_Un0nH4B4Gks5uqbBPgaJpZM4YBZTX .

sangaman commented 5 years ago

I'm not 100% sure but as I understand it the TCP socket will not necessarily close if you lose wifi connectivity temporarily, but your peers will see that you have become unresponsive and disconnect from you. So the actual socket connection would be closed by the peer in this case. Correct me if I'm wrong.

offerm commented 5 years ago

TCP socket will close when TCP layers notice it. If you have a short drop that is recovered quickly, TCP me recover without the peers notice. If TCP detected the problem and fail to revoced the peer will not be able to recover.

So, when see the error on the application level, it is a done deal.

On Wed, Oct 31, 2018 at 4:54 PM Daniel McNally notifications@github.com wrote:

I'm not 100% sure but as I understand it the TCP socket will not necessarily close if you lose wifi connectivity temporarily, but your peers will see that you have become unresponsive and disconnect from you. So the actual socket connection would be closed by the peer in this case. Correct me if I'm wrong.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ExchangeUnion/xud/issues/616#issuecomment-434717561, or mute the thread https://github.com/notifications/unsubscribe-auth/AJQ0ciRx5JuGrI7qPTMwfIbNAP7d3dPlks5uqbmQgaJpZM4YBZTX .

sangaman commented 5 years ago

Thinking about this some more, your node would likewise see that pings are failing and would disconnect from the peer even if it is technically you that are having connectivity issues. So I'm thinking we wouldn't receive a DISCONNECTING packet in either case.

My question is, how are we going to tell the difference between ourselves going temporarily offline and a peer going offline? We can also attempt pings to google.com or something , but that wouldn't always work if we are on a LAN or something.

If we detect that we've lost connectivity and then regained it, we can just attempt to reconnect to all nodes like we do on startup.

Another approach would be to just apply the same retry logic we use during initial connection to a node anytime we disconnect to a node due to loss of connectivity and packets timing out.

offerm commented 5 years ago

Which disconnect packet do you expect to received? When you disconnect it is final. No need to look for the reason.

Just attempt to reconnect with the peer (if you created the connection or if you know how to connect it (not using the temporaty port).

Just as LND is doing :-)

On Wed, Oct 31, 2018 at 5:29 PM Daniel McNally notifications@github.com wrote:

Thinking about this some more, your node would likewise see that pings are failing and would disconnect from the peer even if it is technically you that are having connectivity issues. So I'm thinking we wouldn't receive a DISCONNECTING packet in either case.

My question is, how are we going to tell the difference between ourselves going temporarily offline and a peer going offline? We can also attempt pings to google.com or something , but that wouldn't always work if we are on a LAN or something.

If we detect that we've lost connectivity and then regained it, we can just attempt to reconnect to all nodes like we do on startup.

Another approach would be to just apply the same retry logic we use during initial connection to a node anytime we disconnect to a node due to loss of connectivity and packets timing out.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ExchangeUnion/xud/issues/616#issuecomment-434729940, or mute the thread https://github.com/notifications/unsubscribe-auth/AJQ0cgXtNCCqhoM0pnWqfevWixGkQMymks5uqcG_gaJpZM4YBZTX .

sangaman commented 5 years ago

We don't have a disconnection packet currently, it's just an issue/idea. The reason is if the peer intentionally disconnected with you, it's not really related to this topic.

I'd be fine with automatically attempting to reconnect if we outgoing connection information, seems like the simplest solution.

moshababo commented 5 years ago

@offerm according to you’re suggestion, if i’m getting a socket close event, i’ll just try to reconnect. it means that the peer cannot chose to disconnect from me, because every time he tries to, i’m reconnecting.

@sangaman If we can't detect our connectivity, we can't detect if packets are timing out due to stalling, and we can only assume it's due to loss of connectivity.

I'm not 100% sure but as I understand it the TCP socket will not necessarily close if you lose wifi connectivity temporarily, but your peers will see that you have become unresponsive and disconnect from you. So the actual socket connection would be closed by the peer in this case. Correct me if I'm wrong.

correct, and in this case you'll probably get the PING/DISCONNECTING packets & socket close event when you’ll be back online. but i'm not sure we can rely on that.

offerm commented 5 years ago

@moshababo For socket close (TCP/IP), and disconnect due to timeouts and missing pings, we for sure should try to reconnect. if you want a peer to "reject a connection" that should be done by a dedicated packet on the XUD level. If a peer is just stopping, we should try to reconnect until it is back up.

Only time when we are not trying to reconnect is after we got this special purpose packet that should basically block the peer. That packet should not be used by the peer when the peer is going down.

We should also use that packet during handshake it the peer is banned.

kilrau commented 5 years ago

Anything that speaks against this? @moshababo @sangaman

moshababo commented 5 years ago

special purpose packet

@offerm Is there something in what you’r suggesting which differs from what’s specified in #152, besides the latter being more general-purpose?

So the idea is:

Do you agree? @kilrau @offerm @sangaman

offerm commented 5 years ago

I don't see a need for stalling packet. Just close the connection and retry.

I would try to reconnect also after the last point. The time between reconnect attempt should reset only after successful handshake.

I wouldn't implement the #152 packet until the situation is stable.

kilrau commented 5 years ago

Let's assume we are still in @offerm's scenario from above. His laptop connected to one of our cloud servers. Wifi issues, connection breaks on tcp level. When he comes back online, I am having a hard time to imagine how he even gets a #152 packet from his peers if the connection was closed on tcp level, we cannot rely on it to still be open as you wrote above. The stalling packet makes sense for some advanced use cases like "I'm going to ban you know because of reason XYZ" as described in #152 but not here I think. So I agree with @offerm to wait with #152 .

In short: If I go offline, come back online and notice my sockets are closed, I'll initiate the reconnect from my side to all my previous peers.

if peer closed the socket without sending the #152 packet or HELLO, don't attempt to reconnect (peer doesn't follow the protocol), but don't ban (he might recover)

That's exactly the situation from a peers point of view, when I loose my wifi connection. It's perfectly fine to do nothing here because as described above I will initiate the reconnect which is much more efficient.

moshababo commented 5 years ago

@offerm

I would try to reconnect also after the last point. The time between reconnect attempt should reset only after successful handshake.

can you clarify this?

--

My assumption is that we don't know our network connectivity status. so if the socket got closed, I don't know whether it's because I was offline and stalling. If it's because I got banned, and i'll try to reconnect, it's a problem because we'll get a feedback loop.

So without #152 packet, I see it reasonable to implement only the first bullet from my suggestion:

if peer is stalling responses, xud will send the #152 packet (for stalling), close the socket, and immediately attempt to reconnect (with all the retry logic from #311). It means that we're trying to reconnect after we actively closed the socket - a bit weird, but we can live with that

I think this would be fine as we are expected to inspect the peer stalling responses before the TCP socket will timeout. After closing the socket, and reconnecting, we might still be offline, but connection retries will trigger as well.

kilrau commented 5 years ago

I agree on the necessity for #152 to implement this properly. Since handling TCP layer connectivity issues isn't focus currently, I suggest to do dependency #152 first and then pick up this one in the next milestone. In the meanwhile we try to find a consensus on what we expect from the implementation of this issue.

Summary by @moshababo with some fixes:

EDIT: some of the more complicated #152 behavior moved to new issue https://github.com/ExchangeUnion/xud/issues/693

moshababo commented 5 years ago

So the basic thing is that if the socket got closed but we didn't receive #152 - we reconnect (assume network issue). Otherwise if we did received the #152, we reconnect on specific cases.

Sounds fine to me.

moshababo commented 5 years ago

For banning a node solely for outgoing connections, we can simply delete the node from our repository. But if we want to keep it there for historical data, we need to add a new banning mode. We can look into it before implementing though.