feat(p2p): Track peers' connection failures and rank accordingly

LePremierHomme commented 3 years ago

Closing https://github.com/ExchangeUnion/xud/issues/1970.

Few remarks:

The offline/dead terminology wasn't used. Instead, consecutiveConnFailure counter was added to known nodes, and a ranking function which works in accordance with some thresholds.
Peer session can fail even if raw connection was successful (in handshake or validation). These failures suppose to affect the peer's reputation. The current ranking function doesn't consider that, but rather only the consecutiveConnFailure counter (it's possible to change that).
consecutiveConnFailure cannot be correlated to time, so simpler thresholds (200 & 1500) than what was specified in https://github.com/ExchangeUnion/xud/issues/1970#issuecomment-733626904 were used.
If no potential peers found in the first threshold (200) group, the second's (1500) group is used. A more sophisticated adjustments can be made, but I just went with the simplest.

kilrau commented 3 years ago

NICE!

* Peer session can fail even if raw connection was successful (in handshake or validation). These failures suppose to affect the peer's reputation. The current ranking function doesn't consider that, but rather only the `consecutiveConnFailure` counter (it's possible to change that).

You didn't say why you decided to make the consecutiveConnFailure counter account for these connection failures, but not the ranking function. Anyhow, if it adds significant compliexity to make the ranking function account for these failures, we can leave as is. If not we should make it account for it to be consistent imo.

If no potential peers found in the first threshold (200) group, the second's (1500) group is used.

Is this on startup (first connection attempt) or 2 onwards? The second group (1500) was the one we defined as dead and I don't think it should ever be considered for connection. What we can do in case the first group is empty: try to connect to seed nodes ignoring their status

kilrau commented 3 years ago

Rest sounds good!

LePremierHomme commented 3 years ago

You didn't say why you decided to make the consecutiveConnFailure counter account for these connection failures, but not the ranking function. Anyhow, if it adds significant compliexity to make the ranking function account for these failures, we can leave as is. If not we should make it account for it to be consistent imo.

consecutiveConnFailure counts raw connection failures, but not general peer session failures. The ranking function currently considers only consecutiveConnFailure, but it can consider also other failures, by using the peer reputation score. It’s easy to implement. (We can also just make the raw connection failures as another reputation event, but I thought that keeping it separate might be worthwhile.)

Is this on startup (first connection attempt) or 2 onwards? The second group (1500) was the one we defined as dead and I don't think it should ever be considered for connection. What we can do in case the first group is empty: try to connect to seed nodes ignoring their status

The first group is < 200 (good peers), the second group is < 1500 (“offline” peers). The rest (“dead” peers) are ignored. See here: https://github.com/ExchangeUnion/xud/pull/2009/files#diff-1046c174ecc568579870d80a7fb8d8125eca7011b25e73deba9153915e5b4185R77

kilrau commented 3 years ago

It’s easy to implement

Then I'd say let's do it!

(We can also just make the raw connection failures as another reputation event, but I thought that keeping it separate might be worthwhile.)

Yep, let's keep these separate for now.

The first group is < 200 (good peers), the second group is < 1500 (“offline” peers). The rest (“dead” peers) are ignored. See here: https://github.com/ExchangeUnion/xud/pull/2009/files#diff-1046c174ecc568579870d80a7fb8d8125eca7011b25e73deba9153915e5b4185R77

Gotcha, all good then. Thx for pointing me at it!

erkarl commented 3 years ago

Can we rebase this branch against master? Looks like there are some extra commits here (change password).

kilrau commented 3 years ago

Can we rebase this branch against master? Looks like there are some extra commits here (change password).

:+1: @LePremierHomme

kilrau commented 3 years ago

when i disabled my wifi, waited ~7 mins and checked listpeers I still got peers (but im not online) - does not look like good behavior, because if i have peers i have orders from them (but i was not able to swap with them because i have no internet);

After wifi enabling i waited ~7 mins and checked listpeers after that -> i lost one of four peers - OzoneYellow that connected through clearnet IP instead of tor (xud1 peer in simnet network), i waited 10 additional minutes and checked listpeers after that, it still was not in list (but another my node has it); I got it back only after restart; Screenshot from 2020-12-01 16-23-17

also after restart i got the case when one my node has another my node in listpeers but another one has no this node in the list (after 5 min of wifi enabling); Screenshot from 2020-12-01 16-49-58

I agree that we want to fix all of these, but probably none of it is related to this PR. Did your run these test cases on master to verify? @raladev If they reproduce on master just the same, I suggest to design a decent set of "p2p reconnection" test cases. We still have tons of issues with p2p reconnections, e.g. https://github.com/ExchangeUnion/xud/issues/1862 or https://github.com/ExchangeUnion/xud/issues/1699 but all of them are hard to reproduce and thus hard to fix for @LePremierHomme

Your thoughts? @LePremierHomme

raladev commented 3 years ago

@raladev If they reproduce on master just the same,

I reproduced them on both versions

So can anyone please describe how P2P connection establishing / connection supporting / connection restoring cases work in xud?

I suggest to design a decent set of "p2p reconnection" test cases.

I can do that, but i need to know how it should work in ideal way )

kilrau commented 3 years ago

I reproduced them on both versions

Alright, then it's clear that they are not related to this PR and shouldn't be in here. I opened https://github.com/ExchangeUnion/xud/issues/2010 - let's start collecting the test cases there once @LePremierHomme confirmed how it this should behave. I'll add my 2cts after this.

sangaman commented 3 years ago

I tried to test it but faced some strange cases during the testing.

1. when i disabled my wifi, waited ~7 mins and checked listpeers I still got peers (but im not online) - does not look like good behavior, because if i have peers i have orders from them (but i was not able to swap with them because i have no internet);

2. After wifi enabling i waited ~7 mins and checked listpeers after that -> i lost one of four peers - OzoneYellow that connected through clearnet IP instead of tor (xud1 peer in simnet network), i waited 10 additional minutes and checked listpeers after that, it still was not in list (but another my node has it); I got it back only after restart;
   [Screenshot from 2020-12-01 16-23-17](https://user-images.githubusercontent.com/29906866/100750561-43101c00-33f7-11eb-81ea-9ab04ca099b2.png)

3. also after restart i got the case when one my node has another my node in listpeers but another one has no this node in the list (after 5 min of wifi enabling);
   [Screenshot from 2020-12-01 16-49-58](https://user-images.githubusercontent.com/29906866/100750492-296ed480-33f7-11eb-8613-2de2dec005bf.png)

So can anyone please describe how P2P connection establishing / connection supporting / connection restoring cases work in xud? @sangaman @LePremierHomme

I'm opening a separate issue as this is definitely a concern but not related to any of the changes in this PR. Edit: Never mind, I saw Kilian already opened one so I'll follow up there.

LePremierHomme commented 3 years ago

@kilrau

It’s easy to implement

Then I'd say let's do it!

See https://github.com/ExchangeUnion/xud/pull/2009/commits/8e7ba3180766589fd93fd43d7e18604130a68917. Feel free to suggest changes.

kilrau commented 3 years ago

Code & original concept look good to me, however I'm unsure about the newly added logic to make peers secondary (or make us not connect at all). For one thing, we are defacto silently "banning" peers if they are outbound and we don't have listening addresses, because under the secondary threshold we stop automatically connecting to them. But beyond that, I thought the purpose of this PR was to deal with a flood of bad connection attempts, but a low reputation score doesn't necessarily mean we'll have any trouble connecting to the peer.

We just discussed this on the daily call and this behaviour is fine as-is, since we eventually want to stop connecting to permanently dead peers (and there are quite some) and our default setups all have valid listening addresses, so if a long-dead peer comes back to life after several weeks or months, it will reconnect to us.

What I actually thought @kilrau was suggesting originally was to make 200/1500 consecutive failed connection attempts a negative reputation event that lowers a node's score. That makes a little more sense to me (we don't want peers that are habitually unreachable over the long run), although I don't think it's necessary in this PR.

We discussed this too, hopefully a simple change: remove reputation events from this PR. (Reason: we are not sure if reconnection attempts/down-time should be immediately tied to reputation)

LePremierHomme commented 3 years ago

We discussed this too, hopefully a simple change: remove reputation events from this PR. (Reason: we are not sure if reconnection attempts/down-time should be immediately tied to reputation)

Connection failures are currently not tied to reputation, but ranking does consider both connection failures and reputation. Are suggesting to have ranking consider only the connection failures (as before my last commit)?

kilrau commented 3 years ago

Are suggesting to have ranking consider only the connection failures (as before my last commit)?

Yes!

erkarl commented 3 years ago

Think we can get this in after rebase? @sangaman

ExchangeUnion / xud

feat(p2p): Track peers' connection failures and rank accordingly #2009