Closed LePremierHomme closed 3 years ago
NICE!
* Peer session can fail even if raw connection was successful (in handshake or validation). These failures suppose to affect the peer's reputation. The current ranking function doesn't consider that, but rather only the `consecutiveConnFailure` counter (it's possible to change that).
You didn't say why you decided to make the consecutiveConnFailure
counter account for these connection failures, but not the ranking function. Anyhow, if it adds significant compliexity to make the ranking function account for these failures, we can leave as is. If not we should make it account for it to be consistent imo.
If no potential peers found in the first threshold (200) group, the second's (1500) group is used.
Is this on startup (first connection attempt) or 2 onwards? The second group (1500) was the one we defined as dead
and I don't think it should ever be considered for connection. What we can do in case the first group is empty: try to connect to seed nodes ignoring their status
Rest sounds good!
You didn't say why you decided to make theĀ consecutiveConnFailureĀ counter account for these connection failures, but not the ranking function. Anyhow, if it adds significant compliexity to make the ranking function account for these failures, we can leave as is. If not we should make it account for it to be consistent imo.
consecutiveConnFailure
counts raw connection failures, but not general peer session failures. The ranking function currently considers only consecutiveConnFailure
, but it can consider also other failures, by using the peer reputation score. Itās easy to implement. (We can also just make the raw connection failures as another reputation event, but I thought that keeping it separate might be worthwhile.)
Is this on startup (first connection attempt) or 2 onwards? The second group (1500) was the one we defined asĀ deadĀ and I don't think it should ever be considered for connection. What weĀ canĀ do in case the first group is empty: try to connect toĀ seed nodesĀ ignoring their status
The first group is < 200 (good peers), the second group is < 1500 (āofflineā peers). The rest (ādeadā peers) are ignored. See here: https://github.com/ExchangeUnion/xud/pull/2009/files#diff-1046c174ecc568579870d80a7fb8d8125eca7011b25e73deba9153915e5b4185R77
Itās easy to implement
Then I'd say let's do it!
(We can also just make the raw connection failures as another reputation event, but I thought that keeping it separate might be worthwhile.)
Yep, let's keep these separate for now.
The first group is < 200 (good peers), the second group is < 1500 (āofflineā peers). The rest (ādeadā peers) are ignored. See here: https://github.com/ExchangeUnion/xud/pull/2009/files#diff-1046c174ecc568579870d80a7fb8d8125eca7011b25e73deba9153915e5b4185R77
Gotcha, all good then. Thx for pointing me at it!
Can we rebase this branch against master? Looks like there are some extra commits here (change password).
Can we rebase this branch against master? Looks like there are some extra commits here (change password).
:+1: @LePremierHomme
when i disabled my wifi, waited ~7 mins and checked listpeers I still got peers (but im not online) - does not look like good behavior, because if i have peers i have orders from them (but i was not able to swap with them because i have no internet);
After wifi enabling i waited ~7 mins and checked listpeers after that -> i lost one of four peers - OzoneYellow that connected through clearnet IP instead of tor (xud1 peer in simnet network), i waited 10 additional minutes and checked listpeers after that, it still was not in list (but another my node has it); I got it back only after restart; Screenshot from 2020-12-01 16-23-17
also after restart i got the case when one my node has another my node in listpeers but another one has no this node in the list (after 5 min of wifi enabling); Screenshot from 2020-12-01 16-49-58
I agree that we want to fix all of these, but probably none of it is related to this PR. Did your run these test cases on master to verify? @raladev If they reproduce on master just the same, I suggest to design a decent set of "p2p reconnection" test cases. We still have tons of issues with p2p reconnections, e.g. https://github.com/ExchangeUnion/xud/issues/1862 or https://github.com/ExchangeUnion/xud/issues/1699 but all of them are hard to reproduce and thus hard to fix for @LePremierHomme
Your thoughts? @LePremierHomme
@raladev If they reproduce on master just the same,
I reproduced them on both versions
So can anyone please describe how P2P connection establishing / connection supporting / connection restoring cases work in xud?
I suggest to design a decent set of "p2p reconnection" test cases.
I can do that, but i need to know how it should work in ideal way )
I reproduced them on both versions
Alright, then it's clear that they are not related to this PR and shouldn't be in here. I opened https://github.com/ExchangeUnion/xud/issues/2010 - let's start collecting the test cases there once @LePremierHomme confirmed how it this should behave. I'll add my 2cts after this.
I tried to test it but faced some strange cases during the testing.
1. when i disabled my wifi, waited ~7 mins and checked listpeers I still got peers (but im not online) - does not look like good behavior, because if i have peers i have orders from them (but i was not able to swap with them because i have no internet); 2. After wifi enabling i waited ~7 mins and checked listpeers after that -> i lost one of four peers - OzoneYellow that connected through clearnet IP instead of tor (xud1 peer in simnet network), i waited 10 additional minutes and checked listpeers after that, it still was not in list (but another my node has it); I got it back only after restart; [Screenshot from 2020-12-01 16-23-17](https://user-images.githubusercontent.com/29906866/100750561-43101c00-33f7-11eb-81ea-9ab04ca099b2.png) 3. also after restart i got the case when one my node has another my node in listpeers but another one has no this node in the list (after 5 min of wifi enabling); [Screenshot from 2020-12-01 16-49-58](https://user-images.githubusercontent.com/29906866/100750492-296ed480-33f7-11eb-8613-2de2dec005bf.png)
So can anyone please describe how P2P connection establishing / connection supporting / connection restoring cases work in xud? @sangaman @LePremierHomme
I'm opening a separate issue as this is definitely a concern but not related to any of the changes in this PR. Edit: Never mind, I saw Kilian already opened one so I'll follow up there.
@kilrau
Itās easy to implement
Then I'd say let's do it!
See https://github.com/ExchangeUnion/xud/pull/2009/commits/8e7ba3180766589fd93fd43d7e18604130a68917. Feel free to suggest changes.
Code & original concept look good to me, however I'm unsure about the newly added logic to make peers secondary (or make us not connect at all). For one thing, we are defacto silently "banning" peers if they are outbound and we don't have listening addresses, because under the secondary threshold we stop automatically connecting to them. But beyond that, I thought the purpose of this PR was to deal with a flood of bad connection attempts, but a low reputation score doesn't necessarily mean we'll have any trouble connecting to the peer.
We just discussed this on the daily call and this behaviour is fine as-is, since we eventually want to stop connecting to permanently dead peers (and there are quite some) and our default setups all have valid listening addresses, so if a long-dead peer comes back to life after several weeks or months, it will reconnect to us.
What I actually thought @kilrau was suggesting originally was to make 200/1500 consecutive failed connection attempts a negative reputation event that lowers a node's score. That makes a little more sense to me (we don't want peers that are habitually unreachable over the long run), although I don't think it's necessary in this PR.
We discussed this too, hopefully a simple change: remove reputation events from this PR. (Reason: we are not sure if reconnection attempts/down-time should be immediately tied to reputation)
We discussed this too, hopefully a simple change: remove reputation events from this PR. (Reason: we are not sure if reconnection attempts/down-time should be immediately tied to reputation)
Connection failures are currently not tied to reputation, but ranking does consider both connection failures and reputation. Are suggesting to have ranking consider only the connection failures (as before my last commit)?
Are suggesting to have ranking consider only the connection failures (as before my last commit)?
Yes!
Think we can get this in after rebase? @sangaman
Closing https://github.com/ExchangeUnion/xud/issues/1970.
Few remarks:
offline
/dead
terminology wasn't used. Instead,consecutiveConnFailure
counter was added to known nodes, and a ranking function which works in accordance with some thresholds.consecutiveConnFailure
counter (it's possible to change that).consecutiveConnFailure
cannot be correlated to time, so simpler thresholds (200 & 1500) than what was specified in https://github.com/ExchangeUnion/xud/issues/1970#issuecomment-733626904 were used.