Closed: synctext closed this issue 5 years ago.
:fearful: :dizzy_face: :fearful: 2447 peers, what does that even mean?
Wow, 2447 peers is clearly not OK. If I remember correctly, the number of peers in a community should be capped, especially for non-discovery communities. The `max_peers` parameter passed when initializing a community should indicate the maximum number of peers in that community.
Having so many open connections to others could lead to unexpected behavior. For example, having many peers in the DHT community might make you an attractive target for value lookups, essentially DDoSing you. It also seems that the system ran out of file descriptors: `[Errno 11] Resource temporarily unavailable`.
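As a quick way to check descriptor pressure, one can compare the process's open descriptor count with the soft `RLIMIT_NOFILE` limit; a minimal, Linux-only sketch (not part of Tribler):

```python
import os
import resource  # Unix-only module

# Soft/hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# /proc/self/fd is Linux-specific; on other systems use a tool like lsof.
open_fds = len(os.listdir("/proc/self/fd"))

print(f"{open_fds} open file descriptors, soft limit {soft}")
```

Once `open_fds` approaches `soft`, new sockets start failing with errors like the `EAGAIN`/`EMFILE` family seen above.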
I think this is a high priority issue that should be fixed before releasing 7.1.6.
😨 😵 😨 2447 peers, what does that even mean?
You're keeping 2447 connections to others open.
All peers you gather beyond 20 are introductions by others; you no longer perform walks yourself at that point. Even by just passively being introduced by others, you have managed to gather quite a following.
To solve this we would need to start kicking peers out of communities, which in turn may lead to IPv8 kicking out someone you were performing a transaction with (for example, in the market).
I just created the 7.1.6 milestone and assigned some open PRs/issues to it which I think should be included in the release.
After sleeping on this, I think it would be best to introduce a `max_peers` per community. If you hit `max_peers`, you stop answering introduction requests. This is not ideal, but it will work.
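As a sketch of that idea (using a hypothetical `CommunityStub`, not the real IPv8 `Community` class), hitting `max_peers` simply stops introduction requests from being answered:

```python
# Minimal sketch of capping a community's peer set; real IPv8
# message handling and networking are omitted.
class CommunityStub:
    def __init__(self, max_peers=20):
        self.max_peers = max_peers
        self.peers = set()

    def on_introduction_request(self, peer):
        # At capacity: silently ignore the request instead of answering.
        if len(self.peers) >= self.max_peers:
            return None
        self.peers.add(peer)
        return f"introduction-response to {peer}"

community = CommunityStub(max_peers=2)
print(community.on_introduction_request("peer-a"))  # answered
print(community.on_introduction_request("peer-b"))  # answered
print(community.on_introduction_request("peer-c"))  # None: at capacity
```

Silently dropping requests keeps the peer set bounded, at the cost of making a full peer invisible to newcomers, which is exactly the bias concern raised below.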
A healthy overlay is essential; could we have a few thousand instances with real churn in an IPv8 integration test? The AllChannel experiment tracked this specifically and was bug-fixed a few times because of it. Losing randomness and creating bias is easily done and fatal. A good design and implementation does not need a regulating variable like "do-not-introduce-me-further, I'm full". With such solutions we lose the ability to get answers, which makes us blind to bugs and to overall system performance.
@synctext yes we should work towards a large-scale GigaChannel experiment, where we deploy 1000 Tribler instances on the DAS5 which communicate and synchronize channels/torrents with each other. This should be a nightly job. Getting this up and running might take some time, however.
30 sockets opened by core..
@synctext in fact, this number shows the total number of open file descriptors, which also includes Python 2.7 library files and database files which are opened for reading. You should see them if you expand the view. Also, there are many processes in Tribler which try to open connections to remote peers (DHT lookups, libtorrent, TFTP, Dispersy, IPv8 etc).
Should be solved with the next IPv8 pointer update.
Sorry, the overlay is more broken than we perhaps realised. After 23 hours of uptime on Mac:
@synctext it seems that you are running an older version of Tribler (7.1.4), which does not contain the IPv8 fixes that should resolve this issue. Could you try installing Tribler from this build (v7.2.0-rc1): https://jenkins-ci.tribler.org/job/Build-Tribler_release/job/Build-Custom/1/ and check whether it resolves the issue?
Also, I will integrate an IPv8 peer count monitor in the application tester, so we can get overlay statistics over longer periods of time 👍
Running Ubuntu 7.1.5 with a clean megacache for 12 hours reproduces this issue. We need a Gumby test for live-edge walking to design a load-balancing algorithm (or to determine another root cause of failure). Small token footprint after donating for 12 h:
This was only fixed in 7.1.6/7.2.0 as @devos50 pointed out, please reproduce this using the latest version. Also, I sincerely doubt this is due to the live edge walker.
The number of peers in each community is now monitored and plotted during the application tests. This should help us to debug this kind of overlay behavior.
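A monitor along those lines can be as simple as periodically sampling a peer count over time; a minimal sketch, where `get_peer_count` is a hypothetical stand-in for whatever the application tester exposes:

```python
import time

def sample_peer_counts(get_peer_count, interval=1.0, samples=5):
    """Collect (elapsed_seconds, peer_count) pairs for later plotting."""
    start = time.time()
    series = []
    for _ in range(samples):
        series.append((round(time.time() - start, 1), get_peer_count()))
        time.sleep(interval)
    return series

# Usage with a dummy callback that pretends the overlay slowly fills up:
fake_counts = iter([0, 3, 7, 12, 20])
print(sample_peer_counts(lambda: next(fake_counts), interval=0.01))
```

Plotting such a series per community makes a stuck-at-zero overlay (like the `TrustChainCommunity` case below) stand out immediately.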
This is now the last open issue before we can release V7.2. ToDo:
- remove the temporary no-response fix (which impacts DHT performance).
- repair with load balancing or something nice.
I cannot disagree more. This is analogous to (1) unlocking the doors and (2) educating everyone on how bad crime is for society.
Load balancing is not a solution to counter attacks by malicious nodes. This "temporary fix" is our only line of defense against being completely DOS-attacked to death.
7.2.0-RC1 has an overlay bug. Statistics after 20 minutes of uptime:
@synctext that's the expected healthy peer count.
@qstokkink @synctext so are we going to postpone this issue to 7.3?
Well, the original issue seems under control now. We can make a new issue to explore other means of walking and load balancing, and close this one.
@qstokkink Do we still have a (performance) bug? The TrustchainCommunity ID got changed, but there should be several other peers by now? With v7.2 having hit our frontpage some time ago, we have 100+ installs that are not finding each other. A lot of people try out our latest release within hours. All communities are healthy after a while, except one. The computer will be online all night; if you can't find them, we have an issue.
Same issue here:
The number of peers in the `TrustChainCommunity` stays at zero, even after some time. This is not captured by our automated tests since they operate on the testnet and, for some reason, can find other peers there (see this screenshot).
While this is a bug, having no peers in the TrustChain community does not affect most of the core functionality of Tribler. Payouts when downloading anonymously are fully handled by the `TriblerTunnelCommunity`. Direct payouts first do a DHT lookup and contact the peer with the looked-up IP address/port directly. The only consequence is that users are not actively collecting TrustChain records from other peers right now.
My best guess is that this has something to do with the IPv8 trackers not properly introducing peers to each other. This is an issue we should have caught before releasing. In the future, I would suggest also running the application tester for a short period on the 'live' network (not only on the testnet), plotting the IPv8 overlay statistics and checking whether Tribler is able to correctly find other peers.
Cannot reproduce; I have peers within several seconds, running on the `release-7.2.0` branch:
Does the issue persist? Do you have any log files?
Yes, the issue persists and there are no interesting log entries.
Around one hour ago, I switched the tunnel helper processes on `leaseweb1` to use the latest commit on the `devel` branch, so you might have found these peers? That can be confirmed with a request to `statistics/communities`.
@xoriole could you update the remaining tunnel helpers on the other leaseweb machines so they use the latest commit on `devel`?
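For reference, the peer counts can be pulled out of that endpoint's JSON response; a minimal sketch, where the sample payload and its field names are assumptions for illustration, not the exact REST API schema:

```python
import json

# Hypothetical sample of what a statistics/communities response might
# look like; the field names here are illustrative assumptions.
sample = json.loads("""
{"communities": [
  {"master_peer": "abcd1234", "overlay_name": "TrustChainCommunity", "peers": 0},
  {"master_peer": "ef567890", "overlay_name": "DHTDiscoveryCommunity", "peers": 25}
]}
""")

# Map each overlay name to its current peer count.
counts = {c["overlay_name"]: c["peers"] for c in sample["communities"]}
print(counts)
```

Comparing the `master_peer` values between two instances is the quick way to verify whether they are actually in the same community.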
It does somewhat find peers after 105 minutes, but not many (stats say 495 downloads from GitHub):
Please try to find the root cause of this failure for a 7.2.1 release, if possible by Friday.
Instantly up to 20 now:
Maybe you guys are just too impatient? 😃
@devos50 I've restarted all the tunnel_helpers in the leaseweb machines to latest devel commit
@qstokkink it seems you are in a different community; are you sure you are running the released 7.2 version (since my community ID is different)? After a few minutes, I only have one other discovered peer.
I also have a different community ID (`cdecca487745..`) and 7 peers in 13 minutes.
If it helps any... I'm on the 7.2 release (Windows x64 version) and got 22 peers (up from 20 a bit earlier) on the TrustChainCommunity. My master peer is the same as devos's and synctext's.
Hmm.. there are no punctures being sent, this is worrying:
You can compare this to a healthy overlay:
All overlays are puncturing except the TrustChainCommunity.
However, of course, `TrustChainTestnetCommunity` is perfectly fine.
Found it:
https://github.com/Tribler/py-ipv8/pull/422/commits/135e93979aee921768fcf3b42850e27fdf0c267a
This would affect any TrustChainCommunity subclasses as well.
Nice and healthy on the latest code:
After 2 hours and 10 minutes of uptime.
The fix was included in #4193 and should be part of the next release.
Apologies for bringing the bad news (7.2.1 after 13 h of uptime): a single core at merely 50%, so holding steady.
At first glance, from the screenshot, it would seem that the DHTDiscoveryCommunity keeps piling on new peers. This would make sense, as the DHTDiscoveryCommunity bypasses the IPv8 constraints.
I might actually be wrong, cannot reproduce.
Still cannot reproduce. Second idea: maybe this is due to this version still sharing a peer pool with Dispersy?
Still cannot reproduce after almost an hour:
It seems to increase very slowly with uptime; see the 15-hour sample on Linux. Note this is a fully connectable peer!
Over two hours in (don't mind the down, I'm performing an anon. download):
Are we sure the correct IPv8 version was shipped with `7.2.1`?
@synctext are you running the `.deb` file?
Just over 3 hours in, still nothing out of the ordinary.
Status update 4 hours in, still nothing:
Captain's log, 5 hours in:
Calling it a day, can't reproduce:
I'm sure the latest IPv8 version is shipped with 7.2.1. I just started running Tribler today, will see how it'll progress overnight.
Tribler takes a whole core at nearly 100% after running for days (167 h of CPU time in htop equals a 75% average load). Swap is getting full on this maxed-out box, but that does not explain the user-space CPU usage. Running v7.1.5 mostly idle for 9 days continuously on an Ubuntu 18.04.1 LTS box: never done a single keyword search, two Fedora torrents seeding (0 MB uploaded), but donating encrypted relay bandwidth. Gathering solid credits:
Reasonable amount of memory, but not growing out of control:
Some errors in the logs: