Closed: synctext closed this issue 5 years ago.
:fearful: :dizzy_face: :fearful: 2447 peers, what does that even mean?
Wow, 2447 peers is clearly not OK. If I remember correctly, the number of peers in a community should be capped, especially for non-discovery communities. The `max_peers` parameter passed when initializing a community should indicate the maximum number of peers in that community.
Having so many open connections to others could lead to unexpected behavior. For example, having many peers in the DHT community might make you an attractive target for value lookups, essentially DDoSing you. It also seems that the system ran out of file descriptors: `[Errno 11] Resource temporarily unavailable`.
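As a quick way to check descriptor pressure, one can compare the process's open descriptor count with the soft `RLIMIT_NOFILE` limit; a minimal, Linux-only sketch (not part of Tribler):

```python
import os
import resource  # Unix-only module

# Soft/hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# /proc/self/fd is Linux-specific; on other systems use a tool like lsof.
open_fds = len(os.listdir("/proc/self/fd"))

print(f"{open_fds} open file descriptors, soft limit {soft}")
```

Once `open_fds` approaches `soft`, new sockets start failing with errors like the `EAGAIN`/`EMFILE` family seen above.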
I think this is a high priority issue that should be fixed before releasing 7.1.6.
😨 😵 😨 2447 peers, what does that even mean?
You're keeping 2447 connections to others open.
All peers you gather beyond 20 are introductions by others; you no longer perform walks yourself at that point. Even by just passively being introduced by others, you have managed to gather quite a following.
To solve this we would need to start kicking peers out of communities, which in turn may lead to IPv8 kicking out someone you were performing a transaction with (for example, in the market).
I just created the 7.1.6 milestone and assigned some open PRs/issues to it which I think should be included in the release.
After sleeping on this, I think it would be best to introduce a `max_peers` per community. If you hit `max_peers`, you stop answering introduction requests. This is not ideal, but it will work.
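As a sketch of that idea (using a hypothetical `CommunityStub`, not the real IPv8 `Community` class), hitting `max_peers` simply stops introduction requests from being answered:

```python
# Minimal sketch of capping a community's peer set; real IPv8
# message handling and networking are omitted.
class CommunityStub:
    def __init__(self, max_peers=20):
        self.max_peers = max_peers
        self.peers = set()

    def on_introduction_request(self, peer):
        # At capacity: silently ignore the request instead of answering.
        if len(self.peers) >= self.max_peers:
            return None
        self.peers.add(peer)
        return f"introduction-response to {peer}"

community = CommunityStub(max_peers=2)
print(community.on_introduction_request("peer-a"))  # answered
print(community.on_introduction_request("peer-b"))  # answered
print(community.on_introduction_request("peer-c"))  # None: at capacity
```

Silently dropping requests keeps the peer set bounded, at the cost of making a full peer invisible to newcomers, which is exactly the bias concern raised below.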
A healthy overlay is essential; could we have a few thousand instances with real churn in an IPv8 integration test? The AllChannel experiment tracked this specifically and was bug-fixed a few times because of it. Losing randomness and creating bias is easily done and fatal. A good design and implementation does not need a regulating variable like "do-not-introduce-me-further, I'm full". With such solutions we lose the ability to get answers, which makes us blind to bugs and to overall system performance.
@synctext yes we should work towards a large-scale GigaChannel experiment, where we deploy 1000 Tribler instances on the DAS5 which communicate and synchronize channels/torrents with each other. This should be a nightly job. Getting this up and running might take some time, however.
30 sockets opened by core..
@synctext in fact, this number shows the total number of open file descriptors, which also includes Python 2.7 library files and database files which are opened for reading. You should see them if you expand the view. Also, there are many processes in Tribler which try to open connections to remote peers (DHT lookups, libtorrent, TFTP, Dispersy, IPv8 etc).
Should be solved with the next IPv8 pointer update.
Sorry, the overlay is more broken than we perhaps realised. After 23 hours of uptime on Mac:
@synctext it seems that you are running an older version of Tribler (7.1.4), which does not contain the IPv8 fixes that should resolve this issue. Could you try installing Tribler from this build (v7.2.0-rc1): https://jenkins-ci.tribler.org/job/Build-Tribler_release/job/Build-Custom/1/ and check whether it resolves the issue?
Also, I will integrate an IPv8 peer count monitor in the application tester, so we can get overlay statistics over longer periods of time 👍
Running Ubuntu 7.1.5 with a clean megacache for 12 hours reproduces this issue. We need a Gumby test for live-edge walking to design a load-balancing algorithm (or to determine another root cause of failure). Small token footprint after donating for 12 h:
This was only fixed in 7.1.6/7.2.0 as @devos50 pointed out, please reproduce this using the latest version. Also, I sincerely doubt this is due to the live edge walker.
The number of peers in each community is now monitored and plotted during the application tests. This should help us to debug this kind of overlay behavior.
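A monitor along those lines can be as simple as periodically sampling a peer count over time; a minimal sketch, where `get_peer_count` is a hypothetical stand-in for whatever the application tester exposes:

```python
import time

def sample_peer_counts(get_peer_count, interval=1.0, samples=5):
    """Collect (elapsed_seconds, peer_count) pairs for later plotting."""
    start = time.time()
    series = []
    for _ in range(samples):
        series.append((round(time.time() - start, 1), get_peer_count()))
        time.sleep(interval)
    return series

# Usage with a dummy callback that pretends the overlay slowly fills up:
fake_counts = iter([0, 3, 7, 12, 20])
print(sample_peer_counts(lambda: next(fake_counts), interval=0.01))
```

Plotting such a series per community makes a stuck-at-zero overlay (like the `TrustChainCommunity` case below) stand out immediately.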
This is now the last open issue before we can release V7.2. ToDo:
- remove the temporary no-response fix (which impacts DHT performance).
- repair with load balancing or something nice.
I cannot disagree more. This is analogous to (1) unlocking the doors and (2) educating everyone on how bad crime is for society.
Load balancing is not a solution to counter attacks by malicious nodes. This "temporary fix" is our only line of defense against being completely DOS-attacked to death.
7.2.0-RC1 has an overlay bug. Statistics after 20 minutes of uptime:
@synctext that's the expected healthy peer count.
@qstokkink @synctext so are we going to postpone this issue to 7.3?
Well, the original issue seems under control now. We can make a new issue to explore other means of walking and load balancing, and close this one.
@qstokkink Do we still have a (performance) bug? The TrustchainCommunity ID got changed, but there should be several other peers by now? With v7.2 having hit our frontpage some time ago, we have 100+ installs that are not finding each other. A lot of people try out our latest release within hours. All communities are healthy after a while, except one. The computer will be online all night; if you can't find them, we have an issue.
Same issue here:
The number of peers in the `TrustChainCommunity` stays at zero, even after some time. This is not captured by our automated tests since they operate on the testnet and, for some reason, can find other peers there (see this screenshot).
While this is a bug, having no peers in the TrustChain community does not affect most of the core functionality of Tribler. Payouts when downloading anonymously are fully handled by the `TriblerTunnelCommunity`. Direct payouts first do a DHT lookup and contact the peer with the looked-up IP address/port directly. The only consequence is that users are not actively collecting TrustChain records from other peers right now.
My best guess is that this has something to do with the IPv8 trackers not properly introducing peers to each other. This is an issue we should have caught before releasing. In the future, I would suggest also running the application tester for a short period on the 'live' network (not only on the testnet), plotting the IPv8 overlay statistics and checking whether Tribler is able to correctly find other peers.
Cannot reproduce; I have peers within several seconds, running on the `release-7.2.0` branch:
Does the issue persist? Do you have any log files?
Yes, the issue persists and there are no interesting log entries.
Around one hour ago, I switched the tunnel helper processes on `leaseweb1` to use the latest commit on the `devel` branch, so you might have found these peers? That can be confirmed with a request to `statistics/communities`.
@xoriole could you update the remaining tunnel helpers on the other leaseweb machines so they use the latest commit on `devel`?
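For reference, the peer counts can be pulled out of that endpoint's JSON response; a minimal sketch, where the sample payload and its field names are assumptions for illustration, not the exact REST API schema:

```python
import json

# Hypothetical sample of what a statistics/communities response might
# look like; the field names here are illustrative assumptions.
sample = json.loads("""
{"communities": [
  {"master_peer": "abcd1234", "overlay_name": "TrustChainCommunity", "peers": 0},
  {"master_peer": "ef567890", "overlay_name": "DHTDiscoveryCommunity", "peers": 25}
]}
""")

# Map each overlay name to its current peer count.
counts = {c["overlay_name"]: c["peers"] for c in sample["communities"]}
print(counts)
```

Comparing the `master_peer` values between two instances is the quick way to verify whether they are actually in the same community.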
It does somewhat find peers after 105 minutes, but not many (stats say 495 downloads from GitHub):
Please try to find the root cause of this failure for a 7.2.1 release, if possible by Friday.
Instantly up to 20 now:
Maybe you guys are just too impatient? 😃
@devos50 I've restarted all the tunnel_helpers in the leaseweb machines to latest devel commit
@qstokkink it seems you are in a different community; are you sure you are running the released 7.2 version (since my community ID is different)? After a few minutes, I only have one other discovered peer.
I also have a different community ID (`cdecca487745..`) and 7 peers in 13 minutes.
If it helps any... I'm on the 7.2 release (Windows x64 version) and got 22 peers (up from 20 a bit earlier) on the TrustChainCommunity. My master peer is the same as devos's and synctext's.
Hmm.. there are no punctures being sent, this is worrying:
You can compare this to a healthy overlay:
All overlays are puncturing except the TrustChainCommunity.
However, of course, `TrustChainTestnetCommunity` is perfectly fine.
Found it:
https://github.com/Tribler/py-ipv8/pull/422/commits/135e93979aee921768fcf3b42850e27fdf0c267a
This would affect any TrustChainCommunity subclasses as well.
Nice and healthy on the latest code:
After 2 hours and 10 minutes of uptime.
The fix was included in #4193 and should be part of the next release.
Apologies for bringing the bad news (7.2.1 after 13 h of uptime): a single core at merely 50%, so holding steady.
At first glance, from the screenshot, it would seem that the DHTDiscoveryCommunity keeps piling on new peers. This would make sense, as the DHTDiscoveryCommunity bypasses the IPv8 constraints.
I might actually be wrong, cannot reproduce.
Still cannot reproduce. Second idea: maybe this is due to this version still sharing a peer pool with Dispersy?
Still cannot reproduce after almost an hour:
It seems to increase very slowly with uptime; see the 15-hour sample on Linux. Note this is a fully connectable peer!
Over two hours in (don't mind the down, I'm performing an anon. download):
Are we sure the correct IPv8 version was shipped with `7.2.1`?
@synctext are you running the `.deb` file?
Just over 3 hours in, still nothing out of the ordinary.
Status update 4 hours in, still nothing:
Captain's log, 5 hours in:
Calling it a day, can't reproduce:
I'm sure the latest IPv8 version is shipped with 7.2.1. I just started running Tribler today, will see how it'll progress overnight.
Tribler takes a whole core at nearly 100% after running for days (167 h of CPU time in htop equals a 75% average load). Swap is getting full on this maxed-out box, but that does not explain the user-space CPU usage. Running v7.1.5 mostly idle for 9 days continuously on an Ubuntu 18.04.1 LTS box: never done a single keyword search, two Fedora torrents seeding (0 MB uploaded), but donating encrypted relay bandwidth. Gathering solid credits:
Reasonable amount of memory, but not growing out of control:
Some errors in the logs: