Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0
4.81k stars 446 forks source link

Downloads dropping to zero bytes/sec #3379

Closed synctext closed 5 years ago

synctext commented 6 years ago

To reproduce:

devos50 commented 6 years ago

We haven't been able to accurately reproduce this so far. Let's postpone it to 7.1.

Dmole commented 6 years ago

Maybe related; The thread count (as reported by Activity Monitor) increases up to ~1000 threads for ~1 peer then drops down to ~500 or ~200, and repeats. Only on the lower PID (I think that's the UI process). Debug>System>threads only lists 14 though (likely some of the 26 in the core process).

synctext commented 6 years ago

Brainstorming on a long-term solution... Having a list of 100 magnet links of http://torrent.ubuntu.com/releases/artful/release/server/, or http://bt.etree.org/index.php?sort=seeders then see if this breaks things. Just download them all 100 concurrently.

synctext commented 6 years ago

Reproduced... Does a whopping 4MByte/sec one month, then "abrupt stop". Now shows the core not responding or something.

screen shot 2018-03-28 at 17 11 39
synctext commented 6 years ago

Another observation using standard top on the "abrupt stop" issue.

Normally this Mac has lots of threads running, in nothing running state. The Tribler core process below only seems to use 21 #ports. When the Tribler core is 90% in the stuck state, it's clear we're in trouble. Thus we have an alternative for a continuous keep-alive ping to check the core.

screen shot 2018-03-30 at 00 04 00
devos50 commented 6 years ago

@xoriole @synctext is this issue still valid?

Dmole commented 6 years ago

Sorry All my Apple products run Linux now.

devos50 commented 6 years ago

I think I managed to (somewhat) reproduce this with the application tester. After running the application tester for 16 hours on our Mac, I observed the following:

speed_down

Note that it could also be because the downloads are completed and the application tester is not removing them correctly. Unfortunately, I didn't monitor the progress of the downloads so I extend the application tester to do so.

devos50 commented 6 years ago

The next step to debug this issue, as discussed yesterday, is to download some torrents without anonymity and plot the same graph. This requires some changes to the application tester.

synctext commented 6 years ago

Mac performance of 7.1.0 RC1 release is very impressive. Trying to crash it, but did not succeed yet. CPU usage basically collapsed by upgrading from the Beta to RC1! Tribler no longer consumes the majority of a CPU core. Old Mac laptop did not even kick in the fan. This has never happened. Next is to even remove the 50 max. socket connections in the settings. See direct download in 3 large swarms. Please continue with the application tester, we need more soak testing and multi-platform test coverage.

After 13 years of trying, it seems that we finally have something that is becoming usable.

screen shot 2018-09-12 at 22 13 55
synctext commented 5 years ago

This is causing trouble for our users. Escalating this to top-priority, as it severly impact usability. https://forum.tribler.org/t/tribler-starts-up-finds-seeders-starts-downloading-and-after-2-minutes-stops-and-all-connections-are-down-to-0/4247/2

xoriole commented 5 years ago

Some observations:

synctext commented 5 years ago

No, bug is when connected to 50+ peers. Running smooth, then sudden drop to speed of 0.

xoriole commented 5 years ago

Yes, GUI does show a high number of seeds/peers but the actual libtorrent peers slowly reduces to zero. An example of this case is shown in this screenshot below. large-swarm

devos50 commented 5 years ago

@Captain-Coder also noticed this bug; he will also try to reproduce/investigate it 👍

Dmole commented 5 years ago

synctext, xoriole is correct the 349 peer count is # of peers found, the list of 0 is the for currently connected peers. I listed some mitigating features here; https://github.com/Tribler/tribler/issues/3857

synctext commented 5 years ago

Progress! Tested custom build. TCP-only is better. Still very eradic speeds. This might be a case of CPU hogging. Then utp packet get delivered late, measured with high latency and agressive throttle kicks in. Please try to reproduce on Jenkins at some point.

UPDATE: incorrect, the TCP-only version also has bandwidth-crash problem.

devos50 commented 5 years ago

Interesting that TCP-only works better. At this point, it might be helpful to make a libtorrent-only package (with PyInstaller), start a single download using that library, and compare the single-download speed to that of a download in Tribler. This could reveal whether the problem is in libtorrent or Tribler (I suspect the latter).

devos50 commented 5 years ago

Changed the title of this issue to make it more specific. Also, it does not only happen on Mac, but also on Windows/Linux (as reported on the forum).

synctext commented 5 years ago

ToDo: please create infrastructure to reproduce. Have all our 1 Gbps servers and few laptops seed a 5GB swarm. Both 1 swarm with no trackers and 1 swarm with trackers.

devos50 commented 5 years ago

Created a minimal libtorrent application to hopefully reproduce this bug (code can be found here). The binaries can be found on Jenkins (MacOS and Windows 64-bit). I've started this application on our Mac Tester, it is running for 1.5h+ now without a drop in download speed.

TODO: make sure that the libtorrent session configuration is as close as possible to the Tribler one.

qstokkink commented 5 years ago

Has been running about 10 minutes on Windows before freeze.

synctext commented 5 years ago

What is dmg or something in these files? Libtorrentminimal.dms has issues

synctext commented 5 years ago

Running the app. Not stuck, but very sensitive to wifi changes. Bizar! Bandwidth collapses if you move the laptop, even if you still have 4 full signal strength bars. Recovers in 20-30 seconds.

devos50 commented 5 years ago

Hmm that's interesting. Do you see the console output? I'm printing alerts of the error category, but I will change the app so we can see the performance alerts. It might help us to get a bit more insight in what's going on.

Today, the app was running for over 4 hours without any problems on the Mac tester (i.e. downloads never dropped to zero for unknown reasons, except when the download got removed and started again). One thing I would like to add, is that it writes the download speed to a separate file, which can then be plotted with this R script. Next, run it on all platforms we have, for prolonged periods of time. Also, we need to bundle the .torrent files since the request to http://releases.ubuntu.com sometimes fails, resulting in a download state we cannot recover from.

devos50 commented 5 years ago

Also, running the application is a bit non-trivial. You should download the archive from Jenkins (which includes the .app file), navigate in the .app file itself, go to the MacOS directory and run the libtorrentminimal binary from the terminal. When executing the binary, you should specify the path where you want to save the torrents as the first program argument.

synctext commented 5 years ago

Downloaded and reproduced this with latest v7.1.2 version of Tribler. As requested I've downloaded two files and tried to see the effect when reproducing the error. Standard Ubuntu download with plenty of seeds has a "bandwidth crash" from 5.0 MByte/sec to zero. First the health mode:

screenshot_20181029-204651

Then a live capture of the after the crash kicks in (did not manage to capture the crash itself):

https://photos.app.goo.gl/DZau4bevHH7WPt819

xoriole commented 5 years ago

It is different from what I've observed so far. In my case, when the download speed was dropping, the number of connected seeders was dropping with it and there were zero connected peers when download speed became 0kB/s. But what you have is quite different and interesting. The seeders are still connected but not contributing. This gives a feeling that this Tribler peer is being choked by all the connected seeders at the same time which I think is less likely.

synctext commented 5 years ago

Finally reproduced! Seems downloading to USB fails. After the developers dived deeper it seems even naked Libtorrent stalls when writing to an external USB stick. Facinating.

ichorid commented 5 years ago

Seems like most USB drives have a real problem with random writes: usb

synctext commented 5 years ago

if you understand when it fails, Tribler is amazing.. Conducted an experiment to understand if the USB bottleneck theory holds. It is a Windows-compatible large format USB drive, mounted on Mac. In the past I've seen that on Mac the shutdown can take over 30minutes. My understanding was that Tribler has gone in some error state, went corrupt, ran out of file descriptors, something. Nope, it's merely 30minutes of backlog USB stick pending writes; it seems. Experiment:

@ichorid @xoriole Please try to make a reproducible slow IO testing ecosystems for this within Jenkins/nosetest. Do we need something complex like ionice or trickle+NBD?

ichorid commented 5 years ago

cgroups with blkio should be enough for the task of simulating slow disk environment, at least on Linux.

ichorid commented 5 years ago

BTW, the related issue https://github.com/arvidn/libtorrent/issues/1997 was still open at the beginning of this year.

Captain-Coder commented 5 years ago

I experience a drop to zero also, but I expect that in my situation it is actually a different problem. I run on a virtual server (with no USB drives attached), I've checked that I don't run into I/O limits. Infact, iotop shows only idle I/O. I can semi-reliably observe a crash to zero after some 1h to 1h45min after starting tribler. I however my downloads are always stuck at 0 and don't recover. This seems more like what @xoriole described.

I managed to get this log segment of around the time that it happend. DropToZeroLog.txt

Notice that in the beginning there are circuits to send the DHT requests, but after some tunnels get closed, there is no longer a tunnel to send the request on. And TriblerTunnelCommunity keeps telling us it wants zero tunnels.

$ curl -X GET http://localhost:8080/debug/circuits {"circuits": []}

Indeed there are zero tunnels, no tunnels are being built, but there are active downloads.

If I add another download, the tunnel community is triggered to build some more tunnels and downloading resumes on all downloads. After 10 mins this new tunnel is closed due to age and there are 0 tunnels and multiple active downloads again. I expect this is somewhere in the create_engine_wrapper or that huge tunnel community callback that monitors download. The version I used here is recent but not bleeding edge, so please someone tell me if something like this was recently fixed. I'm going to add some debug log lines to debug further.

devos50 commented 5 years ago

@Captain-Coder this is a very interesting observation!

So there are several things here that might lead to this issue. From the log:

2019-01-15T13:20:31+0000 [stderr#error] INFO:TriblerTunnelCommunity:Want 0 data circuits of length 1
2019-01-15T13:20:31+0000 [stderr#error] INFO:TriblerTunnelCommunity:Removing circuit 1682128877 too old
2019-01-15T13:20:32+0000 [stderr#error] INFO:TriblerTunnelCommunity:Removed circuit 1682128877 too old

So it is removing a circuit, and pops it from self.circuits. Five seconds later, the do_circuits is called again and it should create a new circuit. The log says:

2019-01-15T13:20:36+0000 [stderr#error] INFO:TriblerTunnelCommunity:Want 0 data circuits of length 1

So if we look at the code in the do_circuits method, it implies that either settings.max_circuits <= 8 or we have at least 8 circuits already. @egbertbouman any ideas?

I suspect that the DHT lookups are messing with a counter or circuit creation? Tribler should always try to maintain at least 8 circuits at once, for each circuit length.

synctext commented 5 years ago

"TriblerTunnelCommunity keeps telling us it wants zero tunnels" Thnx for insight into this top-priority bug! Great use-case for soak testing e2e anon downloads.

Captain-Coder commented 5 years ago

I added a debug statement to do_circuits. It shows circuits_needed going to 0. There is only one place in the code where this assignment is made triblertunnel/community.py:on_download_removed(). But it is only made 0 when num_hops_by_downloads goes to 0.

I use Tribler headless. So I add downloads by bash scripts & curl, and I have a systemd timer that runs a check for completed downloads and removes them. And the problem occured some 10 minutes after a completed download was removed and the last tunnels have expired. So that coincides with the on_download_removed() code.

So how does the num_hops_by_downloads counter work? It should count the number of active downloads for a given number of hops. It gets reduced when a download is removed. And it gets incremented... only once. This happens in build_tunnels(). Which in turn gets called only once, from deep within can_create_engine_wrapper(). Yes that big mess.

The can_create_engine_wrapper() first checks if there is a need to build tunnels, and then calls build_tunnels(). If there are preexisting tunnels (like a download already going) build_tunnels() will be skipped, num_hops_by_downloads won't be incremented and is no longer accurate. This in turn causes the circuits_needed to go to 0 when a download is removed.

This quite far removed from what @synctext originally described, so perhaps this is a similar issue that should get it's own ticket.

The obvious flaw is that build_tunnels should not be expected to be called exactly once per download. This begs the question of why build_tunnels is bothering to keep track of downloads anyway, it is in the ipv8 part of the tunnel community. The only reason why num_hops_by_downloads is there is for the tunnels_ready() function, which probaly should look to circuits_needed and not num_hops_by_downloads. I think that this could be resolved by moving num_hops_by_downloads to triblertunnel/community.py and finding a suitable place for the increment, or not bothering with keeping track exactly and just set it as a whole in the monitor_downloads() callback. I'd do this, but that is a PR over 2 repos, I have no time and I need to write experiments and text.