Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0

Understanding the impact of latency on Libtorrent throughput #2620

Open synctext opened 7 years ago

synctext commented 7 years ago

Problem: Tor-like tunnels introduce significant latency.

Measure how 25 ms to 1000 ms of latency affects libtorrent throughput. Aim to understand how LEDBAT parameters can be set to maximize throughput. Create an experimental environment using containers: use netem to create latency between two containers, start a seeder in one container, and download across the connecting bridge with added latency.
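A minimal sketch of the netem part (the interface name and the subprocess wrapper are illustrative; the tc/netem syntax itself is standard):

    import subprocess

    # netem delays egress packets on the given interface; apply it on both
    # containers (or double the value) to simulate a full round-trip time.
    def add_latency(interface, delay_ms):
        subprocess.check_call(['tc', 'qdisc', 'add', 'dev', interface,
                               'root', 'netem', 'delay', '%dms' % delay_ms])

    add_latency('veth-seed', 100)  # illustrative interface name, 100 ms each way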

Initial results: performance is severely affected by a latency of just 50 ms.

synctext commented 7 years ago

Current operational measurement scripts: https://github.com/vandenheuvel/libtorrent-latency-benchmark

MaxVanDeursen commented 7 years ago


Measurements of the libtorrent download speed under different latencies with 6 seeders.

vandenheuvel commented 7 years ago

This is relevant for achieving onion routing with defence against traffic confirmation attacks.

synctext commented 7 years ago

ToDo for next meet, from Libtorrent API docs + tuning:

vandenheuvel commented 7 years ago

Investigate max_out_request_queue.

synctext commented 7 years ago

Background: http://blog.libtorrent.org/2015/07/slow-start/ and http://blog.libtorrent.org/2011/11/requesting-pieces/. Easy fix: high_performance_seed returns settings optimized for a seed box that serves many peers and doesn't do any downloading. It has a 128 MB disk cache and a limit of 400 files in its file pool. It supports fast upload rates by allowing large send buffers.

Additional boost: asynchronous disk I/O

Detailed solution: "I’m unable to get more than 20 Mbps with a single peer on a 140 ms RTT link (simulated delay with no packet loss)." Original post

Things you could adjust, according to Arvid Norberg, lead engineer of the libtorrent project:

“Did you increase the socket buffer sizes on both ends?”

int recv_socket_buffer_size;
int send_socket_buffer_size;
“There’s also buffer sizes at the bittorrent level:”

int send_buffer_low_watermark;
int send_buffer_watermark;
int send_buffer_watermark_factor;
“And there are buffers at the disk layer:”

int max_queued_disk_bytes;
int max_queued_disk_bytes_low_watermark;
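A sketch of raising these through the pre-1.1 session_settings API (assuming the deprecated settings()/set_settings() bindings; the values are illustrative, not tuned):

    import libtorrent as lt

    ses = lt.session()
    s = ses.settings()                            # pre-1.1 settings object
    s.recv_socket_buffer_size = 1024 * 1024       # OS-level socket buffers (bytes)
    s.send_socket_buffer_size = 1024 * 1024
    s.send_buffer_watermark = 3 * 1024 * 1024     # bittorrent-level send buffer
    s.max_queued_disk_bytes = 8 * 1024 * 1024     # disk-layer write queue
    ses.set_settings(s)
    # alternatively, if the bindings expose it: ses.set_settings(lt.high_performance_seed())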
vandenheuvel commented 7 years ago

New test using LXCs: ten seeding LXCs, one downloading LXC, single measurement. While high-latency connections start more slowly, they do substantially better once the transfer speed has stabilized. Latencies up to 150 ms perform, at maximum speed, similarly to the base test without any latency. The measurement without latency is very similar to the earlier test using VMs. [graph]

vandenheuvel commented 7 years ago

A single seeding LXC, single measurement. Higher latencies impact throughput heavily. [graph]

synctext commented 7 years ago

ToDo: try to obtain more resilience against latency in libtorrent with a single seeder and single leecher. Plus, read current research on traffic correlation attacks. The basics are covered here. Quote: 'recent stuff is downright scary, like Steven Murdoch's PET 2007 paper about achieving high confidence in a correlation attack despite seeing only 1 in 2000 packets on each side'.

synctext commented 7 years ago

Strange observation that it takes 60 to 100 seconds for the speed to pick up. Is the seeder side the bottleneck, due to anti-freeriding mechanisms? Please repeat multiple times and create boxplots.

1 iteration to find the magic bottleneck...

vandenheuvel commented 7 years ago

Good news: it appears that the magic bottleneck is identified. Plot of a single seeder, single leecher, 200 ms latency, no reordering, default settings: [graph] Throughput is mostly 15 MB/s. Now with the default and max parameters of net.ipv4.tcp_rmem and net.ipv4.tcp_wmem doubled (measured in bytes): [graph] We notice that throughput also doubles, to roughly 30 MB/s. The bad news, however, is that further increasing these parameters has little effect: throughput rarely passes 35 MB/s. Also, the inconsistency in these measurements is still unexplained.
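For reference, a sketch of that tuning (run as root; each triple is min/default/max in bytes, and the exact values here are illustrative rather than the measured ones):

    # Double the kernel's default and max TCP buffer sizes, both directions.
    for path, triple in [('/proc/sys/net/ipv4/tcp_rmem', '4096 174760 12582912'),
                         ('/proc/sys/net/ipv4/tcp_wmem', '4096 32768 8388608')]:
        with open(path, 'w') as f:
            f.write(triple)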

synctext commented 7 years ago

good! next bottleneck...

synctext commented 7 years ago

The magic parameter settings have now been discovered, resulting in 35 MByte/s libtorrent throughput. Next steps are to carry this throughput over to Tribler and to Tribler+Tor-like circuits.


Please document your LXC containers. ToDo: Tribler 1 seeder, 1 leecher; see influence of blocking SQLite writes on performance...

@qstokkink understands the details of tunnel community...

qstokkink commented 7 years ago

You will probably want a repeatable experiment using the Gumby framework. You are in luck, I created a fix for the tunnel tests just a few weeks ago: https://github.com/qstokkink/gumby/tree/fix_hiddenseeding . You can use that branch to create your own experiment, extending the hiddenservices_client.py HiddenServicesClient with your own experiment and your own scenario file (1 seeder, 1 leecher -> see this example for inspiration).

Once you have the experiment running (which is a good first step before you start modifying things - you will probably run into missing packages/libraries etc.), you can edit the TunnelCommunity class in the Tribler project.

If you want to add delay: you can combine the relaying nodes and the sending node into one by adding delay here (which does not include the exit node).

synctext commented 7 years ago

1 Tribler seeder + 1 Tribler leecher with normal BitTorrent and 200 ms - 750 ms latency is limited by congestion control. Read the details here. To push Tribler towards 35 MB/s with added Tor-like relays, we will probably need, at some point this year, to see the internal state of the congestion control loop.

ToDo for 2017: measure congestion window (cwnd) statistics during hidden seeding.

vandenheuvel commented 7 years ago

We managed to do some new tests. We ran Tribler with only the HTTP API and libtorrent enabled. We found that the performance of libtorrent within Tribler is significantly worse than that of plain libtorrent. Below is a summary of the results so far. Note that these values are for a single-seeder, single-leecher test.

                 libtorrent   Tribler
no latency       ~160 MB/s    30 - 100 MB/s
200 ms           ~15 MB/s     ~2.5 MB/s
200 ms + magic   ~35 MB/s     ~2.5 MB/s

Note that "magic" is the increasing of net.ipv4.tcp_rmem and net.ipv4.tcp_wmem parameters. It appears that Tribler suffers from a different bottleneck. Note that when testing without latency, speed varies heavily between ~30 MB/s and ~100 MB/s. During the all tests, cpu load was ~25% on 3 cores. zerolatency4gb

[graph] @synctext @devos50 @qstokkink does anyone have any ideas what might cause this?

qstokkink commented 7 years ago

@vandenheuvel Assuming all libtorrent versions are the same etc.: the only way Tribler interfaces with a libtorrent download is by retrieving its status every second (handle.status()) and by handling alerts.

After writing that I found this in the libtorrent manual:

Note

these calls are potentially expensive and won't scale well with lots of torrents. If you're concerned about performance, consider using post_torrent_updates() instead.

Even though this shouldn't be that bad, you could try writing a loop which gets the torrent handle status every second in your naked experiment and see how that affects things.
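A minimal sketch of such a loop (assuming h is an existing torrent handle from the plain script; it makes the same call Tribler makes every second):

    import time

    # Poll the handle once per second, mimicking Tribler's stats retrieval.
    while not h.is_seed():
        st = h.status()
        print('%8.1f kB/s down, state: %s' % (st.download_rate / 1024.0, st.state))
        time.sleep(1)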

devos50 commented 7 years ago

@vandenheuvel in addition, we are also processing all libtorrent alerts every second but I don't think this leads to much overhead actually. Could you try to disable the alert processing (by commenting out this line: https://github.com/Tribler/tribler/blob/devel/Tribler/Core/Libtorrent/LibtorrentMgr.py#L72)?

synctext commented 7 years ago

Very impressive work guys:

                 libtorrent   Tribler
no latency       ~160 MB/s    30 - 100 MB/s
200 ms           ~15 MB/s     ~2.5 MB/s
200 ms + magic   ~35 MB/s     ~2.5 MB/s

Tribler shamefully collapses. Clearly something to dive into! Did the tip of fully disabling the stats (or perhaps sampling them every 5 seconds) lead to any results? Btw, can you also expand this Tribler pain table with 10, 25, and 75 ms latency data points?

MaxVanDeursen commented 7 years ago

Our test results show a download speed of ~2.5 MB/s at 200 ms with our plain script as well, once we introduce the EPoll reactor into the code. This is similar to the results found in the previous tests with Tribler. However, tests with our plain script and the Select reactor show the original results that we obtained before introducing the reactor, or even higher: a top speed of 30 MB/s. The next thing on our list is testing Tribler through Twisted with the Select reactor.

qstokkink commented 7 years ago

Interesting. Could you also try the normal Poll reactor (it is supposed to be faster than Select for large socket counts)?
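For reference, a sketch of how a specific Twisted reactor is selected; the install call must run before anything imports twisted.internet.reactor:

    # Pick exactly one; installing after the default reactor is loaded fails.
    from twisted.internet import pollreactor
    pollreactor.install()
    # from twisted.internet import epollreactor; epollreactor.install()
    # from twisted.internet import selectreactor; selectreactor.install()

    from twisted.internet import reactor  # now the chosen reactor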

vandenheuvel commented 7 years ago

Strangely enough, results are quite different between our script and Tribler. Summary:

          EPollReactor   SelectReactor   PollReactor
script    2.5 MB/s       32 MB/s         16 MB/s
Tribler   2.5 MB/s       2.5 MB/s        2.5 MB/s

It may take a little while for the download to come up to speed (~ 60 seconds), but after that the throughput is quite steady.

Our next step will be profiling.

synctext commented 7 years ago

32 MByte/sec. So.. Python3 and 200 ms latency results next. This fascinating mystery deepens. Please make a profile print with human-readable thread-name printouts.
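A sketch of one way to produce such a printout (run_experiment is a placeholder for whatever starts the reactor and the download):

    import cProfile
    import pstats
    import threading

    profiler = cProfile.Profile()
    profiler.enable()
    run_experiment()   # placeholder: start reactor + download, return when done
    profiler.disable()

    # print the names of the threads that were alive, then the hottest calls;
    # note cProfile only traces the calling thread, so per-thread numbers
    # need a multi-thread profiler such as yappi
    print('threads: %s' % [t.name for t in threading.enumerate()])
    pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)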

vandenheuvel commented 7 years ago

We just ran our script under both Python 2.7 and Python 3.5; this made no difference for the SelectReactor.

MaxVanDeursen commented 7 years ago

Due to an LXC update, our test results have changed drastically. The newest test results, using a latency of 200 ms unless otherwise mentioned, are:

         No Reactor (no delay)   No Reactor (200 ms)
Script   ~400 MB/s               ~32 MB/s

All results below were likewise produced by a modified script (200 ms):

              EPollReactor   SelectReactor   PollReactor
Inactive      ~32 MB/s       ~16 MB/s        ~16 MB/s
Semi-Active   ~32 MB/s       ~16 MB/s        ~16 MB/s

Notes:

synctext commented 7 years ago

hmmm. so the lower non-script table is all Tribler?

vandenheuvel commented 7 years ago

In the above post, all results are from our own script. We retested everything non-Tribler. We're not sure what this change in results (especially the peak performance of No Reactor without delay, and the EPollReactor results) tells us about the quality of our current testing method... These changes are enormous.

vandenheuvel commented 7 years ago
                 EPollReactor   SelectReactor   PollReactor
Tribler 0 ms     ~100 MB/s      ~130 MB/s       ~80 MB/s
Tribler 200 ms   2.5 MB/s       2.5 MB/s        2.5 MB/s

The PollReactor without latency varied wildly; the other measurements were steady. Sadly, these results agree with the previous results from before the LXC update. We will now try to bring our script and Tribler closer together by using the reactor thread to start libtorrent in our script.

synctext commented 7 years ago

Impressive exploration. Good steps towards the final goal: make Tribler latency-tolerant, add big buffers, and remix anon tunnel packets.

synctext commented 7 years ago

please read

Two Cents for Strong Anonymity: The Anonymous Post-office Protocol. "AnonPoP offers strong anonymity against strong, globally eavesdropping adversaries, that may also control multiple AnonPoP's servers, even all-but-one servers in a mix-cascade."

synctext commented 7 years ago

Today's conclusion: Tribler drops a factor of 40 in performance when latency is added between a single seeder and a single downloader (tested with a 4 GByte file and 200 ms latency). Pure libtorrent plus Twisted (no Tribler) drops a factor of 25 when latency is added.

Tribler can magically start libtorrent from Twisted; however, this fails to be reproducible. Feedback from a Twisted developer and contact with Arvid: http://twistedmatrix.com/trac/ticket/9044 http://stackoverflow.com/questions/42351473/using-libtorrent-session-with-twisted-stuck-on-checking-resume-data

ToDo: refactor the Core.Libtorrent package. Next possible step: performance regression suite (#1287, but not now please). The factor 25-40 issue is the essential one.

synctext commented 7 years ago

@captain-coder remarked that touching Python at all from a non-Twisted thread will cause trouble. Possibly the thing you are seeing.

vandenheuvel commented 7 years ago

@synctext @Captain-Coder What exactly do you mean by that?

Captain-Coder commented 7 years ago

@synctext butchered what I said to him.

He described the problem as: starting an OS thread in a process, having that thread talk to libtorrent, and hosting Python in the same process to do "stuff". I remarked that as soon as the python bindings for libtorrent are loaded into this process (which happens pretty quickly if you import/use/touch Tribler in the hosted Python instance), it might induce a situation where Python structures are modified through callbacks/pointers from libtorrent, initiated from the native OS thread. This violates the properties that the Python GIL is trying to impose and has a good chance of tripping up the Python interpreter (or, even worse, silently corrupting it).

But it's also possible synctext described the problem wrong.

egbertbouman commented 7 years ago

@vandenheuvel @MaxVanDeursen I just fixed the problem in your twisted branch. Now the leecher/seeder is no longer stuck in CHECKING_RESUME_DATA. The fix is:

    def printSpeed(h):
        # Draining the session's alert queue is what un-sticks libtorrent
        # from CHECKING_RESUME_DATA; the status print itself is secondary.
        for alert in ses.pop_alerts():
            status = h.status()
            print 'seeder', status.state, status.progress, alert

You can set the alert mask using session.set_alert_mask. I guess you should be using alerts until the problem is fixed in libtorrent/twisted.
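For instance (a sketch against the pre-1.1 bindings; enables every alert category so pop_alerts() always has something to drain):

    import libtorrent as lt

    ses = lt.session()
    ses.set_alert_mask(lt.alert.category_t.all_categories)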

MaxVanDeursen commented 7 years ago

@egbertbouman Wow... Thanks a lot for that! How did you come to this solution? And do you happen to know why this works?

egbertbouman commented 7 years ago

@MaxVanDeursen You're welcome! Since libtorrent worked in Tribler, I just looked for the differences between the Tribler implementation and yours. I have no idea why this works.

vandenheuvel commented 7 years ago

Thanks to @egbertbouman we have some new results.

                                                EPollReactor   SelectReactor   PollReactor
Script, starting session with Twisted, 0 ms     ~400 MB/s      ~400 MB/s       ~400 MB/s
Script, starting session with Twisted, 200 ms   32 MB/s        16 MB/s         16 MB/s

We conclude that it doesn't matter which thread starts the session. We will start rewriting the Core.Libtorrent package now: this way we can learn more about how Tribler interacts with libtorrent and hopefully discover why Tribler is so slow.

vandenheuvel commented 7 years ago

We did some profiling and found that under both the EPollReactor and the SelectReactor, close to 100% of the time was spent blocking in the reactor's poll call. This seemed normal, as Tribler was basically idle during these tests (apart from downloading via libtorrent at ~2.5 MB/s).

egbertbouman commented 7 years ago

Yesterday, I spent some time testing Tribler vs libtorrent. I used Ubuntu 16.10 in VirtualBox. For libtorrent only I got: [graph]

Then, I changed the code in the master branch to work with Tribler and got this: [graph]

Finally, I removed the flags keyword parameter from ltsession = lt.session(lt.fingerprint(*fingerprint), flags=1), and like magic: [graph]

The difference seems to be that Tribler is not loading some of the default libtorrent features (like UPnP, NAT-PMP, LSD). Very strange...

MaxVanDeursen commented 7 years ago

@egbertbouman Thanks for these hopeful results! Unfortunately, we have not been able to reproduce them on our own system. However, we have noticed that using this flag in our basic script also drags throughput down to 2.5 MB/s. We will look further into why removing this flag does not change throughput for us, although you have shown that it can.

synctext commented 7 years ago

So uTP is enabled, but what does lt.fingerprint(*fingerprint), flags=1 do?

MaxVanDeursen commented 7 years ago

@synctext From the Libtorrent Manual:

If the fingerprint in the first overload is omitted, the client will get a default fingerprint stating the version of libtorrent. The fingerprint is a short string that will be used in the peer-id to identify the client and the client's version ... The flags parameter can be used to start default features (upnp & nat-pmp) and default plugins (ut_metadata, ut_pex and smart_ban). The default is to start those things. If you do not want them to start, pass 0 as the flags parameter.

MaxVanDeursen commented 7 years ago

After the results of last week, we decided to do exhaustive tests on all combinations of our script and Tribler, with and without the flag.

In the table below, the seeder runs different code than the downloader; the diagonal is identical code. All tests have 200 ms latency.

Leecher \ Seeder   Script NoFlag   Tribler NoFlag   Tribler Flag   Script Flag
Script NoFlag      H               L                H              L
Tribler NoFlag     L               L                L              L
Tribler Flag       L               L                L              L
Script Flag        L               L                L              L

L(ow): download speed converges to ~2.5 MB/s. H(igh): download speed converges to well above 2.5 MB/s (i.e. >10 MB/s).

From these results, we cannot draw any obvious conclusion about the effect of this flag.

synctext commented 7 years ago

[graph] Please consider at some point opening up the magic box and diving into these detailed networking statistics. Something is off...

synctext commented 7 years ago

Key problem: discover why Tribler gets 2.5 MByte/s with 200 ms seeder-leecher latency, while clean Twisted-wrapped libtorrent gets 16 MByte/s.

The methodology is to strip the Python wrapping code and stumble towards success: be realistic and systematically brute-force-explore why it doesn't work, since finding the magic fix directly is not realistic at this point. The stripped libtorrent manager is now 40 lines of code, still 2.5 MByte/s. Dispersy is disabled in Tribler.

Is it the database that blocks the whole Twisted thread for 25ms regularly?
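One way to test that hypothesis (a sketch: a high-frequency LoopingCall will fire late whenever something, such as a synchronous SQLite write, blocks the reactor thread):

    import time
    from twisted.internet import reactor, task

    _last = [time.time()]

    def tick():
        # scheduled every 10 ms; any extra delay means the reactor was blocked
        now = time.time()
        blocked_ms = (now - _last[0] - 0.01) * 1000
        if blocked_ms > 5:
            print('reactor blocked for ~%.0f ms' % blocked_ms)
        _last[0] = now

    task.LoopingCall(tick).start(0.01)
    reactor.run()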

synctext commented 7 years ago

Possible experiment to decide whether the bottleneck is within the code, the parameter settings, or both:

Mix and match both ways to see where the fault lies...

arvidn commented 7 years ago

I just read through (most) of this thread. Here are some observations:

  1. the libtorrent python bindings never call back into python from within libtorrent. Early versions attempted to do this, by supporting python extensions. This turned out to cause subtle memory corruption or GIL deadlocks. So, libtorrent is not supposed to interact with python other than through the direct API calls. (so, if you find anything like that, please report a bug!)

  2. libtorrent has built-in instrumentation added specifically to troubleshoot performance issues. In versions before 1.1 it's a build switch (TORRENT_STATS), in 1.1 and later the stats are reported as an alert (which is disabled by default). If these alerts are printed to a log, it can be analyzed by tools/parse_session_stats.py in the libtorrent repo (that script exists pre 1.1 too, but loads the files produced by libtorrent. The gotcha is that the stats files are written to CWD, which needs to be writable by the process). The script requires gnuplot and will render a number of graphs and hook them up to an html page. Looking at them may reveal something.

  3. The meaning of the flags passed to the session constructor is defined here. By default both flags are set (add_default_plugins and start_default_features). By passing in 1, you just add the plugins, without starting DHT, UPnP, NAT-PMP and local peer discovery.
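In the Python bindings that looks like the sketch below (the fingerprint arguments are illustrative; flag values per the definition above):

    import libtorrent as lt

    fp = lt.fingerprint('TL', 1, 0, 0, 0)        # illustrative client id/version
    ses_plugins_only = lt.session(fp, flags=1)   # add_default_plugins only
    ses_default = lt.session(fp, flags=3)        # plugins + default features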

I also have some questions:

  1. Are you testing with TCP or uTP? The uTP implementation in libtorrent could be made to perform better on linux by using sendmmsg() and recvmmsg(), but currently it requires a system call per packet sent or received.

  2. Which version of libtorrent are you using? If you're on 1.1.x, one way to have more control over how the session is set up is to use the constructor that takes a settings_pack. This lets you set up the configuration before starting the session (see the sketch after these questions).

  3. Is the main issue you're looking at the performance difference between your twisted wrapper and Tribler? Or are you looking at understanding the bottlenecks making the 200ms case have so much lower throughput than the 0 ms case? If the latter, would it be useful to you if I would whip up a one-to-one test transfer simulation with various latencies?
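A minimal sketch of that settings_pack-style construction (assuming the 1.1.x Python bindings, where the session constructor accepts a settings dict; keys and values here are illustrative):

    import libtorrent as lt

    settings = {
        'alert_mask': lt.alert.category_t.all_categories,
        'send_socket_buffer_size': 4 * 1024 * 1024,   # bytes
        'recv_socket_buffer_size': 4 * 1024 * 1024,
    }
    ses = lt.session(settings)   # configured before the session starts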

vandenheuvel commented 7 years ago

@arvidn Thanks a lot for these points! You can have a look at our code here.

We're using the latest version from the Ubuntu repositories. Note that we're only trying to explain the difference in performance between Tribler and our script; thus, calls like sendmmsg() and recvmmsg() don't seem that relevant, as we're not making these calls (explicitly) in our fast script.

Using settings_pack might be helpful, but right now we're trying to replicate Tribler's settings in our script to achieve the same performance. @MaxVanDeursen maybe we should just refactor Tribler's code to use settings_pack as a first libtorrent refactoring step and hope that performance improves?

Yes, we're looking to explain this difference under the 200 ms condition, as the difference grows (from 4x to 8x using default settings) with respect to the 0 ms case. This latency (and above) is relevant for tunnel community performance, as we suspect it is the current bottleneck.

For our script (at 200 ms), we already know what the bottleneck is: buffer sizes (see above). Right now, we believe it is most likely that this has to do with settings passed to libtorrent, either when creating the libtorrent session or when adding a torrent.

vandenheuvel commented 7 years ago

Report can be found at overleaf.