Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0

slow anonymous downloads: Crypto CPU bottleneck #1882

Closed synctext closed 6 years ago

synctext commented 8 years ago

The CPU seems to be the reason for slow (<1 MByte/sec) anonymous downloads.

possible problem Running crypto on the Twisted thread blocks all other Tribler activity. It is unclear if we need 256-bit GCM mode. Anything that checks a signature, decrypts a message, etc. needs to be traced.

possible long-term solution A separate thread for the tunnel community, or even an isolated thread for each community. Low-hanging fruit: parallel execution of relaying; make it multi-threaded. Real threads: the Twisted reactor versus import multiprocessing...

goal 1) Benchmark the raw OpenSSL C code GCM throughput in MByte/sec.

2) Create a minimal benchmark comparing the current situation in Tribler with alternatives. Not re-using Tribler code, but a performance test processing 10,000 UDP packets.

EDIT: use 10k UDP packets through exit node as benchmark.
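For goal 1, a micro-benchmark along these lines could measure raw AES-GCM throughput. This is only a sketch: it assumes the third-party `cryptography` package (Python bindings to OpenSSL), and the function name, packet size, and packet count are illustrative choices, not part of the actual experiment.

```python
import os
import time

from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # OpenSSL bindings


def gcm_throughput_mbps(packet_size=1280, packets=10_000, key_bits=256):
    """Encrypt `packets` packet-sized payloads with AES-GCM; return MByte/sec.

    The nonce is reused only because this is a throughput benchmark;
    never do that for real traffic.
    """
    aesgcm = AESGCM(AESGCM.generate_key(bit_length=key_bits))
    nonce = os.urandom(12)
    payload = os.urandom(packet_size)
    start = time.perf_counter()
    for _ in range(packets):
        aesgcm.encrypt(nonce, payload, None)
    return packet_size * packets / (time.perf_counter() - start) / 1e6


if __name__ == "__main__":
    # Compare 128-bit and 256-bit keys, relevant to the question above of
    # whether 256-bit GCM mode is actually needed.
    for bits in (128, 256):
        print(f"AES-{bits}-GCM: {gcm_throughput_mbps(key_bits=bits):.1f} MByte/sec")
```

Since `cryptography` calls straight into OpenSSL, this should approximate the "raw openssl C code" ceiling the goal refers to, minus some per-call Python overhead.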

whirm commented 8 years ago

@synctext I don't think this is related to 6.6

synctext commented 8 years ago

Download performance insight

Within the roadmap for high-performance privacy it is important to identify bottlenecks. We do not know if the CPU, IO, database IO, buffering, event handling, queue waiting, threading, or networking is the bottleneck.

The following experiment determines the crypto performance of the HiddenTunnelCommunity at the downloading side.

Step by step we build and measure the raw networking throughput, expressed in MByte/sec, in various setups. First is a minimal Dispersy NULL receiver, immediately dropping packets coming from several senders. Next is receiving the packet on the network thread and creating a Twisted event to handle it; again by simply dropping it. Third is using the on-data approach of the hidden community. The fourth experiment benchmarks a single layer of Tor decryption at the receiver. Finally, the full tunnel community.
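The first setup (the NULL receiver) can be sketched as a plain UDP socket that drops everything it receives while counting bytes. This is only an illustration of the measurement idea, not the Dispersy endpoint itself; the port number, packet size, and duration are arbitrary choices of mine.

```python
import socket
import threading
import time


def null_receiver(port, duration=2.0, bufsize=2048):
    """Bind a UDP socket, drop every packet immediately, return MByte/sec."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", port))
    sock.settimeout(0.2)
    received = 0
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        try:
            received += len(sock.recv(bufsize))  # drop the packet immediately
        except socket.timeout:
            pass
    sock.close()
    return received / duration / 1e6


def flood(port, duration=2.0, packet_size=1280):
    """A simple sender blasting fixed-size UDP packets at the receiver."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"x" * packet_size
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        sock.sendto(payload, ("127.0.0.1", port))
    sock.close()


if __name__ == "__main__":
    sender = threading.Thread(target=flood, args=(41000,))
    sender.start()
    print(f"NULL receiver: {null_receiver(41000):.1f} MByte/sec")
    sender.join()
```

The later setups would replace the body of the receive loop (Twisted callback, on-data handler, one layer of decryption, full tunnel), keeping the sender fixed so the numbers stay comparable.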

synctext commented 8 years ago

Connected to: https://github.com/Tribler/tribler/issues/2137, #2215 #2216, #2217

Note: 4th item for thesis: dramatic improvements/repair of the torrent checker.

synctext commented 8 years ago

Approach: our exit node helpers are maxed out on CPU. It would be extremely valuable to profile them, identify hotspots, test improvements, and benchmark the various changes in the field. This could move our donated network capacity from 9 TByte/day to perhaps even 15 TByte/day.

lfdversluis commented 8 years ago

I would love to profile them, yes. I already have a hunch where the bottlenecks are, but it would be great to have profiler output that can be visualized by e.g. Valgrind to verify and confirm the bottlenecks.

synctext commented 8 years ago

Anonymous downloads are quite slow. As of the v6.6 pre-release, with a solid i7 CPU core at 125% load, we get only 330 KByte/sec. This confirms @whirm's view: both the tunnel helpers and the anonymous downloader are CPU-constrained.

[graph: Tribler 6.6pre, full CPU, tunnel downloading in KBytes/sec]

qstokkink commented 8 years ago

I have made some measurements for the endpoint.py throughput per logging level: https://gist.github.com/qstokkink/57203ced56dccd95287193d43e7e6a34

Long story short: choosing the wrong logging level can cost you around 400 MB/s throughput.
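The measurement idea behind the linked gist can be sketched as follows: time a hot loop of per-packet log calls at different levels. This is my own illustration, not the endpoint.py benchmark itself, and the numbers it prints are machine-dependent.

```python
import logging
import time


def log_calls_per_second(level, calls=200_000):
    """Time `calls` logger.debug() invocations at the given level."""
    logger = logging.getLogger("endpoint-bench")
    logger.handlers = [logging.NullHandler()]  # discard output; measure call cost
    logger.propagate = False
    logger.setLevel(level)
    start = time.perf_counter()
    for _ in range(calls):
        logger.debug("%d bytes received from %s", 1280, "peer")
    return calls / (time.perf_counter() - start)


if __name__ == "__main__":
    # At WARNING the debug calls short-circuit cheaply; at DEBUG each call
    # builds a LogRecord, which is pure overhead on a per-packet hot path.
    for lvl in (logging.WARNING, logging.DEBUG):
        print(logging.getLevelName(lvl), f"{log_calls_per_second(lvl):,.0f} calls/sec")
```

At hundreds of thousands of packets per second, even a disabled-but-unguarded log call in the receive path eats measurably into throughput, which is consistent with the ~400 MB/s difference reported above.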

whirm commented 8 years ago

@qstokkink hah. Just yesterday I was talking about removing unnecessary log calls when the big cleanup happens with @devos50.

qstokkink commented 8 years ago

Ok, I finally have a design I am happy with and which will provide a significant speed increase.

The main issues I faced were the following:

  1. Data passing through Dispersy clogs up Dispersy and thus the main process.
  2. As per (1), simply offloading received packets to subprocesses will only provide some benefit in encryption/decryption and add overhead for forwarding these packets.
  3. Forwarding peers' circuits to child processes (ergo the set-up happens in the main process), with their own assigned port, would require a lot of extra logic in the Tunnel Community.

So the shocking design I came up with (working with Dispersy, instead of against it, for a change), is as follows:

This means that the main process controls the circuits (the amount, their statistics and stuff) but no longer needs to be the man in the middle for data passing through. In turn, this means that the main process's Dispersy is freed up and therefore becomes a lot faster. An added benefit of this approach is that it is a lot easier to implement than the options described in (2) and (3), leaving the TunnelCommunity almost completely the same.

These are the reasons why I believe this to be the superior design, instead of the one from (2) as discussed during the last group session/meeting. Feel free to provide comments or critiques.

synctext commented 8 years ago

wow, wait, what? @qstokkink So one Dispersy instance per community. So we fork a new Python process for each community, with their own IPv4 listen socket, own database, and walker.

Is the process doing the setup on a different listen port than the Tunnel community itself? How does this work with UDP puncturing? Are we still puncturing the correct listen socket?

Sounds quite interesting. I would suggest doing quick prototyping to get the Tunnel community in its own Dispersy process.

qstokkink commented 8 years ago

@synctext Correct: one TunnelCommunity for one Dispersy with one unique listen port for one subprocess/forked process. Note that this will also be the only community loaded for a subprocess.

This should work perfectly fine with the UDP puncturing.

Because all of the TunnelCommunity messages are sent directly through the endpoint with NoAuthentication(), the database of each subprocess is hardly used (only for Dispersy internals). That said, there is definitely a case to be made for having a shared set of walker candidates and a shared database between subprocesses, as you suggested. I will not implement this for the first version/prototype, however; the result should already be very dramatic without these optimizations.

devos50 commented 8 years ago

@qstokkink sounds like an interesting design. I'm curious to know how this works out. I'm a huge supporter of splitting Tribler in various processes, however, we should be careful that we are not overcomplicating communication between processes since I think it's important that (external) developers can easily modify the communication protocol used between the Dispersy processes.

Another advantage of splitting it up in processes is separation of concerns: developers can focus on a single part of the system (and in the future, even implement a small part of the system in another language).

With this design, we can utilise all these additional cores in the CPU 👍

qstokkink commented 8 years ago

@devos50 If anything, this should even simplify communications. The three messages being transferred are the following:

  1. Main process to subprocess: create(<>)
  2. Subprocess to main process: stats(<>)
  3. Main process to subprocess: destroy()

That's it.

EDIT: Sorry, there is a fourth: the notifications.
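The control protocol above is small enough to sketch. The message names follow the list, but the payloads and the queue-based transport are my own illustration, not the actual implementation:

```python
from multiprocessing import Process, Queue


def tunnel_worker(inbox, outbox):
    """Subprocess side: reacts to the control messages and emits notifications."""
    circuits = {}
    while True:
        msg = inbox.get()
        if msg[0] == "create":                 # 1. main -> sub: create(<>)
            circuit_id, hops = msg[1], msg[2]
            circuits[circuit_id] = hops
            outbox.put(("notify", "circuit_ready", circuit_id))  # 4. notification
        elif msg[0] == "stats":                # 2. sub -> main: stats(<>);
            outbox.put(("stats", {"circuits": len(circuits)}))   #    here on request
        elif msg[0] == "destroy":              # 3. main -> sub: destroy()
            break


if __name__ == "__main__":
    inbox, outbox = Queue(), Queue()
    worker = Process(target=tunnel_worker, args=(inbox, outbox))
    worker.start()
    inbox.put(("create", 1, ["hop_a", "hop_b"]))
    inbox.put(("stats",))
    inbox.put(("destroy",))
    worker.join()
    print(outbox.get())  # notification for circuit 1
    print(outbox.get())  # stats: one circuit
```

The point of the design is visible even in this toy: only low-rate control traffic crosses the process boundary, while the bulk data never touches the main process.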

synctext commented 8 years ago

morning, sooo.... When selecting hops for a tunnel, we need a UDP-punctured connection to it. Thus the Tunnel community needs to send all Create messages from its own externally visible IP:port, correct?

qstokkink commented 8 years ago

@synctext Correct, each subprocess will need to be punctured separately. If this is really a problem, the design can also use a single port: some very nasty code in endpoint.py already takes care of this.

Do note that using the single port will already hit the performance pretty hard.

synctext commented 8 years ago

soooo v2... The Tunnel community has a certain set of UDP-punctured peers. It can tunnel through them. The main Dispersy then needs to either take that into account when selecting hops, or let the Tunnel community do everything autonomously.

synctext commented 8 years ago

ohhhh. Just had an important realisation. Documenting it here.

Tunnel community is special. It moves bulk data. No other community needs that or does that.

Cell packet: 5MBytes per second. Walker: once per 5 seconds!

This means the Tunnel community is a single special case. Normal communities battle with correctness and code complexity. Agree?

qstokkink commented 8 years ago

@synctext Agreed. This is also why it is so disastrous to have it use the same port as everything else.

And to answer the former post: this is why the main process does not join the Dispersy community, but each subprocess does everything autonomously.

qstokkink commented 8 years ago

For anyone keeping tabs on this, I will update this checklist with the progress of this implementation:

synctext commented 8 years ago

Fascinating usage of the Session construct

qstokkink commented 8 years ago

Since process management has become somewhat complicated, I will drop off some documentation here.

Creating children

In Python there are several ways to create children (i.a. subprocess, twisted, raw os calls and multiprocessing), which all boil down to using os.fork() or one of the os.spawn*() family. Most of these constructs require you to reinvent the wheel and build your own serialization and pipe management. The multiprocessing library takes care of this, however, and lets you use standard constructs like Queues. This makes the multiprocessing library the best choice from a code-management perspective. The underlying architecture of the multiprocessing library uses os.fork(), however.

Getting forked

What a fork does is create a child process which has copy-on-write memory sharing with the parent process. This is rather convenient for instantiating and sharing objects with child processes. However, with the huge codebase of Tribler and all its intricate workings, this blocks child processes from functioning, because singletons are shared (including the Twisted reactor). This means we cannot fork on demand and simply clear the child process's memory (without performing intricate reloading of modules, which is not scalable at all). A fork therefore needs to be created when there are no singletons loaded, no complex classes initiated, etc.: ergo, at import time.
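A tiny demonstration of the pitfall, with a dict standing in for Tribler's singletons and the Twisted reactor (POSIX-only, since it uses os.fork() directly):

```python
import os

# Module-level "singleton", a stand-in for the reactor, databases, etc.
SINGLETON = {"reactor_running": False}


def demo():
    SINGLETON["reactor_running"] = True  # heavy state exists before forking
    pid = os.fork()
    if pid == 0:
        # Child: inherits the parent's singleton state via copy-on-write;
        # a freshly forked Tribler child would likewise inherit the reactor.
        ok = SINGLETON["reactor_running"]
        SINGLETON["reactor_running"] = False  # mutates only the child's copy
        os._exit(0 if ok else 1)
    _, status = os.waitpid(pid, 0)
    assert status == 0                   # the child saw the inherited state
    assert SINGLETON["reactor_running"]  # the parent's copy is untouched
    return True


demo()
```

This is why forking after Tribler has fully started yields children that believe a reactor is already running, and why the fork has to happen before any of that state exists.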

When Mother forks and creates children

Since we cannot say beforehand how many child processes we want, we cannot create a static number of child processes at import time while still allowing the number of children to scale. To allow for properly scaling behavior we require a template process, loaded at import time, which can be forked later. We will refer to this template process as the Mother process. The Mother process does nothing but create and manage children.

We now have a situation where the Mother can create copies of itself with a small footprint, but it has no knowledge of when the main process needs forks. Thus, on top of the Mother, we need another layer which initiates the forking. In other words, we need a manager of a manager. See the following sequence diagram for the interactions:

[sequence diagram: tunnel process interactions]

Lastly, there is one implementation detail to discuss. It would be possible to forward process handles in such a way that the Mother process only acts to spawn processes. In this case there is only one managing process (the main process). This would involve some unintended use of the multiprocessing library and therefore I left it out of the implementation for now.

The resulting architecture

What we are left with are the following components: the PooledTunnelCommunity uses a WorkerManager, which controls the ProcessSpawner (the Mother), which spawns Subprocesses, each running an autonomous multichain-enabled HiddenCommunity for a given multichain permid. All of the stats and notifications generated by the bottom layer of zero-or-more HiddenCommunities are relayed to the top-level PooledTunnelCommunity for aggregation. To the main process it therefore looks like it has a normal HiddenCommunity, generating notifications and stats, except that it no longer processes the raw data pushed through its socket.

[diagram: the resulting tunnel architecture]

As a final note: each of the autonomous Subprocesses runs a Session, but not a full-blown Session such as the main Tribler process runs. It is a severely limited Session which only provides HiddenCommunity interactions. Why not load just this community through Dispersy, then? Because you can't run a HiddenCommunity without a Session: that would require major refactoring inside the HiddenCommunity itself. Right now the design is completely compatible with the current architecture.

P.S. I left out some technicalities from this post to keep it simpler. This is just to give a high-level impression. P.P.S. Please excuse the crude drawings.

synctext commented 8 years ago

quite impressive

qstokkink commented 8 years ago

First milestone https://github.com/qstokkink/tribler/commit/3611014b1d9656ae7c08862cfed165fbe639a6be, things are coming together. This version will run passive autonomous packet relayers for the HiddenCommunity (one per process).

It can't create its own circuits yet, but this does show that everything will work as described in my previous (long) post.

EDIT: I also hooked up the bartercast statistics now, so you can get a feel for how it works. [image: tunnel up/down statistics]

EDIT: Circuit building is up and running, only download forwarding to go. [image: circuit building]

EDIT: Some more unnecessary hype building. [image: tunnels success]

lfdversluis commented 8 years ago

Very interesting read @qstokkink, I didn't know that a fork would also copy the singletons. I assume that Twisted subprocesses then also make use of the singleton reactor? Weird though, because that would mean that two processes running two Python interpreters are using only one reactor...

Edit: I think you should also include this page in the real documentation on readthedocs once it's final; this issue might get 'lost' once it gets closed.

qstokkink commented 8 years ago

@lfdversluis Well, the two processes are using what appears to be the same reactor, until the subprocess tries to alter the memory (reading the other, unaltered memory of the reactor is still possible). This leads to really strange behavior.

Also, that's a good point about the readthedocs :+1: .

qstokkink commented 8 years ago

Some shameless self-promotion here. Everything seems to be hooked up properly (read: barely functional) and is ready for testing! [image: progress]

lfdversluis commented 8 years ago

@qstokkink I heard you had a nasty bug, how is it going with that?

synctext commented 8 years ago

quite eager to see the new download performance numbers. Can we do 5 MB/s?

qstokkink commented 8 years ago

@lfdversluis that's with the protobuf port; I still don't know what is causing it (although I'm 99% sure it's only a cosmetic/non-critical bug).

@synctext I'll have to set up a gumby experiment to test the new capabilities.

qstokkink commented 8 years ago

Some results from the 1-gigabyte-1-hop experiment (5 nodes, 20 processes): [graph: performance] This stabilized at around 550 kbps, although I have downloaded a well-seeded 350 MB file in the wild in about two minutes.

At this point the status is:

devos50 commented 8 years ago

Good work! Do you get feedback about progress when using the API (/downloads endpoint)? Maybe you can use that to visualise the progress of a download?

qstokkink commented 8 years ago

@devos50 Well, the problem is that the subprocess maintains its own total up, total down, etc.: so the progress data exists, but the main process doesn't know about it.

This is a matter of forwarding all of that information to the main process. Sadly, this doesn't happen in one spot, so it is mostly a matter of figuring out where data is being pulled from (e.g. some from notifications, some directly from the TunnelCommunity, some from Circuits, some extracted from LibTorrent, etc.).

Ideally these stats would only be shared between the community and the rest of Tribler through notifications (on update) and some stats object tied to the community. But, this is not the case.

qstokkink commented 8 years ago

Upon inspection of the open issues, the current thing I am stuck on is exactly issue #2566. I suppose the focus of this project will now shift to refactoring all of the code into elegant, industry-standard code. Right now the implementation is so horrible that it really can't be merged into Tribler.

qstokkink commented 8 years ago

Alright, the refactoring of the original code is now in a functional state.

Preliminary results are as follows, for a quad-core experiment in the wild with 16 _mincircuits and 32 _maxcircuits:

Roughly speaking, we can run twice as many circuits on a core not running the main Tribler process. So for a machine with n cores, the relative number of circuits C which can be run (and thus more or less the speedup), compared to running n tunnel helpers, is roughly: C = 2 - 1/n
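A quick worked example of that estimate: one core still runs the main Tribler process (1x circuits), while each of the other n-1 cores runs roughly 2x, so relative to n plain tunnel helpers:

```python
def relative_circuits(cores):
    # One core at 1x (main process) plus cores-1 cores at 2x, divided by
    # the baseline of `cores` tunnel helpers at 1x each:
    # C = (1 + 2*(cores-1)) / cores, which simplifies to 2 - 1/cores.
    return (1 + 2 * (cores - 1)) / cores


for n in (1, 2, 4, 8):
    print(f"{n} cores: C = {relative_circuits(n):.2f}")
```

So a quad-core machine runs roughly 1.75x the circuits of four plain tunnel helpers, and the benefit approaches 2x as the core count grows.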

devos50 commented 8 years ago

@qstokkink maybe it's a good idea to create an (short-running?) experiment on the DAS5 with many nodes so you can compare CPU usage/download speed more easily across the devel branch and your work?

Also, do you have an estimate about when we can expect a PR?

qstokkink commented 8 years ago

@devos50 Good point on the experiment: I still have an experiment for the old code, with a little luck I can re-use it for this new code.

As far as the PR goes:

  1. I still need to make sure all the files/functions are properly commented and do a style check.
  2. I also need some way to enable/disable the pooling of the tunnelcommunity in the GUI. @devos50 What's your advice on how/where to do this?
  3. I need to merge in the latest devel changes (I'm currently using some custom fixes for issues that are already fixed in the current devel).

In short: nothing major. I estimate next Tuesday if you don't want to wait for the gumby experiment and next Thursday if you do.

qstokkink commented 8 years ago

To elaborate on point no. 2: this interferes/interacts with #2153 and #2564 .

devos50 commented 8 years ago

@qstokkink you want to add a configuration option to the GUI and tribler config file right?

qstokkink commented 8 years ago

@devos50 yes, I believe it to be unwise to enable this by default

devos50 commented 8 years ago

@qstokkink new features will be disabled by default anyway until they are stable and well-tested. I think the easiest solution for you is to create PR on #2153 and #2564 so your changes will be included when this PR is merged. You can already create the configuration option even though your main PR is not merged yet.

qstokkink commented 8 years ago

@devos50 with 'create a PR on #2153 and #2564', do you mean: merge those changes into my branch? Or is there some git black magic which allows you to create PRs on PRs?

devos50 commented 8 years ago

@qstokkink you can create a PR on a branch of my fork of Tribler (i.e. https://github.com/Tribler/tribler/compare/devel...devos50:devel)

qstokkink commented 8 years ago

@devos50 alright, I assume you would like it on the single_config branch?

devos50 commented 8 years ago

@qstokkink yes, that's correct. Please note that I can't give guarantees about when this PR will be merged since I'm fixing other, more major bugs first ;)

qstokkink commented 7 years ago

Finished the comments/style corrections and the sanity check. Almost there... I might even have the PR done by tonight. The TODO list has become quite short.

devos50 commented 7 years ago

@qstokkink looking forward to it!

Note that you don't have to squash everything into a single commit. Please make logical units of work and make sure your commit history is clean (consisting of distinct changes, so no fixes on fixes).

qstokkink commented 7 years ago

Whoops, not happening tonight, I accidentally merged in some devel changes... :cry: https://github.com/devos50/tribler/compare/single_config...qstokkink:tunnelprocess_rc : I'll have to fix it up tomorrow.

qstokkink commented 7 years ago

Ok, I know this needs no further hyping up, but this is pretty cool: [graph: performance] EDIT: Oh, and I almost forgot to mention: this is casually running 18 circuits.

synctext commented 7 years ago

Repeating: https://github.com/Tribler/tribler/issues/2106#issuecomment-247278053 Thesis material:

Together: Multi-core architecture for anonymous Internet streaming

First priority: PooledTunnelCommunity stable :clap:

synctext commented 7 years ago

Further future (by Yawning Angel): https://github.com/Tribler/tribler/issues/1066#issuecomment-68088819