end-to-end anonymous seeding and download performance test

synctext commented 8 years ago

Privacy with usability is the core goal of our ongoing 11 year effort.

This end-to-end test is designed to identify problems in our whole Tribler chain, from session key exchange to rendezvous peer discovery.

Scenario:

Create a new file, generate magnet, and seed it
Start anonymous download on different machine and seed anonymously
repeat up to 8x the start download, start 1 new downloader every minute.
Disconnect the original public seeder, after the 10th minute in the experiment
Start an anonymous download on a different machine using only a hash
Observe 1) speed of discovery of hidden seeds, 2) PEX mechanism, and 3) total download speed

synctext commented 6 years ago

This 2016 issue has overlap with newer #3325 issue.

ichorid commented 6 years ago

I've created and ran a local performance test based on LXC containers. The test setup is: seeder <=> router <=> leecher

Seeder is libtorrent clien_test utility.
Router is tunnel_helper Tribler plugin for Twisted
Leecher is a leecher.py from latency-testing repo

My machine is Xeon W-2133 3,6 GHz.

Direct download (without router) is 150 MBytes/sec. Routed anonymous download is 1,2 MBytes/sec. Router Twisted process hogs 125% of CPU, distributed between several physical cores. Leecher/seeder CPU usage is negligible.

Our anonymous download speed is completely obliterated by Python performance on the router.

EDIT: Leecher CPU usage is non-negligible. It is about 107% CPU. Thus, it is only 15% smaller than that of the router.

ichorid commented 6 years ago

To estimate the limits of Twisted networking performance, I did a small packet forwarding test with LXC containers and nc | pv | dd method: nc <= twisted forwarder <= nc. As an alternative forwarder, I used nc | dd | nc.

	/dev/zero	/dev/urandom
`nc <= twisted <= nc` TCP	320 MBytes/s	180 MBytes/s
`nc <= twisted <= nc` UDP	120 MBytes/s	120 MBytes/s
`nc <= nc <= nc` TCP	300~500 MBytes/s	180 MBytes/s
`nc <= nc <= nc` UDP	270 MBytes/s	170 MBytes/s

It is interesting to note that nc|dd|nc sandwich uses almost 250 % CPU in total, in a kind of functional parallelism. Twisted, on the other hand, uses only a single CPU - always at 100%. From this table, it is obvious that Twisted UDP performance is the bottleneck in this test. However, we would be very happy to attain even 1/10th of it.

synctext commented 6 years ago

Great to hear Python+twisted does not need replacing.

ichorid commented 6 years ago

Another simple experiment: CPU usage when bombing Tribler tunnel_helper exit node with packets. Performance of nc -u -l | pv | dd of=/dev/null and simpler Twisted-based, do-nothing socket listener are measured to estimate baseline performance:

	nc+pv+dd	Twisted listen	Tribler abcd*600	Tribler ffffffff+abcd*600	Tribler ffffffffffffffe+abcd*600
1000 Mbit/s	170%	93%	172%
100 Mbit/s	27%	34%	172%	172%	185%
50 Mbit/s	20%	20%	172%	172%	185%
25 Mbit/s	9%	10%	172%	172%	185%
10 Mbit/s	4%	5%	62%	60%	96%
5 Mbit/s		5%	31%	32%	50%
1 Mbit/s		1,3%	8%	8%	10%

CPU performance is given as the sum of CPU usage of all relevant subprocesses as printed by htop. Tribler's 172% is the sum of 11 python subprocesses of ~9% each and a single process of ~40%. (BTW, where the hell the remaining ~40% hide ?)

Tribler was tested with 3 different types of packets: "random" packet of abcd*300, and packets with two types of headers for tunnel community. Tribler tunnel helper was not alternated in any way, so it just listend to the port, picks up packets, tries to parse them, sees them being garbage and drops them.

With ff..fe header Tribler tries to decode the packet even for unknown ID (that's already a potential problem). Hitting 96% of CPU at around 1 Mbytes/s of throughput corresponds well with my previous test of Tribler tunnels. That test showed ~125% of CPU load at 1,2 Mbytes/s (though, the exit node was both receiving and sending packets).

Aside from the potential pitfall of exception handling, it is highly improbable for a network application to demonstrate better performance when it both receives and sends packets, than when it just receives packets. Therefore, the performance bottleneck in the path of a packet lies somewhere between socket input and tunnel_community.on_data() method.

ichorid commented 6 years ago

One bottleneck identified: each active community adds about 50% of CPU load at 100Mbit/s UDP bombing test. The problem is, Endpoint.notify_listeners() notifies each listener via broadcast:

    def notify_listeners(self, packet):

        for listener in self._listeners:
            if listener.use_main_thread:
                reactor.callFromThread(self._deliver_later, listener, packet)
            elif reactor.running:
                reactor.callInThread(self._deliver_later, listener, packet)

This is very inefficient because each community (listener) does its own pattern matching. It is improbable that a single packet would be addressed simultaneously to several communities at once. And even in this case it should be processed only by the targeted communities.

We need to implement a central tree-like packet demultiplexer, or something like that.

ichorid commented 6 years ago

A quick hack that stops notifying all communities except hidden tunnel community after 15 seconds of running Tribler increases torrent download speed from 0,9 Mbyte/s to 3,9 Mbyte/s. (Hack should be applied to both leecher and exit node.) Exit node's CPU usage drops to 98%. Of all Twisted threads only one still generates 20% CPU load, and other threads stay idle. The rest of the load is generated by the main thread, I presume.

Apparently, we now hit another bottleneck.

ichorid commented 6 years ago

Another experiment: measure pure bandwidth of exit node without leecher, in the direction seeder->exit_node->leecher. (Community bypass hack enabled!) To do that, we first disable method for taking down circuits, so the established circuit/tunnel would last forever. Next, we initiate the usual seeder<-exit_node<-leecher, so the circuit would be initiated properly by the exit node. We sniff the exit port of the circuit Twisted protocol. Now we take down both seeder and leecher client apps, so only the exit node Tribler instance is there. And finally we start bombing the exit node with UDP packets from the seeder side, pretending to be the seeder app. We do it by using correct (sniffed) src and dst UDP ports. On the both seeder and leecher we run bmon to see the actual bandwidth. All of this is possible because UDP is stateless.

The result: Seeder bombs with 139 MByte/s, Leecher receives 21,4 Mbytes/s. Exit node at 100% CPU. Verdict: our next bottleneck is at the leecher's side. screenshot from 2018-04-13 16-13-12

Some screenshots from Kcachegrindin' a yappi profile of the exit node in this mode:

Giant "unknown" blob: exit_prof

Contents of the blob: exit2 exit3

So, now its mostly SSL stuff.

synctext commented 6 years ago

Impressive progress! You discovered an architectural violation. Communities should be completely isolated from each other and thus only 1 community should process 1 incoming UDP packer. Sadly, I don't know if AllChannel or other communities critically depend on this feature, which is a bug/hack.

Smart twist in this experiment design! Seeder is replaced by a UDP sender with a fixed test pattern that does not do any congestion control. Fixed rate blasting at the exit node, which then uses full onion encryption to deliver the test pattern to the leecher. Leecher attempt to decypher the incoming data into Tor-like encrypted Bittorrent traffic.

What would 4 cores running 4 seeders with 4 cores with 4 leechers do? The single exit node and real congestion control would hint at bottleneck.

ichorid commented 6 years ago

Experiment: multiple leechers vs multiple seeders:

1 leecher x 2 seeders: 4,3 Mbytes/s (exit node CPU 98%)
1 leecher x 4 seeders: 4,5 Mbytes/s (exit node CPU 98%)
2 (1+1) leechers x 4 (2+2) seeders (2 independent torrents): 5,3 Mbytes/s (exit node CPU 120%)

These results show that we have bottlenecks both in the leecher and the exit node. Most probably, there are 2 bottlenecks: one in the direction of upload (ACK packets from leecher to seeder), and another one still in the direction of download (DATA packets from seeder to leecher).

ichorid commented 6 years ago

A superficial analysis of profiler charts revealed another potential problem: we recreate the crypto object every time we need to encrypt or decrypt something. This includes a very costly operation of cipher initialization in OpenSSL, that is responsible for 17% of total CPU usage by the exit node, according to the profiler.

UPDATE: we're using GMAC mode with AES encryption, so if we want to have each packet independent of each other, we must use different IV for every packet/encryption. No low-hanging fruit there. Still, constant calls to _CipherContext.__init__ seem suspicious. Maybe we're not doing it properly, and there is room for improvement. For example, latest versions of SSL recommend using the compound AESGCM procedure instead of setting up AES+GCM by hand.

ichorid commented 6 years ago

Profiler screenshots showing mysterious "cycle 10" encompassing 80% of CPU ticks (probably, Twisted related): c10 c10-1

EDIT: according to @devos50, "cycle 10" is Twisted itself. Mystery solved.

synctext commented 6 years ago

mental note: earlier discovery was that Twisted "thread pools" cost performance. By moving to main-thread, numerous context switches are avoided and performance could also double here. @qstokkink @devos50

ichorid commented 6 years ago

Currently, we poll sockets ourselves in an endless loop. This is clearly not the thing Twisted developers were expecting from their users, since Twisted has facilities like DatagramReceived and LoopingCall. Most probably, our "endless" loop severely messes up Twisted polling schedule. In a simple experiment I replaced UDPEndpoint._loop with LoopingCall() (0.001 sec. period). Performance of 1 seed vs 1 leech increased to 4,5 Mbytes/s, and performance of 2+2 seeds vs 1+1 leeches increased to 7,5 Mbytes/s. On the other hand, the " UDP bombing test" performance dropped to 2 Mbytes/s! We're doing something wrong with polling it manually. Remind that pure Twisted DatagramReceived based forwarding was capable of pushing 120 Mbytes/s. It is clear that there is room for improvement here.

ichorid commented 6 years ago

@synctext, tried that. Disabling "thread pools" does not affect performance in my tests, but removes "unknown" objects in profiler, and makes profiling Twisted manageable.

qstokkink commented 6 years ago

Your comments touch upon a great many Python and Twisted nuances and intricacies, let me try and organize this into a shortlist:

The UDPEndpoint does not run on a Twisted thread, but on a (built-in) Thread.
The UDPEndpoint does not use a socket managed by Twisted, rather it uses a (built-in) socket.
By using a LoopingCall you are scheduling the socket send/receive logic onto the (Twisted) main thread. This may or may not be the Python entry thread.
By doing this you are avoiding expensive context switches into the Twisted threadpool, but, this also means that a busy socket will slow down all of the other application logic. This is your dip in throughput under load.
In the Twisted back-end the Twisted threadpool is just a fixed set of (built-in) Thread objects, 20 by default, onto which contexts are loaded and run. As we have previously discovered, this is painfully slow. This packing into little balls of context is what is causing your unknown origin processing.
The native Python socket implementation can actually handle well over 800 MB/s.

Final points:

None of this would be a problem if Python had real threading instead of this GIL stuff.
Twisted is a performance killer. Don't let it fool you.

ichorid commented 6 years ago

@qstokkink, thanks for a detailed explanation! But why would we even have UDPEndpoing in a built-in thread, if context switching makes our application (at least exit node) slower? What's the point? We don't want 800 MB/s of theoretical throughput in a single part of our application if the rest of the application can't make more than 4 MBytes/s. We want a single thread of our application to do about 20-50Mbytes/s: that's enough to saturate any modern broadband connection.

qstokkink commented 6 years ago

@ichorid we did not have the time, nor the indication that we should do this before. If you want, you can try implementing a full Twisted-based endpoint.

ichorid commented 6 years ago

UDP bombing with crypto disabled: 70 MBytes/s on the exit node. Regular download with crypto disabled: 7,5 MBytes/s. Something fishy happens on the leecher side...

ichorid commented 6 years ago

While UDP bombing the exit node, its memory usage remains constant. However, the leecher's memory usage grows very fast, as if it internally buffers all the packets it can't process. Even when the stream of packets ends, the leecher's CPU usage remains 100% for a long time, until it finishes processing the buffered data. Its memory usage drops somewhat along the way. This happens even with crypto disabled. Another interesting detail is that the "bomb" packets are received and buffered by leecher, but they are not sent down through the socks proxy to libtorrent - there is no traffic on the loopback interface.

qstokkink commented 6 years ago

I'm not sure, but I think the packets not arriving at libtorrent is expected due to the data format.

The UDP packets being sent directly to the seeder is a high priority bug.

ichorid commented 6 years ago

OK, I think I got how UDP packets could go back to the seeder directly. When we inject packets back to the router node with IPv8 header, it would happily accept it because of its could_be_something() filters check everything in a single place, on the packet entry. We would expect the router node to at least check that uTP packets enter only through the BT port, and IPv8 packets enter only through overlay listener port. However, they somehow go through. This could happen because we use system calls and poll sockets ourselves, not using Twisted's Protocol facilities. And we start exit connections using these facilities. After receiving such packet, the exit node sends it back encrypted to the leecher, which accepts it, decrypts it, interprets it as IPv8 and send "back" to seeder. It would take more than mental debugging to investigate how that happens, but I guess its the Dispersy legacy in the form of on_packet call from on_data method.

ichorid commented 6 years ago

On GIL

synctext commented 6 years ago

Porting to Python 3.6 would also help I think.

ichorid commented 6 years ago

@synctext, it definitely could, but there are some concerns (page 54) about the new GIL implementation in 3.2. The author of that presentation shows that for the corner case of "1 heavy IO thread + 1 heavy CPU thread" the performance of 2 threads on 2+ core system could drop up to 330x times! This problem is probably solved in later versions of Python, but it shows us that there is more to it than just porting to a newer version and hoping it would magically work. We need to discuss this stuff thoroughly and consolidate our knowledge before taking any action.

ichorid commented 6 years ago

The bottleneck causing the memory leak/blowout is hiding somewhere behind self.transport.write(data, self.remote_udp_address)

So, it's Twisted problem.

qstokkink commented 6 years ago

@ichorid that sucks, a lot. I don't think there is any easy way to change our Socks5 implementation.

ichorid commented 6 years ago

Leak solved. It was callFromThread(self._deliver_later) in notifiy_listeners We do not use it by default, that's why we had never seen it. On the other hand, different ways of calling _deliver_later result in different performance levels (leech results are for localhost traffic, i.e. out through SOCKS):

Way of calling `self._deliver_later` \ bandwidth	leech plain Mbytes/s	leech encrypted Mbytes/s	router plain Mbytes/s	router encrypted Mbytes/s	Leak?
`callInThread`	3,3	2,0	70,0	22,5	No
`callFromThread`	13-20	4,2	77,0	22,5	Yes
Direct call	31,2	14,6	78,0	22,5	No

Note, that this 14,6 Mbytes/s is not the libtorrent download performance. It is "UDP bombing" performance. Torrent downloading does 4,5 Mbytes/s with these modifications (libtorrent 1.1.6, without uTP window size fix. The fixed version does not affect the performance, though.).

qstokkink commented 6 years ago

Does this leak happen if you add:

if len(reactor.getDelayedCalls()) > 20:
    return

Here: https://github.com/Tribler/py-ipv8/blob/bd8423e9a89eb33b084c2e7990d326bd09a4969e/ipv8/messaging/interfaces/endpoint.py#L51

?

synctext commented 6 years ago

Brainstorm.., towards microarchitecture. Is there a twisted alternative? Not this one, https://github.com/MagicStack/uvloop/issues/14

ichorid commented 6 years ago

@qstokkink, when I add this line, the download starts and dies almost immediately.

ichorid commented 6 years ago

@synctext, it's really a shame that uvloop does not work as a drop-in replacement of reactor loop with Windows ((( However, this fact does not stop us from using libuv for implementing a C-based endpoint, or using uvloop to speed-up Linux Python 3.6+ performance, etc. According to authors, uvloop does not provide any API, it is just a more efficient replacement for the Python 3 core reactor.

ichorid commented 6 years ago

We have seen good enough benchmark performance for encrypted tunnel throughput (14 Mbytes/s). However, torrent download speed is 4 times worse. Why?

egbertbouman commented 6 years ago

Just got this when doing direct calls:

Traceback (most recent call last):
  File "C:\Users\Egbert\Desktop\git\tribler\TriblerGUI\event_request_manager.py", line 127, in on_read_data
    raise RuntimeError(json_dict["event"]["text"])
RuntimeError: [Failure instance: Traceback: <type 'exceptions.RuntimeError'>: dictionary changed size during iteration
C:\Python27\lib\site-packages\twisted\internet\base.py:1199:run
C:\Python27\lib\site-packages\twisted\internet\base.py:1208:mainLoop
C:\Python27\lib\site-packages\twisted\internet\base.py:828:runUntilCurrent
C:\Python27\lib\site-packages\twisted\internet\task.py:239:__call__
--- <exception caught here> ---
C:\Python27\lib\site-packages\twisted\internet\defer.py:150:maybeDeferred
C:\Users\Egbert\Desktop\git\tribler\Tribler\pyipv8\ipv8\messaging\anonymization\community.py:217:do_circuits
C:\Users\Egbert\Desktop\git\tribler\Tribler\pyipv8\ipv8\messaging\anonymization\community.py:242:do_remove
C:\Users\Egbert\Desktop\git\tribler\Tribler\community\triblertunnel\community.py:327:remove_circuit
C:\Users\Egbert\Desktop\git\tribler\Tribler\community\triblertunnel\dispatcher.py:87:circuit_dead
C:\Users\Egbert\Desktop\git\tribler\Tribler\community\triblertunnel\dispatcher.py:87:<genexpr>
]

qstokkink commented 6 years ago

The dictionary size change is due to multiple "threads" accessing the same dictionary. We could use something like in IPv8:

lock = RLock()
def synchronized(f):
    def wrapper(self, *args, **kwargs):
        with lock:
            return f(self, *args, **kwargs)
return wrapper

And then annotate the class methods with @synchronized.

ichorid commented 6 years ago

That's one option. Another one is to investigate what lines access the dictionary and try to remove that access completely. Or shift all the usages into a single "thread". BTW, this would help with further isolating IPv8 and preparing it to move to a separate process.

ichorid commented 6 years ago

OK, the communication scheme between old Dispersy listener and new IPv8 listener is completely broken and brain-dead . It messed up the results of my old experiments. When I correctly disable Dispersy and enable LoopingCall()-based endpoint on both router and leecher the performance goes to 8 Mbytes/sec! It is important to note that broadcast fix has no effect now, and UDP bombing nets stable 22/11 Mbytes/sec router/leecher performance.

ichorid commented 6 years ago

Different call methods effect with LoopingCall endpoint :

Way of calling `deliver_later()	leech encrypted MB/s	router encrypted MB/s	Leak
`CallInThread`	1,3	21,5	No
`CallFromThread`	11	22	No
Direct call	13	22	No

So, it's safe to use CallFromThread, since there is no leak now.

Indeed, Twisted is magic.

qstokkink commented 6 years ago

@ichorid Cool: you just fixed 10-year-old crappy code; when are we getting the PR?

synctext commented 6 years ago

8 Mbytes/sec! as leecher. Impressive. PR and Jenkins job would be most welcome. We could install a rule that future code contributions should not reduce anon download speed. On all supported platforms have performance regression tests.

ichorid commented 6 years ago

I have just reimplemented IPv8 endpoint with Twisted's DatagramProtocol. The results are basically the same as in previous experiment, except the performance is 10% worse, and CPU is 100% only when there are actual packets going through the system.

p.s. GUI-based Tribler instantly jumps to 6.1 Mbytes/s download speed on Ubuntu 18.04 torrent! Yay!

ichorid commented 6 years ago

Screenshot of the day:

synctext commented 5 years ago

Latest scientific related work in this area, from OSDI 2018 (still server-based; not suitable for Youtube-like systems):

Karaoke: Distributed Private Messaging Immune to Passive Traffic Analysis Karaoke achieves high performance by addressing two challenges faced by prior systems. The first is that differential privacy requires continuously adding noise messages, which leads to high overheads. Karaoke avoids this using optimistic indistinguishability: in the common case, Karaoke reveals no information to the adversary, and Karaoke clients can detect precisely when information may be revealed (thus requiring less noise). The second challenge lies in generating sufficient noise in a distributed system where some nodes may be malicious. Prior work either required each server to generate enough noise on its own, or used expensive verifiable shuffles to prevent any message loss. Karaoke achieves high performance using efficient noise verification, generating noise across many servers and using Bloom filters to efficiently check if any noise messages have been discarded. These techniques allow our prototype of Karaoke to achieve a latency of 6.8 seconds for 2M users.

Like Stadium, Karaoke is distributed over many machines, and must ensure that malicious servers do not compromise privacy. Stadium uses zero-knowledge proofs (e.g., verifiable shuffles) for this purpose, whereas Karaoke relies on more efficient Bloom filter checks.

ichorid commented 5 years ago

From https://magic.io/blog/uvloop-blazing-fast-python-networking/

uvloop on Python vs Go and Twisted: Raw TCP

uvloop on Python vs Go: HTTP echo server uv2

From https://www.nexedi.com/NXD-Document.Blog.UVLoop.Python.Benchmark: uvloop on Python vs Go: HTTP server with some minimal data processing pyuv_vs_go

This tells us the bottleneck is typically not the reactor loop, but Python itself.

synctext commented 5 years ago

How would these graphs look with UDP usage?

UVloop with 80.000 req/sec at 10 KiB = 800 MByte/sec, correct?

ichorid commented 5 years ago

@synctext , yes, that is 800 MBytes/sec. It is known that in simple networking benchmarks Go is typically 2-3 times slower than bare C/C++. That correlates well with other results obtained with libuv, e.g. CJDNS (~1GByte/s without auth/encryption). It is hard to say what the numbers would be if UDP was used instead. However, I'm sure the proportion would stay the same.

synctext commented 5 years ago

Please post a performance picture of your solution for #4567 within this issue @dsluijk @NULLx76.

synctext commented 3 years ago

We lack a structured performance evaluation suite. The content of #3325 is not available in easy to use dashboard #4999. This old 2016 issue is still relevant. We played around with various Twisted setting, now we are in the Python3 Async era. No idea if we're still at the 6.1 MByte/sec mark from May 2018, shown above. planning. After the channels plus search made progress, this is next to finish and not visit again for a while.

Related work, see the beautiful writeup at the top NSDI conference. Zero-deployment of these ideas, just simulation code https://github.com/tschorsch/nstor Hard to beat that with our hard work.

xoriole commented 9 months ago

Since @egbertbouman is the responsible person for the new tunnels upgrade using Rust and there seems already good signs of progress there, I'm unassigning myself and assigning him.

ichorid commented 3 months ago

BoringTun is a userspace Wireguard library written in sans-io style. This means it will be incredibly easy to plug it in to use IPv8 as the transport.

So, why not switch our home-grown tunneling solution to the de-facto industry standard for little to no development cost?

Tribler / tribler

end-to-end anonymous seeding and download performance test #2548