Allow Zeronet python to use multiple CPU threads?

slrslr commented 5 years ago

This is issue especially on zeronet proxies or when simply one is hosting many sites. You can see the zeronet is maxing out the CPU thread and leaves rest CPU threads unused.. I experienced this on all Linux distributions on which i tried to setup zeronet with bigger number of zites and i think it may be lagging my Windows zeronet too. Here is what largest zeronet proxy maintainer (d14n@) says about this.

Can ZeroNet be made to utilize more CPU threads?

https://stackoverflow.com/questions/4496680/python-threads-all-executing-on-a-single-core https://stackoverflow.com/questions/203912/does-python-support-multiprocessor-multicore-programming https://wiki.python.org/moin/GlobalInterpreterLock https://docs.python.org/3/library/multiprocessing.html

HelloZeroNet commented 5 years ago

In python the only proper way to do this is by using multiple processes, but syncing data between processes could be slow (usually done by sockets), so it would have significant performance drawback.

Right now if you want to seed many sites I recommend to run multiple clients on same machine.

HelloZeroNet commented 5 years ago

I did a test of file writes:

Writing 100 x 10k file to an SD card:

Serial open().write(): 1.109s, blocks execution: yes (current method)
Parallel, gevent.threadpool(1): 1.392s, blocks execution: no
Parallel, gevent.threadpool(5): 1.000s, blocks execution: no
Parallel, gevent.threadpool(10): 0.922s, blocks execution: no
Parallel, gevent.threadpool(50): 0.843s, blocks execution: no
Parallel, gevent.threadpool(1000): 0.875s, blocks execution: no
Serial, gevent.fileobject.FileObjectThread: 1.437s, blocks execution: no
Parallel, gevent.fileobject.FileObjectThread: 1.109s, blocks execution: sometimes for 0.2-0.3s

On an SSD:

Serial open().write(): 0.047s, blocks execution: yes
Parallel, gevent.threadpool(1000): 0.015s, blocks execution: no

So the current method we use is blocks the code execution during the file write. This is usually not a problem on an SSD, but could be problematic on slower or overloaded storage.

As a first step I think it would not be hard to put file writes to different thread.

This is still not multi-cpu execution, but file writes are not cpu bound, so it's not necessary.

HelloZeroNet commented 5 years ago

CPU/MEM usage a bit higher, but pool with 5 threads looks acceptable. As alternative we could detect slow writes and enable only threaded writes if it's necessary.

Direct call: Write done in 15.586s Memory usage: +1.535MB, user: +0.203s, system: +1.734s

Pool 1: Write done in 16.924s Memory usage: +4.059MB, user: +0.719s, system: +1.375s

Pool 5: Write done in 10.568s Memory usage: +4.375MB, user: +2.000s, system: +4.562s

Pool 10: Write done in 9.312s Memory usage: +4.188MB, user: +1.844s, system: +4.750s

Pool 100: Write done in 9.246s Memory usage: +6.480MB, user: +1.219s, system: +4.078s

(writing 1000 files to a SD card)

HelloZeroNet commented 5 years ago

Test for database connection on SD card:

Same thread (current method):
- Block: 2.2306363582611084
Select executed in 4.371s found: 11
. Memory usage: +21.145MB, user: +3.391s, system: +0.812s
- Block: 2.236405849456787

Threadpool(1):
Select executed in 4.234s found: 11
. Memory usage: +22.566MB, user: +3.156s, system: +0.641s

Threadpool(5):
...
- Block: 0.15651345252990723
- Block: 0.2711312770843506
- Block: 0.21411705017089844
Select executed in 4.992s found: 11
. Memory usage: +22.566MB, user: +3.625s, system: +0.906s

So executing db in parallel makes it slower and also blocks the main thread for some reasons, but using a single threadpool looks fine: removes the blocking and does not have significant performance drawback.

HelloZeroNet commented 5 years ago

Test of cryptographic function (signature verification with coincurve):

Same thread (current method):
- Verify x100 x 1000: 17.698s
Memory usage: +0.102MB, user: +16.531s, system: +0.016s
Block: 17.8211407661438

Threadpool(1)
- Verify x100 x 1000: 17.749s
Memory usage: +0.223MB, user: +16.516s, system: +0.031s

Threadpool(2)
- Verify x100 x 1000: 8.544s
Memory usage: +0.258MB, user: +9.500s, system: +1.125s

Threadpool(4)
- Verify x100 x 1000: 16.300s b'1N2XWu5soeppX2qUjvrf81rpdbShKJrjTr'
Memory usage: +0.301MB, user: +31.391s, system: +1.312s

So the good news is binary dependencies releases the GIL and makes use of multiple cores without the need of multiprocessing. The bad news is the performance degrades drastically when we use more Threads than physical cores. (Tested on a dualcore machine with HT)

Note after testing coincurve on a quad-core cpu (1T: 10.380s, 2T: 7.065s, 4T: 6.519s) looks like it does not scales to more than 2 cores

Update:

Sha512 5MB x 50 times test results on 4core (no HT) machine (Win10):

Same thread (current method): 0.895s (blocks, Memory usage: +0.078MB, user: +0.797s, sys: +0.109s)
Threadpool(1): 0.890s (no block, Memory usage: +0.250MB, user: +0.812s, system: +0.078s)
Threadpool(2): 0.528s (no block, Memory usage: +0.293MB, user: +0.797s, system: +0.109s)
Threadpool(3): 0.403s (no block, Memory usage: +0.445MB, user: +0.734s, system: +0.109s)
Threadpool(4): 0.343s (no block, Memory usage: +0.508MB, user: +0.766s, system: +0.094s)
Threadpool(10): 0.343s (some block, Memory usage: +0.582MB, user: +0.688s, system: +0.125s)

Interesting results on the same machine in WSL:

Same thread (current method): 0.429s (blocks, Memory usage: +5.102MB, user: +0.420s, sys: +0.000s)
Threadpool(1): 0.444s (no block, Memory usage: +5.297MB, user: +0.450s, system: +0.030s)
Threadpool(2): 0.255s (no block, Memory usage: +10.504MB, user: +0.470s, system: +0.010s)
Threadpool(3): 0.207s (no block, Memory usage: +15.652MB, user: +0.420s, system: +0.030s)
Threadpool(4): 0.191s (no block, Memory usage: +20.918MB, user: +0.450s, system: +0.060s)
Threadpool(10): 0.263s (some block, Memory usage: +52.188MB, user: +0.390s, system: +0.120s)

So the python running in WSL provides almost twice as fast crypto functions

Update 2:

As for CryptMessage / OpenSSL / Ecies decrypt 10x100message:

OpenSSL 1.0.2q: 1 Thread: 1.704s, 4 Thread: 1.177s
OpenSSL 1.1.0j: 1 Thread: 3.514s, 4 Thread: 1.123s
OpenSSL 1.1.1d: 1 Thread: 1.243s, 4 Thread: 0.424s
WSL + OpenSSL 1.1.1-1ubuntu2.1~18.04.4: 1 Thread: 1.032s, 4 Thread: 0.390s

So looks like there is pretty significant difference between versions of OpenSSL library, but it works fine in separate thread.

Update 3

Looks like there is significant difference between compiled openssl dlls (all 1.1.1d 64bit):

"Official" from: 1T: 1.224s, 4T: 0.422s
Curl for win: 1T: 0.928s, 4T: 0.332s
Conda repo, compiled using vc9: 1T: 1.755s, 4T: 0.505s
Conda repo, compiled using vc14: 1T: 2.189s, 4T: 0.705s
Conda repo, compiled using vs2015: 1T: 1.252s, 4T: 0.421s

d14na commented 5 years ago

Right now if you want to seed many sites I recommend to run multiple clients on same machine.

yes, this is how we stopped the service from crashing continuously; using an nginx proxy to run multiple docker instances has made significant difference, as you can see from https://status.0net.io.

next year will be moving to at least 3 individual machines, in geo spaced locations, for redundancy

HelloZeroNet commented 5 years ago

Threaded file writes landed in Rev4287. Making db and crypto multi-threaded is also planned.

HelloZeroNet commented 5 years ago

I have added a multi-threaded crypto for eciesDecrypt in rev4303: https://github.com/HelloZeroNet/ZeroNet/commit/7b210429b50e48de509ef00399604b338789c468

Based on my testing the other crypto functions are too fast to efficiently move them to separate threads. (syncing with other thread has 0.1ms/call overhead)

After some profiling I found that in many cases sorting sorting file download tasks takes lots of time (more than crypto or db), so I added an optimization to it. This should significantly reduce the cpu usage for downloading/updating sites with many files: https://github.com/HelloZeroNet/ZeroNet/commit/66a950a48149f4a8be09d75ac408174c3483d0b2

HelloZeroNet commented 5 years ago

I have added multiple multi-threading related fixes in Rev4322 for bugs that made file downloads look incomplete or failed so the update is recommended if you are past Rev4287. Also added separate thread for database commits.

AyrA commented 4 years ago

Why not use multiple processes, one for each page? I know you claim that interprocess communication via TCP is slow, but unless you try to move more than half a gigabyte per second you're fine. The value was obtained by testing with a single thread on a rather old machine (Intel Core i7 CPU 960@3.20GHz), this is a 10 year old processor that really struggles to play 4k movies.

ZeroNet could provide each page under a different local address (each machine has over 16 millions after all, might as well use them). This would grant better page isolation and allows pages to be run natively in the browser rather than a sandboxed iFrame.

purplesyringa commented 4 years ago

@AryA Some unexpected problems arise like backward compatibility, proxies support, non-loopbacked addresses (hello OS X), etc.

AyrA commented 4 years ago

Of course this would break backwards compatibility, but that's what changes in the major version number are for. Just implement both systems for half a year to allow everyone to migrate.

I don't understand what you mean by proxy support. In regards to non-loopback addresses, I could not find anything either.

HelloZeroNet commented 4 years ago

It's not about bw, but serialization/synchronization.

I just did a test on function call speed:

Single thread: 0.2s
Cross-process: 105s

Multi processing would also mean that we need to load the database and all the libs to every process, so the memory usage would radically increase.

purplesyringa commented 4 years ago

@AyrA You want to use all the 127.0.0.0/16 for ZeroNet, right? First, on OS X only 127.0.0.1 is allowed, other addresses don't loopback.

AyrA commented 4 years ago

@HelloZeroNet

Multi processing would also mean that we need to load the database and all the libs to every process, so the memory usage would radically increase.

You don't have to duplicate a single thing.

You load the things that all processes need (for example the logger, tracker handler, and peer finder logic) into a master process. This means you don't even need to reference the libraries for those things in the child processes at all. The database of each site is loaded by the process that hosts the site, you don't need to load the database of every site into every process simultaneously. Even if you do, SQLite will not load things into memory that you are not asking it to.

It's not about bw, but serialization/synchronization.

I would not try to implement too much synchronization. Two sites should not be able to block eachother and/or the master process for no good reason.

All operating systems this product is intended to run on have a decades long history of optimizations in them. They all figured out pretty well for example how to handle a disk queue from multiple processes. Sites will not write to disk anyways unless they are syncing, which is not too much data anyways and you're almost certainly bottlenecked by the network and not the disk, even if a traditional rotary HDD.

Single thread: 0.2s

Cross-process: 105s

This statistic is irrelevant without seeing how many calls per second were achieved (and how this is problematic to you but not IIS, Apache, Exchange Server, etc.). They all would stall to death if this number was too low.

I'm not sure what number of calls per second you expect. The idea of having multiple processes, one for each site, is that almost no interprocess communication is necessary at all. The processes should mostly operate independently of each other apart from a few things:

Child --> Master:

Ask for peers
Send statistics
Send log messages (unless log is kept on a per-site basis)
Test if master is alive (and kill itself if not anymore)

Master --> Child:

Start child process
Ask for statistics
Send peers
Pause/Continue/Delete commands (Delete is technically not necessary as you can just kill the process and purge the directory)
Assign IP/Port for showing to user (unless given as start argument)
Send shutdown command
Notify of settings change
Check if child is alive (Does not actually needs communication as you can just check for the PID)

Python is not alone with this problem. NodeJS offers clustering in a multiprocess architecture because it has the same single-threading problem.

@imachug If only 127.0.0.1 is available you could still use multiple ports to circumvent the problem. Browsers do site isolation based on the entire origin, not the IP. This still gives you space for approx 50000 sites you could open at the same time and I've never seen a system that can simultaneously open that many browser tabs without offloading some to disk.

HelloZeroNet commented 4 years ago

Currently connections are shared between sites, so you don't need 100 connection to same peer if you want to get updated modification of 100 sites, so you would need to sync/send all data between processes.

The python process takes 32MB with all libraries loaded, before loading any site data or make any connections. So it would be 3.2GB if you have 100sites.

Also I don't see any real reason to move every site to separate process.

purplesyringa commented 4 years ago

@AyrA Ports shouldn't be used because some "browsers" (hello IE/Edge) treat different ports as same-origin. Also, opening many ports might be troublesome if you're hosting ZeroNet on some hosting provider. And not everyone uses domain names...

AyrA commented 4 years ago

you don't need 100 connection to same peer if you want to get updated modification of 100 sites, so you would need to sync/send all data between processes.

You can share sockets between processes (This is usually how multiprocess webservers do it). Also you mentioned before that the bandwidth is not an issue, so having the connection handled by the master process would not be so bad. Multiple sockets to a single client would also make it easier for a client to prioritize a site over another.

The python process takes 32MB with all libraries loaded, before loading any site data or make any connections. So it would be 3.2GB if you have 100sites.

ZeroNet currently uses 1 gigabyte after running for 10 minutes with 8 sites. It's just syncing, I have not opened one of them yet. If memory consumption of multiple processes is a concern for you, you should address this issue first and figure out what makes the process try to hold everything ever received in memory.

That's not how libraries work usually anyways. Half the idea of a library is that you can write code once and use it. The other half is that it doesn't needs to be loaded into memory n times if n applications need it, but only once.

You also don't need to have 100 processes open for 100 sites all the time. After a site has synced you can exit the process if nobody requests anything from it and have the master just wait for peer events to spin it up again (This is pretty much how FastCGI works).

If you don't want to share socket handles, networking could also be handled by an individual process (this is how an Exchange Server does it for example). The site process then merely gets file download/update notifications from the networking process and itself mostly works as a web server.

Also I don't see any real reason to move every site to separate process.

Process isolation. It can help a lot in threat mitigation. The ZeroNet security model is currently to pray that the browser sandbox works and that there is no security vulnerability in ZeroNet itself. A site that figures out how to break out of the sandbox and run in the 127.0.0.1:43110 origin essentially has full control over the ZeroNet instance at that point. You don't need a multiprocess model for this but can instead run each site on its own port, but then you run into the python-is-singlethreaded problem again with multiple tcp listeners.

HelloZeroNet commented 4 years ago

ZeroNet currently uses 1 gigabyte after running for 10 minutes with 8 sites.

That should be a badly designed site or a bug, as I running 50 sites with 70MB of memory usage.

You also don't need to have 100 processes open for 100 sites all the time.

You need to keep connection open to be able to receive updates for the sites.

The ZeroNet security model is currently to pray that the browser sandbox works

I don't see what difference would make on browser sandboxing if we move to multi processing.

purplesyringa commented 4 years ago

site that figures out how to break out of the sandbox and run in the 127.0.0.1:43110 origin essentially has full control over the ZeroNet instance at that point.

Not really. This used to be correct but it looks like almost all API calls are safe enough (i.e. there's no RCE possible and the best thing an attacker can get is private keys, but this can't be mitigated with any other sandbox).

AyrA commented 4 years ago

That should be a badly designed site or a bug, as I running 50 sites with 70MB of memory usage.

It's 5.4 GB now. Not sure what you mean by "badly designed site". A Site should not have control over what ZeroNet keeps in memory or not. This sounds like an evil DoS possibility.

You need to keep connection open to be able to receive updates for the sites.

You don't need to be able to update a site you're not using instantly. If I'm not using a site I don't see the difference between it constantly updating or only every 5 minutes.

This used to be correct but it looks like almost all API calls are safe enough (i.e. there's no RCE possible and the best thing an attacker can get is private keys, but this can't be mitigated with any other sandbox).

Getting private keys is usually an absolutely horrible thing.

I've seen many people claim that their software is safe but unless it has been audited by an independent entity that does these kind of things professionally and regularly, I will not believe you on that one, especially since you said yourself that only "almost all" calls are "safe enough".

purplesyringa commented 4 years ago

It's 5.4 GB now

O_o That's wrong for sure. What OS and ZeroNet version are you using? What sites do you seed?

purplesyringa commented 4 years ago

I will not believe you on that one, especially since you said yourself that only "almost all" calls are "safe enough".

My point was that it's not that bad as it was before. You know, there were some RCEs but everything I could find was fixed so I think it's a lot safer than before. As for private keys being stolen, that most likely won't happen because private keys aren't available via API and are only available during file signing.

HelloZeroNet commented 4 years ago

You don't need to be able to update a site you're not using instantly. If I'm not using a site I don't see the difference between it constantly updating or only every 5 minutes.

You can't delay the updates as there is central server. So if don't accept updates any time then it means other people can't make new comment/etc. anytime.

AyrA commented 4 years ago

You can't delay the updates as there is central server. So if don't accept updates any time then it means other people can't make new comment/etc. anytime.

This would mean that not a single person has the site open. If the update interval is 5 minutes and 10 people have the site, it would mean that on average, your update gets through after 30 seconds.

O_o That's wrong for sure. What OS and ZeroNet version are you using? What sites do you seed?

Windows 7 x64, 0.7.1 rev4322.

I used this to cause the memory issue: 1DdPHedr5Tz55EtQWxqvsbEXPdc4uCVi9D

While it syncs, the memory jumps up to about 1.5 GB after a while and stays there most of the time. I'm not exactly sure what causes the problem of it jumping up. I've seen it happen when I blacklist stuff (click the name of a board to get the menu) while it still has 20k files to sync. In that case you might find that blacklisting might lock up the site (and really most of ZeroNet) for minutes.

purplesyringa commented 4 years ago

your update gets through after 30 seconds

...to a single peer that will shut down soon. Now what?

AyrA commented 4 years ago

...to a single peer that will shut down soon. Now what?

The next peer can get the update when it polls for it. The chance for a peer shutting down is the same as for a new peer to appear so it's statistically irrelevant. If all possible peers go away you can't publish your update, but there is nobody around to looking at it anyways so nothing is lost.

HelloZeroNet / ZeroNet