ericmoritz / wsdemo

A Cowboy Websocket demo

client is a potential bottleneck #39

Open aglyzov opened 12 years ago

aglyzov commented 12 years ago

When testing, one has to ensure the client machine is powerful enough to withstand the CPU load created by the Erlang client program.

On several occasions I saw the client consume more CPU than the server on an identical pair of machines. Watching the htop output of the client and server machines at the same time, it was clear that the Erlang client was CPU-bound while the server had a fair amount of headroom. This was especially pronounced in the first stage of the test, when new connections are being created. Then, after some connections died off due to the client timeout, the client's CPU usage dropped considerably.

So, even assuming the testing machines are identical, the client might be the bottleneck in some cases. This needs to be checked thoroughly.
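For anyone who wants to reproduce the observation, here is a rough sketch (not part of wsdemo; it assumes the psutil Python package is installed) of sampling per-core CPU on the client box while the Erlang load generator runs, to see whether a core gets pegged:

```python
# Rough sketch (not part of wsdemo): sample per-core CPU utilisation on the
# client box while the Erlang load generator is running, to see whether the
# client itself is CPU-bound.  Assumes the 'psutil' package is installed.
import time
import psutil

def sample_cpu(duration_s=60, interval_s=1.0):
    """Print per-core CPU percentages once per interval for duration_s seconds."""
    end = time.time() + duration_s
    while time.time() < end:
        # percpu=True shows whether one core is pegged while the others idle,
        # which is what matters for the single-core vs. dual-core comparison.
        per_core = psutil.cpu_percent(interval=interval_s, percpu=True)
        print("per-core CPU %:", per_core)

if __name__ == "__main__":
    sample_cpu()
```

Running the same script on the server box at the same time gives roughly the equivalent of watching both htop windows side by side.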

jlouis commented 12 years ago

That sounds interesting. It is also interesting because different servers still handle the connections differently. If the client were the sole problem, then all servers "able to keep up" should show roughly the same behaviour. From my initial tests on the handshake-time data, only two systems exhibit the same behaviour: erlang-cowboy and go-websocket. The rest of the bunch have considerably different characteristics.

I agree this is worth investigating. Among other things, I'll consider reading through the client code to figure out what it does and whether there is anything odd in there.

aglyzov commented 12 years ago

@jlouis, note that both of my systems were single-core. Considering that the client does much more processing than a simplistic server, that might be what caused the oddity. Also, I can confirm that almost all systems behave comparably in my tests. I should try adding another core to the client machine and running the tests again. Thanks for the insight.

aglyzov commented 12 years ago

So guys, I added a second CPU core to my client machine, ran some tests, and now have interesting results for you.

First of all, I think my theory about the client being a bottleneck in some cases was right. Check out these screenshots to see what I mean (client with 2 CPU cores on the left, server with 1 CPU core on the right):

java-webbit: https://dl.dropbox.com/u/4663634/websocket-test/java-webbit.png
pypy-twisted: https://dl.dropbox.com/u/4663634/websocket-test/twisted-pypy-1.png
pypy-tornado: https://dl.dropbox.com/u/4663634/websocket-test/tornado-pypy-1.png

Results: https://dl.dropbox.com/u/4663634/websocket-test/websocket-test-results.txt

On a side note: Haskell and Go were unbelievably bad in terms of memory consumption. While it is known that Go has severe memory problems on 32-bit architectures due to its questionable GC design, I am surprised by the haskell-snap behavior.
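To make the memory comparison less eyeball-based, something like this rough sketch (again assuming psutil; the PID of the server under test is a command-line argument, not part of wsdemo) could log the resident set size of each server over the course of a run:

```python
# Rough sketch (not part of wsdemo): log the resident set size of a server
# process over the course of a run so the memory behaviour of the different
# implementations can be compared.  Assumes 'psutil'; takes the PID as argv[1].
import sys
import time
import psutil

def track_rss(pid, duration_s=120, interval_s=5):
    proc = psutil.Process(pid)
    for _ in range(duration_s // interval_s):
        rss_mb = proc.memory_info().rss / (1024 * 1024)
        print(f"{time.strftime('%H:%M:%S')}  RSS: {rss_mb:.1f} MiB")
        time.sleep(interval_s)

if __name__ == "__main__":
    track_rss(int(sys.argv[1]))
```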

jlouis commented 12 years ago

That definitely looks like an overload problem on your hardware. Also note that you are not even getting the 10k handshakes which Eric was getting on Go and Erlang, so Eric's faster machine might help with handling all the connectivity in this case. Perhaps you should post your specs as well as Eric's, so we have an idea of what kind of machine is currently needed to handle the load.

As for the 32-bit limit in the Go GC, it is the price they pay because their language does not use a precise GC (which is a rather bad decision IMO).

aglyzov commented 12 years ago

@jlouis, I am not sure it is overload now that I have added a second core to the client. At least it is not due to CPU anymore. Perhaps it is some other hidden cost of virtualization. Indeed, I am eager to see the results on real hardware.

Note that with the screenshots I was trying to show that the Erlang client was consuming more than one CPU to keep up with certain fast servers.

ericmoritz commented 12 years ago

The server hardware that I have is the following:

AMD Phenom 9600 Quad Core - 2300 MHz
2 GB of memory

The client I will be using is my MacBook Pro, Boot Camped into Ubuntu 12.04 (at least that is the plan).

MBP's stats:

$ sysctl hw
hw.ncpu: 8
hw.byteorder: 1234
hw.memsize: 8589934592
hw.activecpu: 8
hw.physicalcpu: 4
hw.physicalcpu_max: 4
hw.logicalcpu: 8
hw.logicalcpu_max: 8
hw.cputype: 7
hw.cpusubtype: 4
hw.cpu64bit_capable: 1
hw.cpufamily: 1418770316
hw.cacheconfig: 8 2 2 8 0 0 0 0 0 0
hw.cachesize: 8589934592 32768 262144 6291456 0 0 0 0 0 0
hw.pagesize: 4096
hw.busfrequency: 100000000
hw.busfrequency_min: 100000000
hw.busfrequency_max: 100000000
hw.cpufrequency: 2200000000
hw.cpufrequency_min: 2200000000
hw.cpufrequency_max: 2200000000
hw.cachelinesize: 64
hw.l1icachesize: 32768
hw.l1dcachesize: 32768
hw.l2cachesize: 262144
hw.l3cachesize: 6291456
hw.tbfrequency: 1000000000
hw.packages: 1
hw.optional.floatingpoint: 1
hw.optional.mmx: 1
hw.optional.sse: 1
hw.optional.sse2: 1
hw.optional.sse3: 1
hw.optional.supplementalsse3: 1
hw.optional.sse4_1: 1
hw.optional.sse4_2: 1
hw.optional.x86_64: 1
hw.optional.aes: 1
hw.optional.avx1_0: 1
hw.cputhreadtype: 1
hw.machine = x86_64
hw.model = MacBookPro8,2
hw.ncpu = 8
hw.byteorder = 1234
hw.physmem = 2147483648
hw.usermem = 943783936
hw.pagesize = 4096
hw.epoch = 0
hw.vectorunit = 1
hw.busfrequency = 100000000
hw.cpufrequency = 2200000000
hw.cachelinesize = 64
hw.l1icachesize = 32768
hw.l1dcachesize = 32768
hw.l2settings = 1
hw.l2cachesize = 262144
hw.l3settings = 1
hw.l3cachesize = 6291456
hw.tbfrequency = 1000000000
hw.memsize = 8589934592
hw.availcpu = 8
ericmoritz commented 12 years ago

Sorry, the server only has 2 GB of memory. I copy/pasted that from the Craigslist ad. One of the 2 GB modules was bad, so I removed it.

I may have to pick up a 1 or 2 GB module if the OS plus each server starts swapping.

ericmoritz commented 12 years ago

Does anyone know if I should add a "cool down" period between stopping one server and starting the other? Could there be any residual effects of one test in the kernel that could affect the result of another test?

ericmoritz commented 12 years ago

To save you some googling, the server is 64-bit.

aglyzov commented 12 years ago

Once all the processes have exited or been killed, it should be fine. Add a 15-second pause to be on the safe side.
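A minimal sketch of such a cool-down check (not part of wsdemo; Linux-specific since it reads /proc/net/sockstat, and the 50-socket threshold and 15-second pause are arbitrary) might look like this:

```python
# Rough sketch (not part of wsdemo): between benchmark runs, wait until the
# kernel has drained most leftover TIME_WAIT sockets, then pause a bit longer.
# Linux-specific (reads /proc/net/sockstat); threshold and pause are arbitrary.
import time

def time_wait_count():
    """Return the TIME_WAIT ('tw') socket count from /proc/net/sockstat."""
    with open("/proc/net/sockstat") as f:
        for line in f:
            if line.startswith("TCP:"):
                fields = line.split()
                return int(fields[fields.index("tw") + 1])
    return 0

def cool_down(threshold=50, extra_pause_s=15):
    while time_wait_count() > threshold:
        time.sleep(1)
    time.sleep(extra_pause_s)  # safety margin before starting the next server

if __name__ == "__main__":
    cool_down()
```

TIME_WAIT sockets can take up to a minute to drain with default kernel settings, so waiting on the tw count rather than using a fixed sleep avoids starting the next server against leftover kernel state.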

aglyzov commented 12 years ago

Update: I've been testing the servers on a pair of Linode 512 machines. The outcome: basic Linode hardware is capable of handling ~19k active concurrent connections (pypy, erlang, java).

That's roughly $1 a month per 1,000 websockets :)

ericmoritz commented 12 years ago

I like how this thing is turning into a way to benchmark VPS hosts as well as individual WS implementations.

perone commented 12 years ago

@aglyzov, what was the number of active concurrent connections on the Linode for the other benchmarks, like gevent for instance?

aglyzov commented 12 years ago

@perone, gevent-websocket was not doing great, unfortunately. There was a cut-off near 11k connections.

perone commented 12 years ago

@aglyzov, thanks for sharing!