Closed: eechen closed this issue 6 years ago.
Hi @eechen, I changed pm.max_children in #3291, where the changes are explained.
In this case it was exactly because PHP-FPM was emitting this error:
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers)...
It's only the plaintext benchmark where the max concurrency is 16,384.
But for the other tests, the benchmark changed the concurrency from 256 to 512. So I changed pm.start_servers from 256
to 512, and the rest according to this change.
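For reference, the resulting pool block would look something like this sketch; only pm.start_servers is taken from the comment above, and the other values are illustrative guesses (the real ones are in #3291):

```ini
pm = dynamic
pm.start_servers = 512
; remaining values scaled to match -- illustrative, not from the PR:
pm.min_spare_servers = 512
pm.max_spare_servers = 1024
pm.max_children = 1024
```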
I think you're confusing the master process and child processes, as used by nginx, Node.js, Go, etc. For example, for nginx they recommend about one to one and a half worker processes (worker_processes) per CPU. And by default each worker has a maximum of 512 connections (worker_connections). The right worker_connections value depends mostly on your application: memory, CPU, or I/O. If your application needs 10 MB per child, you need to do the math so you don't use more than your server's memory.
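In nginx configuration terms, the two knobs mentioned are worker_processes and worker_connections; a minimal sketch with illustrative values:

```nginx
# one worker per CPU core is the usual recommendation
worker_processes auto;
events {
    # maximum simultaneous connections per worker (illustrative value)
    worker_connections 512;
}
```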
The theory is good, but I prefer real data, especially from high-performance servers; I can only test on my 8-core PC. Unfortunately :sob: I have not yet seen the results from PRs #3265 and #3291 in the continuous benchmarking for tuning the PHP stack better. But the change to use Docker is almost finished, and we will soon see the results.
In the PHP tests, the interprocess communication between Nginx and PHP-FPM also consumes significant CPU. And do you disable Nginx's access_log in the PHP tests?
You can look at this test, where PHP running with Swoole (PECL), without Nginx, is faster than Go's HTTP server: https://github.com/swoole/swoole-src/issues/1401 That is because Swoole has a built-in HTTP server, like Node.js.
Golang
Requests per second: 1475.43 [#/sec] (mean)
PHP + Nginx
Requests per second: 32.63 [#/sec] (mean)
PHP + Nginx + Pconnect!
Requests per second: 36.18 [#/sec] (mean)
PHP + Swoole
Requests per second: 1036.04 [#/sec] (mean)
PHP + Swoole + Pconnect!
Requests per second: 1571.12 [#/sec] (mean)
I also ran the JSON serialization test on my laptop (Ubuntu 14.04 with an i5-3230M).
```php
<?php
$server = new swoole_http_server('0.0.0.0', 8080);
$server->set(array(
    'worker_num' => 4, // i5-3230M with 4 cores
));
$server->on('request', function ($req, $res) {
    $res->header('Content-type', 'application/json');
    $res->end(json_encode(array('message' => 'Hello, World!')));
});
$server->start();
```
The ApacheBench test result:
ab -k -c100 -n100000 http://10.42.0.1:8080/
Server Software: swoole-http-server
Server Hostname: 10.42.0.1
Server Port: 8080
Document Path: /
Document Length: 27 bytes
Concurrency Level: 100
Time taken for tests: 1.596 seconds
Complete requests: 100000
Failed requests: 0
Keep-Alive requests: 100000
Total transferred: 18700000 bytes
HTML transferred: 2700000 bytes
Requests per second: 62670.03 [#/sec] (mean)
Time per request: 1.596 [ms] (mean)
Time per request: 0.016 [ms] (mean, across all concurrent requests)
Transfer rate: 11444.62 [Kbytes/sec] received
Swoole uses 4 cores to get 60k RPS, so I think Swoole could get 600k RPS with 40 cores.
TechEmpower's environment has a 40-core CPU, yet the TechEmpower PHP test's RPS is even less than 40k.
Such a large gap, and the reason is not PHP, but the difference between Swoole and Nginx/PHP-FPM.
@eechen
> but the difference between Swoole and Nginx/PHP-FPM.
The issue with PHP is that it bootstraps everything on each request.
For instance, let's say you want to use a popular framework. It bootstraps all the configuration files, the autoloader, the dependency injection scripts, and whatever else; does its tasks; returns the output; and then dumps all that work, only to redo it on the next cycle. The advantage is safety, because no data from one request can spill over to the next, but the disadvantage is that each new request requires this entire circus again and again.
Swoole, on the other hand, runs that whole bootstrap cycle on the first request and reuses it for all future requests. Add to this the built-in HTTP server, which means no need for external communication.
As a result, it handles requests similarly to a compiled language like Go.
> Swoole use 4 cores to get 60k RPS. I think Swoole can get 600k RPS with 40 cores.
And yes, Swoole would easily hit 600k RPS on that number of cores.
Just for fun, I ran your code under Windows WSL: Requests per second: 112640.27 [#/sec] (mean)
And that is on a 6-core i5 @ 4.7 GHz, and Windows WSL has a habit of eating some performance.
If we want fairer testing, maybe Swoole needs to be added to the TechEmpower site. If somebody has the time to submit it as a pull request, I am sure it will not be rejected.
PHP-FPM bootstraps everything on each request, which surely consumes CPU. But in simple tests like JSON serialization, it should not have such an obvious influence.
I ran the JSON serialization test on my laptop (Ubuntu 14.04 with an i5-3230M) again. But this time I used the PHP built-in HTTP server: no Nginx, no log, no interprocess communication.
The code:
```php
<?php
header('Content-type: application/json');
echo json_encode(array('message' => 'Hello, World!'));
?>
```
The test:
nohup php -S 0.0.0.0:8080 -t ./ >/dev/null 2>&1 &
ab -k -c100 -n100000 http://127.0.0.1:8080/
The result:
Server Software:
Server Hostname: 127.0.0.1
Server Port: 8080
Document Path: /
Document Length: 27 bytes
Concurrency Level: 100
Time taken for tests: 8.324 seconds
Complete requests: 100000
Failed requests: 0
Keep-Alive requests: 0
Total transferred: 18400000 bytes
HTML transferred: 2700000 bytes
Requests per second: 12013.38 [#/sec] (mean)
Time per request: 8.324 [ms] (mean)
Time per request: 0.083 [ms] (mean, across all concurrent requests)
Transfer rate: 2158.65 [Kbytes/sec] received
The PHP built-in HTTP server is a single process without threads, using one CPU core, and it can get 12k RPS. Compared with Swoole's 60k/4 = 15k RPS per core, the difference is not so big.
At this point you're only testing C calls... That test is flawed for several reasons.
Nobody has ever said that pure PHP functions are slow, because PHP technically is nothing but a bunch of C libraries (a gross overstatement). The issue arises when each request needs to go through a web server like Nginx, plus this constant bootstrapping. I can show you code where I did the same tests with Go, PHP, and Swoole, and of course raw PHP did almost as well as the rest. Nothing to bootstrap, no inter-process communication... It really just tests a language's raw performance, which is no surprise when we are technically calling C libraries.
By the way: you do know that using the built-in PHP web server in production is massively dangerous? You're not supposed to use it for any production work; it is not hardened for that.
http://php.net/manual/en/features.commandline.webserver.php
This web server was designed to aid application development. It may also be useful for testing purposes or for application demonstrations that are run in controlled environments. It is not intended to be a full-featured web server. It should not be used on a public network.
The reason Swoole, Go, etc. perform well is that they bootstrap the annoying items only one time (on initial startup): setting up an autoloader, establishing DB connections, checking and opening a file handle to a log, or whatever else your framework does. Raw PHP keeps redoing that on each and every request. Understand? This is why your test is flawed. All it shows is that PHP without all the interpreter handling is not slow (which is no surprise for most people).
@Wulfklaue The PHP built-in HTTP server is surely designed for development, because it has only one process and cannot use multiple CPU cores. But it is undoubtedly an HTTP server.
In TechEmpower's JSON serialization test, PHP only gets 40k RPS with 40 CPU cores, which is only 1k RPS per core. Compared with my test's 12k RPS per core, the difference is so big. But why?
https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/PHP/php/json.php
https://www.techempower.com/benchmarks/#section=data-r15&hw=ph&test=json
As stated several times, it's the nature of the beast, and many different factors can contribute to the lower numbers on TechEmpower.
Hardware:
Your CPU cores are newer, with a higher base clock, so they can push more instructions per core. The 40-core (40 + 40 HT) server they use is a multi-socket system, and multiple sockets introduce latency through inter-socket communication. Add to this that servers tend to run FB-DIMMs, which are a different beast than your unbuffered DIMMs.
Hell, maybe your CPU has SSE instructions that the 40-core does not have. Take into account that their CPU is from 2011 and yours is from 2013. Changes happen to CPU designs all the time.
That is one factor, and we have not even gone into L1/L2/L3 cache speeds, hits, and misses. A single-threaded PHP web server can enjoy the full cache for itself without switching data out.
You see how fast hardware differences stack up? The hardware they benchmark on is freaking old. In the past I went from a dual-socket 12-core, 24-thread system to an 8-core, 16-thread single-socket system. You would expect me to lose performance; no, I gained 50% more performance, just because of hardware differences.
PS: Your CPU is only a 2-core (2 + 2 HT).
Programs also tend to do much better single-threaded than multi-threaded, as multithreading introduces issues like context switching and other delays. People think that going from 2 cores to 4 cores (assuming your cache etc. all doubles to match) will make their programs twice as fast. Wrong... you gain maybe 90% at best, as multithreading introduces issues and overhead.
Communications:
All requests go through Nginx. This is another bottleneck, because Nginx needs to handle each request, contact PHP-FPM, wait for the response, and give you back the response. Unlike a built-in web server, this massively delays communication. If Nginx is not finely tuned for a specific load, it will act as a bottleneck. Each load is different... You could probably fine-tune Nginx to respond better to the JSON benchmark, but as a result it might hurt the other benchmarks.
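For context, the extra hop described here is the FastCGI pass from nginx to PHP-FPM; a minimal sketch of that wiring, with an assumed socket path:

```nginx
# Each request matched here is forwarded to PHP-FPM, and nginx waits
# for the FastCGI response before answering the client.
location ~ \.php$ {
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_pass unix:/run/php/php-fpm.sock;  # assumed socket path
}
```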
Network:
You're forgetting an important part: you are benchmarking locally on your own system! That means you have no network communication delays. TechEmpower simulates a more real-world scenario, with clients contacting the servers, setting up a connection, waiting for responses, then sending their requests...
Local machine testing is never the same as network testing. Run a local echo "hello world" server, then run the same echo server on remote hardware, and you will see a drop in performance simply because of the communication-layer delays. Even a standard internal network adds a small per-request delay; do that with thousands of requests and it stacks up. Try an external test and it's even worse.
PHP:
As stated before, for every request PHP needs to accept the connection. This is a MASSIVE hit in performance. Then it checks the cache (if you use the APC cache), runs the bytecode (transforms it into actual executable code), performs whatever your code does, gets the results, and returns them over that same connection.
OS: ... I could go on for a while.
There are so many factors at play... This is why benchmarks are the root of all evil. Even small, stupid details can have a massive effect on performance. You keep comparing against a system that is alien to what you are running (hardware-wise), under totally different testing conditions (network) and different configuration conditions.
This is my last response on this subject, as it adds nothing useful toward solving the issue. The issue cannot be solved without direct access to the hardware to inspect and fine-tune each part for maximum performance.
And that is not what TechEmpower's test is about. It's about comparing languages on the same system and how they stack up against each other, NOT how languages stack up on different systems.
@Wulfklaue I ran the JSON serialization test for Node.js on the same machine and system (my laptop).
```javascript
var http = require("http");
http.createServer(function (req, res) {
    res.writeHead(200, {"Content-Type": "application/json"});
    res.end(JSON.stringify({"message": "Hello, World!"}));
}).listen(8080, "0.0.0.0");
```
ab -k -c100 -n100000 http://127.0.0.1:8080/
Server Software:
Server Hostname: 127.0.0.1
Server Port: 8080
Document Path: /
Document Length: 27 bytes
Concurrency Level: 100
Time taken for tests: 10.163 seconds
Complete requests: 100000
Failed requests: 0
Keep-Alive requests: 0
Total transferred: 13400000 bytes
HTML transferred: 2700000 bytes
Requests per second: 9839.81 [#/sec] (mean)
Time per request: 10.163 [ms] (mean)
Time per request: 0.102 [ms] (mean, across all concurrent requests)
Transfer rate: 1287.63 [Kbytes/sec] received
Node.js only gets less than 10k RPS per core, slower than PHP's built-in HTTP server. In TechEmpower's Node.js test the RPS is 400k on 40 cores, about 10k per core, which is close to my result. So why is the difference in PHP's test results so big?
Node.js uses an event-driven, non-blocking I/O model that makes it lightweight and efficient.
In just about any benchmark, Node.js beats PHP thanks to its coroutine-style non-blocking model. Swoole also uses this... result: Swoole kicks raw PHP's behind.
Nginx receives a request. PHP processes it. When it's finished, it returns. This is a blocking process. If Nginx gets more requests, it starts new workers, with a configured limit on how many can run at the same time. PHP-FPM gets a request and starts a PHP instance (if there are more requests than pre-loaded instances, that is slow). It processes the request and returns. But both Nginx and PHP-FPM have built-in limits on how many requests they can handle (this again depends on what server you run and how you configure Nginx and PHP-FPM to get the most out of it). Think of it as a 1:1 relationship: request => response. As long as the response is not fulfilled, that worker is blocked.
Node and Swoole can keep going: while waiting for the result of request 1, they can process more requests.
In other words, when Swoole or Node.js gets a request, it can use a coroutine to split the work. It performs part of the operation; while it is busy waiting on request 1, it can start sending request 2 down the pipeline. That is non-blocking I/O. Combine this with a thread scheduler and you have a non-blocking system.
I advise reading up on coroutines and threads, because it can get complicated. This is one of the reasons people like Go so much: it has built-in coroutines plus automatic thread scheduling, and as a result it is almost plug-and-play and can maximize performance out of the box. PHP, by contrast, is limited to the single-process request model and has no real built-in support for coroutines.
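The non-blocking behaviour described above can be sketched in Node.js like this (fakeIo() is a made-up stand-in for a slow database or network call):

```javascript
// Stand-in for a slow I/O call (e.g. a DB query): resolves after `ms`.
function fakeIo(id, ms) {
    return new Promise(function (resolve) {
        setTimeout(function () { resolve("result " + id); }, ms);
    });
}

// While request 1 is waiting on its I/O, the event loop is free to
// start request 2: the two 50 ms waits overlap instead of stacking.
async function main() {
    var t0 = Date.now();
    var results = await Promise.all([fakeIo(1, 50), fakeIo(2, 50)]);
    return { results: results, elapsed: Date.now() - t0 };
}
```

Run sequentially, the two 50 ms waits would take about 100 ms; overlapped on the event loop, the total stays close to 50 ms.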
You will notice, if you use Swoole and are not careful, that some specific functions can leak memory, because PHP is designed around that linear cycle: request, perform, and dump everything.
And please stop using the TechEmpower issue tracker as a way to learn how specific servers and software work. Can somebody close this thread? This does not belong here as an open issue.
You're both right.
@eechen The PHP stack configuration isn't correct in the last rounds. I filed this issue after round 14: Verify the PHP results #2717, and these two PRs: PHP twice as fast #3265 and Update and reconfigure PHP stack #3291. But those changes are not included in round 15 yet.
@Wulfklaue Yes, PHP is different, with advantages and disadvantages. And for performance, bootstrapping on every request is a big disadvantage.
This benchmark is for full-stack and micro frameworks, and also platforms (Go, Node.js, PHP, ...). As you said, PHP is using C calls, so raw PHP will be similar to other platforms. Check the fortunes test in round 12: PHP 5, even without an optimal config, is almost as fast as go-prefork. The frameworks then add overhead: from go-prefork to gin, and from Node.js to Express, the frameworks serve half the req/s.
I think there is a general misconception that PHP is slow. If Go were slower than Node.js in the benchmark, alarms would go off in the community. But if PHP is slower, it's considered normal. Actually, PHP is similar in performance to other platforms.
@eechen I am not linked to TechEmpower in any way. I only help in my limited free time, as I do in other open source projects. The PHP community needs to help to achieve an optimal PHP stack config.
@eechen
I'm working now on the PHP stack configuration with the new Docker benchmark. I hope it will be ready next week. If the results aren't similar to round 12, I will check all the changes from round 12 to 13.
But perhaps there is a regression in the Nginx FastCGI module. That is why I want to see more PHP stacks, from Apache and H2O to Swoole. First we will know which one is faster, and second we will see regressions when updating.
Nate added the Swoole JSON test, but you or the Swoole community could add the rest of the tests.
Swoole is faster and scales better than Node.js and Go, at least in the Vagrant box (2 cores and 3 GB) with this benchmark.
| PHP | PHP Swoole | Node.js | Go |
|---|---|---|---|
| 96447 | 199288 | 132686 | 178363 |

Total requests in 15 seconds (JSON test).
Don't expect any FastCGI setup to be the faster one here. All the advantages go to the frameworks with an integrated server. Period.
And remember, the faster the server, the greater the difference.
This test is more similar to a real app. Here the differences narrow, and we can see PHP close to the others.
Add to the fortunes test: compression (gzip or brotli), SSL, static files, ... We don't have numbers for these; you will need to try them for your app. What to do? Include all of that in the framework's server, or add a web server for it? The numbers change again.
Possibly you don't need that for an API. But a lot of the time APIs sit behind a proxy (Nginx or another), because they are microservices or for some other reason.
Benchmarks are really useful, but we need to understand these numbers in a context.
Depending on the app we can use one framework or another, but what matters most is to use it correctly.
@joanhey Thanks for that detailed write up.
@eechen @Wulfklaue We don't mind nor discourage discussions like this. A better place for it might be the google group as we want to try and use the issues tracker for actual issues. Thanks everyone.
If the test server's CPU has 40 cores, I think max_children should be adjusted to 40 or 60.
In other words, the number of PHP-FPM processes should be about one and a half times the number of CPU cores.
Because it can significantly reduce CPU context switching in TechEmpower's tests, just like Go's runtime.GOMAXPROCS(runtime.NumCPU()).
pm = static (TechEmpower's test uses "dynamic")
pm.max_children = 60 (TechEmpower's test uses "1024")
Thanks.