rohitjoshi opened this issue 4 years ago (Open)
@rohitjoshi thanks for the information. I haven't tried including a custom allocator for it, but I'm glad to give it a try. I'm not very familiar with this, though; do you have any suggestions on how to achieve it?
Let me submit a PR. It can be included either in a makefile or even as a startup parameter.
Thank you so much.
See oatpp
@rohitjoshi I've submitted a PR to tfb, thanks a lot.
Great, did you update the other docker files as well? I am not sure which one is getting used.
Yes, I updated the docker files. Unfortunately, it cannot be installed on Ubuntu with apt, so I had to build it from source. Thanks.
Any measurable performance improvement? It would be good to compare snmalloc vs mimalloc. Also, I see many other frameworks cheating to improve performance, e.g. using a hardcoded content length.
On my local computer, it gives a 10%-15% QPS improvement, which is exciting. I know some frameworks' tests are written in strange ways, and there are always loopholes in the rules, but I hope the code in the tfb tests looks like code in normal production systems. Thanks for the information.
In the recent benchmark, drogon-core is in 2nd place. Hopefully, with a 10% increase, it will be #1.
Cool, it's pretty exciting to see the next TFB results.
40% higher throughput compared to 3rd place (may-minihttp) in the multiple queries category.
If you do get a comparison between mimalloc and snmalloc, I would be very interested to see the results. Also, if you have issues with using snmalloc, please post to our GitHub.
@mjp41, thank you very much. I have done some tests with the memory allocators. The test results are as follows:
1. normal malloc
wrk -c512 -d15 -t8 http://localhost:8088/plaintext -s pipeline.lua -- 64
Running 15s test @ http://localhost:8088/plaintext
8 threads and 512 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 4.38ms 8.19ms 348.77ms 96.26%
Req/Sec 555.07k 48.57k 1.09M 84.18%
66323456 requests in 15.10s, 7.84GB read
Requests/sec: 4392607.08
Transfer/sec: 532.02MB
2. snmalloc
wrk -c512 -d15 -t8 http://localhost:8088/plaintext -s pipeline.lua -- 64
Running 15s test @ http://localhost:8088/plaintext
8 threads and 512 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.45ms 3.43ms 224.28ms 87.17%
Req/Sec 646.79k 40.61k 1.26M 85.02%
77325440 requests in 15.10s, 9.15GB read
Requests/sec: 5120800.20
Transfer/sec: 620.21MB
3. mimalloc
wrk -c512 -d15 -t8 http://localhost:8088/plaintext -s pipeline.lua -- 64
Running 15s test @ http://localhost:8088/plaintext
8 threads and 512 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.51ms 3.79ms 214.67ms 88.30%
Req/Sec 645.94k 42.95k 702.71k 84.33%
77111168 requests in 15.02s, 9.12GB read
Requests/sec: 5132390.37
Transfer/sec: 621.62MB
I can say that in this test, snmalloc and mimalloc have similar performance improvements.
Interesting. The thread stats for snmalloc look fractionally better, but the overall stats for mimalloc look fractionally better. I could well imagine the noise in the benchmark is greater than the differences. Thanks for sharing the results.
Yes, I agree with you, thanks for your excellent work.
These preliminary benchmarks look really great. Thanks for the contributions and the awesome work which made the framework even better.
Not winning by 10%, though; still good to see the results. I was guessing the gap for the ORM version would have tightened.
I don't want to open a new issue so I'll post it here. Drogon has won first place in the Composite framework score section of Round 19 of the TechEmpower benchmarks, posted two days ago.
Congratulations!
That’s some awesome news! 🎉
Thanks for sharing @vedranmiletic.
Thanks to @an-tao and all other contributors for the great work and reaching this awesome milestone 🙂
Out of curiosity, @an-tao do you have any interest in writing a blogpost or something about how you achieved this performance? Seems like it could be very helpful for people.
Great work!
@deklanw Thanks for your attention. I think drogon's high performance benefits from the following:
Completely non-blocking programming: drogon provides asynchronous interfaces for users to handle HTTP requests and non-blocking asynchronous database interfaces to access the database. Therefore, users can use a small number of threads (usually the number of CPU cores) to process a very large number of concurrent requests. The only blocking point in each thread is the epoll_wait() call; when there is really nothing to do, the thread simply parks there. (See the handler sketch after this list.)
Lock-free: It goes without saying that critical sections protected by a global mutex are harmful to the concurrent performance of a program. I spent a lot of time removing locks from the framework, including by using lock-free queues, FastDbClient, etc. As a result, drogon's execution path has almost no locks, which means that each thread can run at full speed without waiting for the others.
Batch-mode of libpq: there is a batch-mode patch for libpq which can pipeline SQL queries into asynchronous batches on the same connection; this is very helpful for increasing the utilization of database connections. See here for more details. (A sketch of the equivalent upstream pipeline API follows below.)
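To make the first two points a little more concrete, here is a minimal sketch, not taken from the benchmark sources, of the style described above: a handler that stays fully asynchronous and hands its query to a FastDbClient running on the same event loop. The route, SQL, and column names are illustrative assumptions, it presumes a PostgreSQL client configured as a FastDbClient in the app config, and exact signatures may differ slightly between Drogon versions.

```cpp
#include <drogon/drogon.h>
#include <json/json.h>
#include <thread>

int main()
{
    // Register a fully asynchronous handler; the lambda runs on an IO loop
    // thread and must never block it.
    drogon::app().registerHandler(
        "/db",
        [](const drogon::HttpRequestPtr &,
           std::function<void(const drogon::HttpResponsePtr &)> &&callback) {
            // FastDbClient shares the handler's event loop, so the query is
            // issued and its result delivered without locks or thread hops.
            auto client = drogon::app().getFastDbClient();
            client->execSqlAsync(
                "select randomNumber from World where id=$1",
                [callback](const drogon::orm::Result &r) {
                    Json::Value json;
                    json["randomNumber"] = r[0]["randomnumber"].as<int>();
                    callback(drogon::HttpResponse::newHttpJsonResponse(json));
                },
                [callback](const drogon::orm::DrogonDbException &) {
                    callback(drogon::HttpResponse::newHttpResponse());
                },
                1);  // bound value for the $1 placeholder
        },
        {drogon::Get});

    // One IO thread per core; an idle thread simply parks in epoll_wait().
    drogon::app()
        .setThreadNum(std::thread::hardware_concurrency())
        .addListener("0.0.0.0", 8088)
        .run();
}
```

Since neither the handler nor the database callback ever blocks, a handful of IO threads can keep thousands of requests in flight at once.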
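For the third point, the batch-mode patch referenced above later landed upstream as libpq's pipeline mode (PostgreSQL 14). The following is only a rough illustration of that upstream API, not Drogon's internal code: the connection string and queries are placeholders, error handling is pared down, and production code would use a non-blocking socket to avoid deadlocks on large batches.

```cpp
#include <libpq-fe.h>
#include <cstdio>

int main()
{
    PGconn *conn = PQconnectdb("dbname=hello_world");
    if (PQstatus(conn) != CONNECTION_OK)
        return 1;

    PQenterPipelineMode(conn);  // start pipelining on this connection

    // Queue several queries without waiting for any result.
    const char *sql = "SELECT randomNumber FROM World WHERE id = $1";
    const char *ids[] = {"1", "2", "3"};
    for (const char *id : ids)
        PQsendQueryParams(conn, sql, 1, nullptr, &id, nullptr, nullptr, 0);

    PQpipelineSync(conn);  // end the batch and flush the send buffer

    // Drain results: each query yields one PGresult followed by a NULL,
    // and the whole batch is terminated by a PGRES_PIPELINE_SYNC marker.
    for (;;)
    {
        PGresult *res = PQgetResult(conn);
        if (res == nullptr)
            continue;  // separator between one query's results and the next
        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            std::printf("randomNumber = %s\n", PQgetvalue(res, 0, 0));
        const bool done = (PQresultStatus(res) == PGRES_PIPELINE_SYNC);
        PQclear(res);
        if (done)
            break;
    }

    PQexitPipelineMode(conn);
    PQfinish(conn);
    return 0;
}
```

The gain comes from keeping the connection busy: several queries are on the wire before the first result comes back, instead of one round trip per query.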
Basically, the techniques mentioned above are used in almost all the top TFB frameworks. I think the gap between the top frameworks is the result of the accumulation of many implementation details, and of course of the differences between the programming languages.
I see the Rust-based frameworks actix and may-minihttp both using the snmalloc and mimalloc allocators, which show significant improvement. Have you tried including a custom allocator for the Drogon benchmarks?