ErikTheBerik closed this issue 1 year ago.
This is a good starting point to understand how the Drogon threading model works: https://clehaxze.tw/gemlog/2021/12-11-drogon-threading-model.gmi . I do not know your application exactly, but you have to understand where you can call sleep_for and where you cannot. Otherwise you will put Drogon to sleep and transfer execution to other applications in the OS (while Drogon sleeps for 10 seconds). Drogon also uses C++ coroutines to run routes in controllers: when one route calls co_await, control is transferred to another coroutine until that one calls co_await, co_return, etc.
Coroutines do not help. RPS is still limited by the CPU core count. Not an issue with other frameworks.
> This is a good starting point to understand how the Drogon threading model works: https://clehaxze.tw/gemlog/2021/12-11-drogon-threading-model.gmi […]
Thanks for the link @VladlenPopolitov, it was really helpful. I was thinking about using C++20 with coroutines, but now, after reading:
> Coroutines do not help. RPS is still limited by the CPU core count. Not an issue with other frameworks.
I'm not sure coroutines will help. And @biospb, what do you mean by it not being an issue with other frameworks? Are there other HTTP frameworks where RPS is not limited by the core count?
Drogon was a bottleneck in an RPC JSON service. Neither AsyncTask nor explicitly used co_await resolved it. Coroutines still blocked the request handler, so response times for simple requests could go up to 6000-10000 ms at 100+ RPS, while the common request time is no more than 50 ms. Exactly the same code built with CPPCMS has no performance issues; I am very curious why. The only helpful use of that limiting behavior was in another service that had a CPU-intensive part, where there was no point in running more requests simultaneously. CPPCMS ran up to 60 queries on 12 cores in that case, and response times had a 10x spread.
And there's no way to achieve that in Drogon, @biospb? I'll close the issue in that case; I thought it had something to do with the way I was using the library.
Probably it can be achieved by running in separate threads instead of coroutines, or by using thread pools, as was suggested somewhere. Let's wait for other suggestions.
> Drogon was a bottleneck in an RPC JSON service.

@biospb, what do you mean? Where is the bottleneck? RPC (remote procedure call?): I do not see any RPC in the Drogon sources. And what do you call a 'JSON service'? I do not see any services in the code. If there is a bottleneck, it would be interesting to know where it is.
The bottleneck is in handling requests, no matter what kind.
Just make a simple /hello handler that outputs "world" after a 1-second wait (using std::this_thread::sleep_for(std::chrono::seconds(1)), or a mutex that a separate thread keeps locked for that time). Try to query it with something like ab -c 100 -n 100; the total time comes out far above 1 second.
Drogon results (the same for drogon::AsyncTask and an asyncHandleHttpRequest handler):
```
Concurrency Level:      100
Time taken for tests:   6.036 seconds
Complete requests:      100
Total transferred:      21200 bytes
Total body sent:        20000
HTML transferred:       4900 bytes
Requests per second:    16.57 [#/sec] (mean)
Time per request:       6035.700 [ms] (mean)
Time per request:       60.357 [ms] (mean, across all concurrent requests)
Transfer rate:          3.43 [Kbytes/sec] received
                        3.24 kb/s sent
                        6.67 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    6   1.9      6       9
Processing:  1010 1489 771.9   1021    5023
Waiting:     1001 1489 772.0   1020    5023
Total:       1010 1495 771.4   1029    5027

Percentage of the requests served within a certain time (ms)
  50%   1029
  66%   2020
  75%   2025
  80%   2027
  90%   3020
  95%   3025
  98%   4025
  99%   5027
 100%   5027 (longest request)
```
The same test with CPPCMS:
```
Concurrency Level:      100
Time taken for tests:   2.031 seconds
Complete requests:      100
Total transferred:      16400 bytes
Total body sent:        20000
HTML transferred:       4800 bytes
Requests per second:    49.25 [#/sec] (mean)
Time per request:       2030.568 [ms] (mean)
Time per request:       20.306 [ms] (mean, across all concurrent requests)
Transfer rate:          7.89 [Kbytes/sec] received
                        9.62 kb/s sent
                        17.51 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    3   0.5      4       4
Processing:  1003 1008   2.7   1009    1019
Waiting:     1002 1007   2.2   1008    1015
Total:       1006 1011   2.5   1012    1019

Percentage of the requests served within a certain time (ms)
  50%   1012
  66%   1013
  75%   1013
  80%   1013
  90%   1014
  95%   1014
  98%   1014
  99%   1019
 100%   1019 (longest request)
```
I am using 100 threads in the config for both setups. But Drogon really does not use all the threads if the core count is much lower; the same goes for coroutines. Btw, I found that by default CPPCMS uses 5x the core count for its thread count and also has an internal limit (~3x higher than Drogon's). 50 could also be a limit of ab itself.
@biospb To me this looks like a bad configuration; I am referring to the official documentation regarding threads here.
> using 100 threads in config for both setups.
I think this is the culprit.
Using 3 threads, for example, for both gives these results:
Drogon:
Cppcms:
Something is definitely wrong in Drogon. I will take a look at the code later.
@biospb I am just curious why you mention CPPCMS again and again. I do not want to blame that framework, but it does not look like a live project. I evaluated it some months ago and decided not to use it. It looks like the author has lost interest in his child. Let's look:
If you compare Drogon and CPPCMS, it is like comparing an Arduino and an Intel 80286. The 80286 is probably still better, but it has been dead for many years, while Arduino has already gone 32-bit. I hope Drogon gets HTTP/2 support in the nearest months, and then it will be incomparable with CPPCMS and similar frameworks.
Yes, it is almost dead, but moving from it to Drogon caused an application slowdown, so I am trying to find the culprit.
@biospb Thanks for your feedback. I think it depends on how your program is written. Please see the TFB test results: in every test, Drogon is far ahead of CPPCMS. Would you like to post the source code so we can figure it out?
@biospb If you insert sleep_for in controller code, you block the framework's main loop, and you block all controllers dependent on that thread, instead of delaying execution only in your controller. If you need to make a pause in your controller for some reason, you have to use runAfter(delaySec, function) and specify the delay time in seconds. Example:
```cpp
app().registerHandler(
    "/hellodelay",
    [](const HttpRequestPtr &,
       std::function<void(const HttpResponsePtr &)> &&callback) {
        // run after 1.0 sec without blocking the event loop
        drogon::app().getLoop()->runAfter(1.0, [callback]() {
            auto resp = HttpResponse::newHttpResponse();
            resp->setBody("delay and Hello, world");
            callback(resp);
        });
    },
    {Get});

app().registerHandler(
    "/hellosleep",
    [](const HttpRequestPtr &,
       std::function<void(const HttpResponsePtr &)> &&callback) {
        using namespace std::chrono_literals;
        // not recommended: this blocks the event loop
        // instead of pausing only this controller
        std::this_thread::sleep_for(1000ms);
        auto resp = HttpResponse::newHttpResponse();
        resp->setBody("sleep and Hello, World!");
        callback(resp);
    },
    {Get});
```
The output of a load test on the hellodelay handler (ab -c 100 -n 100):
```
Requests per second:    46.02 [#/sec] (mean)
Time per request:       2173.052 [ms] (mean)
Time per request:       21.731 [ms] (mean, across all concurrent requests)
Transfer rate:          8.04 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        5    7   1.7      6      20
Processing:  1027 1120   9.4   1121    1122
Waiting:     1022 1119   9.9   1121    1122
Total:       1047 1126   8.1   1127    1128

Percentage of the requests served within a certain time (ms)
  50%   1127
  66%   1127
  75%   1128
  80%   1128
  90%   1128
  95%   1128
  98%   1128
  99%   1128
 100%   1128 (longest request)
```
The 1127 ms includes the delay, the framework's overhead, and the granularity of the timer (it does not mean the overhead is 127 ms; the delay timer probably has a granularity of around 100 ms).
Frankly speaking, if you need to assess the delay introduced by the frameworks themselves, you have to run both of them with minimal user code; otherwise you are measuring the user code's delays.
I hope this resolves your doubts.
sleep is used here as a stand-in for something long: it could be database or storage access, a CPU-intensive task, and so on. It can be replaced with a locked mutex. In my tests I used an async handler and explicitly called co_await, with the sleep done in a coroutine; it still blocks execution. The real application performs the same way.
@biospb It is a really good question: how should CPU-intensive tasks initiated by controller code be run? I hope the Drogon team will give a recommendation. @an-tao, could you clarify this?
I use a plugin with a separate thread to run calculations. A std::future is created in the plugin's initAndStart function; the worker sleeps periodically while waiting for work, checks a shutdown flag, does as much work as it needs (without blocking any other threads; a blocking SQL call blocks only itself and does not disturb the database threads), and sleeps again. Very short code:
```cpp
#include <drogon/plugins/Plugin.h>
#include <atomic>
#include <future>

class p_Scheduler : public drogon::Plugin<p_Scheduler>
{
  public:
    p_Scheduler()
    {
    }
    /// This method is called by drogon to initialize and start the plugin.
    void initAndStart(const Json::Value &config) override;
    /// This method is called by drogon to shut down the plugin.
    void shutdown() override;

  private:
    // worker loop: sleep, check the shutdown flag, do pending work, repeat
    void RunScheduler();
    std::future<void> savedFuture{};
    std::atomic<int> runSchedulerRun{};
};

void p_Scheduler::initAndStart(const Json::Value &config)
{
    /// Initialize and start the plugin
    try
    {
        runSchedulerRun.store(1);
        savedFuture = std::async([this] { this->RunScheduler(); });
    }
    catch (std::exception &e)
    {
        LOG_ERROR << "plugin start exception " << e.what();
    }
}

void p_Scheduler::shutdown()
{
    /// Shutdown the plugin
    runSchedulerRun.store(0);
    try
    {
        savedFuture.get();
    }
    catch (const std::exception &)
    {
    }
}
```
Regarding your comment about calling sleep in a coroutine: if you call sleep(), it blocks the thread in which you called it (usually the main loop). Coroutines are not asynchronous by themselves. Coroutines are really "goto with classes": longjmp is forgotten and goto is forbidden, but sometimes they are needed. When you use co_await, co_return, or co_yield, the program effectively jumps to the place where the caller invoked the coroutine (doing some additional bookkeeping in the coroutine's methods). When the caller resumes the coroutine, it is effectively a jump back to where the last suspension occurred. If I programmed this with a real goto, I would be called a stupid programmer who does not understand simple principles; if I program it with co_await, I am modern, advanced, etc.
@VladlenPopolitov I don't quite agree. The goto statement doesn't come with thread switching, but when a coroutine resumes its execution, it is possible (though not necessary) for it to continue on another thread. When a coroutine's sleep statement is executed, it doesn't block the current thread. Thus, coroutines are indeed asynchronous. Therefore, IMHO, coroutines should be referred to as "flattened callbacks."
The asynchronous programming paradigm aims to reduce idle waiting of threads, enabling all threads to operate efficiently. This allows a small number of threads to handle a large volume of concurrent requests. Therefore, for compute-intensive tasks, asynchronous solutions do not offer significant advantages. For such tasks, executing them directly within the current IO thread is sufficient. If there's concern about impacting the response latency of simple APIs, these tasks can also be placed in a thread pool for execution. Simultaneously, proper flow control should be implemented (please refer to the Hodor plugin) to prevent excessive requests from keeping the CPU under prolonged high load.
Note that compute-intensive tasks do not include timers, database queries, Redis queries, requesting results from other services, and similar tasks. These tasks can improve throughput by yielding the current thread during IO waiting through callbacks or coroutines.
> For such tasks, executing them directly within the current IO thread is sufficient. If there's concern about impacting the response latency of simple APIs, these tasks can also be placed in a thread pool for execution.
@an-tao Would you recommend running these compute-intensive tasks using something like "trantor::ConcurrentTaskQueue"? Or what would be your recommended approach?
> Simultaneously, proper flow control should be implemented (please refer to the Hodor plugin) to prevent excessive requests from keeping the CPU under prolonged high load.
What do you mean by "proper flow control" and what is the "Hodor" plugin you mention?
Right now I'm running my intensive task (it takes between 100 and 500 ms) directly on the thread I get from Drogon. I'm wondering if there's a better way to delegate that work, or whether there are best practices for building these kinds of APIs in Drogon.
@ErikTheBerik trantor::ConcurrentTaskQueue looks like an excellent option; it is a ready-made solution. I wrote some test code. Declarations and definitions:
```cpp
#include <drogon/utils/coroutine.h>
#include <trantor/utils/ConcurrentTaskQueue.h>
#include <functional>

// define the thread pool
trantor::ConcurrentTaskQueue globalThreadPool(10, "h2load test");

// declare an awaitable for coroutines (my own definitions to make
// co_await usable from a controller coroutine)
using calcIntensive = std::function<void()>;

struct [[nodiscard]] ExecuteAwaiter : public drogon::CallbackAwaiter<void>
{
    explicit ExecuteAwaiter(calcIntensive func)
        : callAndResume_{std::move(func)}
    {
    }
    void await_suspend(std::coroutine_handle<> handle)
    {
        auto taskToQueue = [this, handle]() {
            try
            {
                this->callAndResume_();
                handle.resume();
            }
            catch (...)
            {
                setException(std::current_exception());
                handle.resume();
            }
        };
        globalThreadPool.runTaskInQueue(taskToQueue);
    }

  private:
    calcIntensive callAndResume_;
};

// helper to run a lambda on the thread pool and co_await its completion
ExecuteAwaiter executeIntensiveFunction(std::function<void()> func)
{
    return ExecuteAwaiter{std::move(func)};
}

// helper that runs an empty lambda, so after co_await the coroutine
// continues on a thread from the pool
ExecuteAwaiter switchToThreadPull()
{
    return ExecuteAwaiter{[]() {}};
}
```
Controller code: one version with a lambda doing the intensive work, and one version that switches control to another thread from the thread pool (based on the hello world example):
```cpp
Task<void> sleepHello(const HttpRequestPtr req,
                      std::function<void(const HttpResponsePtr &)> callback)
{
    // the current thread returns to the main loop;
    // the coroutine continues execution on a thread from the pool
    co_await switchToThreadPull();
    using namespace std::chrono_literals;
    // time-consuming work here
    std::this_thread::sleep_for(1000ms);
    auto resp = HttpResponse::newHttpResponse();
    resp->setBody(
        "Hi there, this is another hello from the sleepHello Controller");
    callback(resp);
    co_return;
}

Task<void> sleep2Hello(const HttpRequestPtr req,
                       std::function<void(const HttpResponsePtr &)> callback)
{
    int someVariable{};
    // the current thread returns to the main loop;
    // the coroutine waits for the lambda to run in the thread pool
    // and then continues on that same pool thread
    co_await executeIntensiveFunction([someVariable]() {
        using namespace std::chrono_literals;
        std::this_thread::sleep_for(1000ms);
    });
    auto resp = HttpResponse::newHttpResponse();
    resp->setBody(
        "Hi there, this is another hello from the sleep2Hello Controller");
    callback(resp);
    co_return;
}
```
@biospb Load measurements are:
```
Requests per second:    4.78 [#/sec] (mean)
Time per request:       2093.243 [ms] (mean)
Time per request:       209.324 [ms] (mean, across all concurrent requests)
Transfer rate:          1.03 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        3    4   0.6      4       5
Processing:  1017 1022  15.6   1017    1067
Waiting:     1017 1022  15.4   1017    1066
Total:       1021 1027  15.1   1021    1069

Percentage of the requests served within a certain time (ms)
  50%   1021
  66%   1022
  75%   1024
  80%   1024
  90%   1069
  95%   1069
  98%   1069
  99%   1069
 100%   1069 (longest request)
```
Almost all of this code comes from Drogon; I based the awaiter on Drogon code as well.
> For such tasks, executing them directly within the current IO thread is sufficient. If there's concern about impacting the response latency of simple APIs, these tasks can also be placed in a thread pool for execution.

> @an-tao Would you recommend running these compute-intensive tasks using something like "trantor::ConcurrentTaskQueue"? Or what would be your recommended approach?
Yes, ConcurrentTaskQueue is a thread pool implementation.
> Simultaneously, proper flow control should be implemented (please refer to the Hodor plugin) to prevent excessive requests from keeping the CPU under prolonged high load.

> What do you mean by "proper flow control" and what is the "Hodor" plugin you mention?
Please refer to the Hodor code in Drogon.
> Right now I'm running my intensive task (takes between 100 and 500ms to run) directly on the thread I get from Drogon. I'm wondering if there's a better way to delegate that work or if there's a "best practices" for making these kind of APIs in Drogon
If your server just runs these tasks, that's fine: it's simple and straightforward and has good performance guarantees. In fact, I have a machine learning algorithm service that works the same way.
Closing this issue since my question has been answered. Thank you very much for your help :)
I tried googling this and searching for answers in past issues, but I couldn't really find my answer, even though some issues were somewhat similar. Also, I am very new to C++ async, futures, etc.; I have only ever used std::thread, so this might be a really dumb problem I'm having.
I have a Drogon server running with 16 threads; my CPU is an i9 with 8 cores (MacBook Pro). I have a route "/test" that runs a very slow function and then returns "OK". For now the function is just:
```cpp
std::this_thread::sleep_for(std::chrono::seconds(10));
```
Now, if I request this route 16 times, the following happens:
1st request: 10 seconds (as expected)
2nd request: 20 seconds (seems like the first request is blocking the next requests?)
3rd request: 30 seconds (same as above)
4th request: 40 seconds (same as above)
5th-10th request: 50 seconds (suddenly, after 4 requests, 6 requests run at the same time)
11th-16th request: 60 seconds (again, 6 requests at a time)
I don't understand why the initial 4 requests run one at a time. Also, once the requests do run simultaneously, why is it only running 6 requests at a time when it's supposedly using 16 threads (albeit with just 8 cores)?
I'm not sure if I'm supposed to run the "really slow function" on a new thread, using std::async with std::future, or something else.