ErikTheBerik closed this issue 1 year ago.
This is a good starting point to understand how the Drogon threading model works: https://clehaxze.tw/gemlog/2021/12-11-drogon-threading-model.gmi . I do not know your application exactly, but you have to understand where you can call sleep_for and where you cannot. Otherwise you will put Drogon to sleep and transfer execution to other applications in the OS (while Drogon sleeps for 10 seconds). Drogon also uses C++ coroutines to run routes in controllers: when one route calls co_await, control is transferred to another coroutine until that one calls co_await, co_return, etc.
Coroutines do not help. RPS is still limited by the CPU core count. Not an issue with other frameworks.
> This is a good starting point to understand how the Drogon threading model works: https://clehaxze.tw/gemlog/2021/12-11-drogon-threading-model.gmi […]
Thanks for the link @VladlenPopolitov, it was really helpful. I was thinking about using C++20 with coroutines, but now, after reading:
> Coroutines do not help. RPS is still limited by the CPU core count. Not an issue with other frameworks.
I'm not sure coroutines will help. And @biospb, what do you mean by it not being an issue with other frameworks? Are there other HTTP frameworks where RPS is not limited by the core count?
Drogon was a bottleneck in an RPC JSON service. Neither AsyncTask nor explicitly used co_await resolved it. Coroutines still blocked the request handler, so response times for simple requests could go up to 6000-10000 ms at 100+ RPS, while the common request time is no more than 50 ms. Exactly the same code built with CPPCMS has no performance issues; I am very curious why. The only helpful use of that limiting behavior was in another service that had a CPU-intensive part, where there was no point in running more requests simultaneously. CPPCMS ran up to 60 queries on 12 cores in that case, and response times had a 10x spread.
And there's no way to achieve that in Drogon, @biospb? I'll close the issue in that case; I thought it had something to do with the way I was using the library.
Probably it can be achieved by running in separate threads instead of coroutines, or by using thread pools, as was suggested somewhere. Let's wait for other suggestions.
> Drogon was a bottleneck in an RPC JSON service.

@biospb, what do you mean? Where is the bottleneck? RPC (remote procedure call?): I do not see any RPC in the Drogon sources. And what do you call a 'JSON service'? I do not see any services in the code. If there is a bottleneck, it would be interesting to know where it is.
The bottleneck is in handling requests, no matter what kind.
Just make a simple /hello handler that outputs "world" after a 1-second wait (using std::this_thread::sleep_for(std::chrono::seconds(1)), or a mutex that a separate thread keeps locked for that time). Try to query it with something like ab -c 100 -n 100; the total time comes out far above 1 second.
Drogon results (the same for drogon::AsyncTask and an asyncHandleHttpRequest handler):
```
Concurrency Level:      100
Time taken for tests:   6.036 seconds
Complete requests:      100
Total transferred:      21200 bytes
Total body sent:        20000
HTML transferred:       4900 bytes
Requests per second:    16.57 [#/sec] (mean)
Time per request:       6035.700 [ms] (mean)
Time per request:       60.357 [ms] (mean, across all concurrent requests)
Transfer rate:          3.43 [Kbytes/sec] received
                        3.24 kb/s sent
                        6.67 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    6   1.9      6       9
Processing:  1010 1489 771.9   1021    5023
Waiting:     1001 1489 772.0   1020    5023
Total:       1010 1495 771.4   1029    5027

Percentage of the requests served within a certain time (ms)
  50%   1029
  66%   2020
  75%   2025
  80%   2027
  90%   3020
  95%   3025
  98%   4025
  99%   5027
 100%   5027 (longest request)
```
The same test with CPPCMS:
```
Concurrency Level:      100
Time taken for tests:   2.031 seconds
Complete requests:      100
Total transferred:      16400 bytes
Total body sent:        20000
HTML transferred:       4800 bytes
Requests per second:    49.25 [#/sec] (mean)
Time per request:       2030.568 [ms] (mean)
Time per request:       20.306 [ms] (mean, across all concurrent requests)
Transfer rate:          7.89 [Kbytes/sec] received
                        9.62 kb/s sent
                        17.51 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    3   0.5      4       4
Processing:  1003 1008   2.7   1009    1019
Waiting:     1002 1007   2.2   1008    1015
Total:       1006 1011   2.5   1012    1019

Percentage of the requests served within a certain time (ms)
  50%   1012
  66%   1013
  75%   1013
  80%   1013
  90%   1014
  95%   1014
  98%   1014
  99%   1019
 100%   1019 (longest request)
```
I am using 100 threads in the config for both setups. But Drogon really does not use all the threads if the core count is much lower; the same goes for coroutines. Btw, I found that by default CPPCMS uses 5x the core count for its thread count and also has an internal limit (~3x higher than Drogon's). 50 could also be a limit of ab itself.
@biospb To me this looks like a bad configuration; I am referring to the official documentation regarding threads here.
> using 100 threads in config for both setups.
I think this is the culprit.
Using 3 threads, for example, for both gives these results:
Drogon:
Cppcms:
Something is definitely wrong in Drogon. I will take a look at the code later.
@biospb I am just curious why you mention CPPCMS again and again. I do not want to blame that framework, but it does not look like a live project. I evaluated it some months ago and decided not to use it. It looks like the author has lost interest in his child. Let's look:
If you compare Drogon and CPPCMS, it is like comparing an Arduino and an Intel 80286. The 80286 is probably still better, but it has been dead for many years, while Arduino has already gone 32-bit. I hope Drogon gets HTTP/2 support in the nearest months, and then it will be incomparable with CPPCMS and similar frameworks.
Yes, it is almost dead, but moving from it to Drogon caused an application slowdown, so I am trying to find the culprit.
@biospb Thanks for your feedback. I think it depends on how your program is written. Please see the TFB test results: in every test, Drogon is far ahead of CPPCMS. Would you like to post the source code so we can figure it out?
@biospb If you insert sleep_for in controller code, you block the framework's main loop, and you block all controllers dependent on that thread, instead of delaying execution only in your controller. If you need to make a pause in your controller for some reason, you have to use runAfter(delaySec, function) and specify the delay time in seconds. Example:
```cpp
app().registerHandler(
    "/hellodelay",
    [](const HttpRequestPtr &,
       std::function<void(const HttpResponsePtr &)> &&callback) {
        // run after 1.0 sec without blocking the event loop
        drogon::app().getLoop()->runAfter(1.0, [callback]() {
            auto resp = HttpResponse::newHttpResponse();
            resp->setBody("delay and Hello, world");
            callback(resp);
        });
    },
    {Get});

app().registerHandler(
    "/hellosleep",
    [](const HttpRequestPtr &,
       std::function<void(const HttpResponsePtr &)> &&callback) {
        using namespace std::chrono_literals;
        // not recommended: this blocks the event loop
        // instead of pausing only this controller
        std::this_thread::sleep_for(1000ms);
        auto resp = HttpResponse::newHttpResponse();
        resp->setBody("sleep and Hello, World!");
        callback(resp);
    },
    {Get});
```
The output of a load test on the hellodelay handler (ab -c 100 -n 100):
```
Requests per second:    46.02 [#/sec] (mean)
Time per request:       2173.052 [ms] (mean)
Time per request:       21.731 [ms] (mean, across all concurrent requests)
Transfer rate:          8.04 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        5    7   1.7      6      20
Processing:  1027 1120   9.4   1121    1122
Waiting:     1022 1119   9.9   1121    1122
Total:       1047 1126   8.1   1127    1128

Percentage of the requests served within a certain time (ms)
  50%   1127
  66%   1127
  75%   1128
  80%   1128
  90%   1128
  95%   1128
  98%   1128
  99%   1128
 100%   1128 (longest request)
```
The 1127 ms includes the delay, the framework's overhead, and the granularity of the timer (it does not mean the overhead is 127 ms; the delay timer probably has a granularity of around 100 ms).
Frankly speaking, if you need to assess the delay introduced by the frameworks themselves, you have to run both of them with minimal user code; otherwise you are measuring the user code's delays.
I hope this resolves your doubts.
sleep is used here as a stand-in for something long: it could be database or storage access, a CPU-intensive task, and so on. It can be replaced with a locked mutex. In my tests I used an async handler and explicitly called co_await, with the sleep done in a coroutine; it still blocks execution. The real application performs the same way.
@biospb It is a really good question: how should CPU-intensive tasks initiated by controller code be run? I hope the Drogon team will give a recommendation. @an-tao, could you clarify this?
I use a plugin with a separate thread to run calculations. A std::future is created in the plugin's initAndStart function; the worker sleeps periodically while waiting for work, checks a shutdown flag, does as much work as it needs (without blocking any other threads; a blocking SQL call blocks only itself and does not disturb the database threads), and sleeps again. Very short code:
```cpp
#include <drogon/plugins/Plugin.h>
#include <atomic>
#include <future>

class p_Scheduler : public drogon::Plugin<p_Scheduler>
{
  public:
    p_Scheduler()
    {
    }
    /// This method is called by drogon to initialize and start the plugin.
    void initAndStart(const Json::Value &config) override;
    /// This method is called by drogon to shut down the plugin.
    void shutdown() override;

  private:
    // worker loop: sleep, check the shutdown flag, do pending work, repeat
    void RunScheduler();
    std::future<void> savedFuture{};
    std::atomic<int> runSchedulerRun{};
};

void p_Scheduler::initAndStart(const Json::Value &config)
{
    /// Initialize and start the plugin
    try
    {
        runSchedulerRun.store(1);
        savedFuture = std::async([this] { this->RunScheduler(); });
    }
    catch (std::exception &e)
    {
        LOG_ERROR << "plugin start exception " << e.what();
    }
}

void p_Scheduler::shutdown()
{
    /// Shutdown the plugin
    runSchedulerRun.store(0);
    try
    {
        savedFuture.get();
    }
    catch (const std::exception &)
    {
    }
}
```
Regarding your comment about calling sleep in a coroutine: if you call sleep(), it blocks the thread in which you called it (usually the main loop). Coroutines are not asynchronous by themselves. Coroutines are really "goto with classes": longjmp is forgotten and goto is forbidden, but sometimes they are needed. When you use co_await, co_return, or co_yield, the program effectively jumps to the place where the caller invoked the coroutine (doing some additional bookkeeping in the coroutine's methods). When the caller resumes the coroutine, it is effectively a jump back to where the last suspension occurred. If I programmed this with a real goto, I would be called a stupid programmer who does not understand simple principles; if I program it with co_await, I am modern, advanced, etc.
@VladlenPopolitov I don't quite agree. The goto statement doesn't come with thread switching, but when a coroutine resumes its execution, it is possible (though not necessary) for it to continue on another thread. When a coroutine's sleep statement is executed, it doesn't block the current thread. Thus, coroutines are indeed asynchronous. Therefore, IMHO, coroutines should be referred to as "flattened callbacks."
The asynchronous programming paradigm aims to reduce idle waiting of threads, enabling all threads to operate efficiently. This allows a small number of threads to handle a large volume of concurrent requests. Therefore, for compute-intensive tasks, asynchronous solutions do not offer significant advantages. For such tasks, executing them directly within the current IO thread is sufficient. If there's concern about impacting the response latency of simple APIs, these tasks can also be placed in a thread pool for execution. Simultaneously, proper flow control should be implemented (please refer to the Hodor plugin) to prevent excessive requests from keeping the CPU under prolonged high load.
Note that compute-intensive tasks do not include timers, database queries, Redis queries, requesting results from other services, and similar tasks. These tasks can improve throughput by yielding the current thread during IO waiting through callbacks or coroutines.
> For such tasks, executing them directly within the current IO thread is sufficient. If there's concern about impacting the response latency of simple APIs, these tasks can also be placed in a thread pool for execution.
@an-tao Would you recommend running these compute-intensive tasks using something like "trantor::ConcurrentTaskQueue"? Or what would be your recommended approach?
> Simultaneously, proper flow control should be implemented (please refer to the Hodor plugin) to prevent excessive requests from keeping the CPU under prolonged high load.
What do you mean by "proper flow control" and what is the "Hodor" plugin you mention?
Right now I'm running my intensive task (it takes between 100 and 500 ms) directly on the thread I get from Drogon. I'm wondering if there's a better way to delegate that work, or whether there are best practices for building these kinds of APIs in Drogon.
@ErikTheBerik trantor::ConcurrentTaskQueue looks like an excellent option; it is a ready-made solution. I wrote some test code. Declarations and definitions:
```cpp
#include <drogon/utils/coroutine.h>
#include <trantor/utils/ConcurrentTaskQueue.h>
#include <functional>

// define the thread pool
trantor::ConcurrentTaskQueue globalThreadPool(10, "h2load test");

// declare an awaitable for coroutines (my own definitions to make
// co_await usable from a controller coroutine)
using calcIntensive = std::function<void()>;

struct [[nodiscard]] ExecuteAwaiter : public drogon::CallbackAwaiter<void>
{
    explicit ExecuteAwaiter(calcIntensive func)
        : callAndResume_{std::move(func)}
    {
    }
    void await_suspend(std::coroutine_handle<> handle)
    {
        auto taskToQueue = [this, handle]() {
            try
            {
                this->callAndResume_();
                handle.resume();
            }
            catch (...)
            {
                setException(std::current_exception());
                handle.resume();
            }
        };
        globalThreadPool.runTaskInQueue(taskToQueue);
    }

  private:
    calcIntensive callAndResume_;
};

// helper to run a lambda on the thread pool and co_await its completion
ExecuteAwaiter executeIntensiveFunction(std::function<void()> func)
{
    return ExecuteAwaiter{std::move(func)};
}

// helper that runs an empty lambda, so after co_await the coroutine
// continues on a thread from the pool
ExecuteAwaiter switchToThreadPull()
{
    return ExecuteAwaiter{[]() {}};
}
```
Controller code: one version with a lambda doing the intensive work, and one version that switches control to another thread from the thread pool (based on the hello world example):
```cpp
Task<void> sleepHello(const HttpRequestPtr req,
                      std::function<void(const HttpResponsePtr &)> callback)
{
    // the current thread returns to the main loop;
    // the coroutine continues execution on a thread from the pool
    co_await switchToThreadPull();
    using namespace std::chrono_literals;
    // time-consuming work here
    std::this_thread::sleep_for(1000ms);
    auto resp = HttpResponse::newHttpResponse();
    resp->setBody(
        "Hi there, this is another hello from the sleepHello Controller");
    callback(resp);
    co_return;
}

Task<void> sleep2Hello(const HttpRequestPtr req,
                       std::function<void(const HttpResponsePtr &)> callback)
{
    int someVariable{};
    // the current thread returns to the main loop;
    // the coroutine waits for the lambda to run in the thread pool
    // and then continues on that same pool thread
    co_await executeIntensiveFunction([someVariable]() {
        using namespace std::chrono_literals;
        std::this_thread::sleep_for(1000ms);
    });
    auto resp = HttpResponse::newHttpResponse();
    resp->setBody(
        "Hi there, this is another hello from the sleep2Hello Controller");
    callback(resp);
    co_return;
}
```
@biospb Load measurements are:
```
Requests per second:    4.78 [#/sec] (mean)
Time per request:       2093.243 [ms] (mean)
Time per request:       209.324 [ms] (mean, across all concurrent requests)
Transfer rate:          1.03 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        3    4   0.6      4       5
Processing:  1017 1022  15.6   1017    1067
Waiting:     1017 1022  15.4   1017    1066
Total:       1021 1027  15.1   1021    1069

Percentage of the requests served within a certain time (ms)
  50%   1021
  66%   1022
  75%   1024
  80%   1024
  90%   1069
  95%   1069
  98%   1069
  99%   1069
 100%   1069 (longest request)
```
Almost all of this code comes from Drogon; I based the awaiter on Drogon code as well.
> For such tasks, executing them directly within the current IO thread is sufficient. If there's concern about impacting the response latency of simple APIs, these tasks can also be placed in a thread pool for execution.

> @an-tao Would you recommend running these compute-intensive tasks using something like "trantor::ConcurrentTaskQueue"? Or what would be your recommended approach?
Yes, ConcurrentTaskQueue is a thread pool implementation.
> Simultaneously, proper flow control should be implemented (please refer to the Hodor plugin) to prevent excessive requests from keeping the CPU under prolonged high load.

> What do you mean by "proper flow control" and what is the "Hodor" plugin you mention?
Please refer to the Hodor code in Drogon.
> Right now I'm running my intensive task (takes between 100 and 500ms to run) directly on the thread I get from Drogon. I'm wondering if there's a better way to delegate that work or if there's a "best practices" for making these kind of APIs in Drogon
If your server just runs these tasks, that's fine: it's simple and straightforward and has good performance guarantees. In fact, I have a machine learning algorithm service that works the same way.
Closing this issue since my question has been answered. Thank you very much for your help :)
I tried googling this and searching for answers in past issues, but I couldn't really find my answer, even though some issues were somewhat similar. Also, I am very new to C++ async, futures, etc.; I have only ever used std::thread, so this might be a really dumb problem I'm having.
I have a Drogon server running with 16 threads; my CPU is an i9 with 8 cores (MacBook Pro). I have a route "/test" that runs a very slow function and then returns "OK". For now the function is just:
```cpp
std::this_thread::sleep_for(std::chrono::seconds(10));
```
Now, if I request this route 16 times, the following happens:
1st request: 10 seconds (as expected)
2nd request: 20 seconds (seems like the first request is blocking the next requests?)
3rd request: 30 seconds (same as above)
4th request: 40 seconds (same as above)
5th-10th request: 50 seconds (suddenly, after 4 requests, 6 requests run at the same time)
11th-16th request: 60 seconds (again, 6 requests at a time)
I don't understand why the initial 4 requests run one at a time. Also, once the requests do run simultaneously, why is it only running 6 requests at a time when it's supposedly using 16 threads (albeit with just 8 cores)?
I'm not sure if I'm supposed to run the "really slow function" on a new thread, using std::async with std::future, or something else.