aws / aws-sdk-cpp

AWS SDK for C++

cURL error on concurrent Lambda invocations #2991

Open mcopik opened 2 weeks ago

mcopik commented 2 weeks ago

Describe the bug

I wrote a client that uses the Lambda component to invoke functions asynchronously. I created my own callback, and it seemed to work well. However, when I scaled up to 512 concurrent invocations, performance dropped quickly, whereas I expected close-to-linear scaling. Investigating further, I found that the client configuration defaults to a maximum of 25 TCP connections. After raising this parameter to 520, scaling resumed, although performance became very unpredictable. Unfortunately, at 256 concurrent invocations we now get an error inside the SDK.

I do not believe this is an issue with Lambda itself. Our custom Lambda invoker, based on an HTTP/2 client, can scale to a much higher number of concurrent invocations without any issues.

Expected Behavior

The Lambda client from the SDK should scale up to the configured limit of concurrent connections without (a) errors and (b) performance degradation. With the default limit of 25 connections, the SDK can handle even 512 concurrent invocations - it is just very slow.

Current Behavior

I observed the following error on the client. I don't see any failed Lambda invocations in AWS metrics.

Error with Lambda::InvokeRequest. curlCode: 6, Couldn't resolve host name

Reproduction Steps

This is the main invocation code. np is the number of invocations, which in this case is 256. lambda_name refers to the name of my function.

    Aws::SDKOptions options;
    Aws::InitAPI(options);

    Aws::Client::ClientConfiguration clientConfig;
    clientConfig.maxConnections = 520;
    Aws::Lambda::LambdaClient client(clientConfig);
    int id = 0;
    for (int i = 0; i < np; i++) {

        Aws::Lambda::Model::InvokeRequest request;
        request.SetFunctionName(lambda_name);
        Aws::Utils::Json::JsonValue jsonPayload;
        jsonPayload.WithInt64("iterations", n / np);

        std::shared_ptr<Aws::Client::AsyncCallerContext> context =
                Aws::MakeShared<Aws::Client::AsyncCallerContext>("tag");
        context->SetUUID(std::to_string(id++).c_str());

        std::shared_ptr<Aws::IOStream> payload = Aws::MakeShared<Aws::StringStream>(
                "FunctionTest");
        *payload << jsonPayload.View().WriteReadable();
        request.SetBody(payload);
        request.SetContentType("application/json");

        client.InvokeAsync(request, handler, context);
    }

The handler function fails immediately at this step:

void handler(
    const Aws::Lambda::LambdaClient*, const Aws::Lambda::Model::InvokeRequest&,
    Aws::Lambda::Model::InvokeOutcome outcome, const std::shared_ptr<const Aws::Client::AsyncCallerContext>& ctx
)
{
  if (!outcome.IsSuccess()) {
    std::cerr << "Error with Lambda::InvokeRequest. "
                << outcome.GetError().GetMessage()
                << std::endl;
    exit(1);
  }
}

Possible Solution

No response

Additional Information/Context

No response

AWS CPP SDK version used

Current git master, commit f067d450a8689f3ae05fbcd96039cdd9f2d0276c

Compiler and Version used

Clang 15 (custom fork)

Operating System and version

Ubuntu 22.04

SergeyRyabinin commented 2 weeks ago

Hi @mcopik ,

Thanks a lot for submitting this issue.

Why it happens

Our current async execution model is quite bad: it simply takes regular sync blocking calls and wraps them into execution on a separate thread using a thread executor, roughly like this:

void Lambda::InvokeAsync(...)
{
  threadExecutor->Submit(std::function<...>(Lambda::Invoke(...)));
}

The default thread executor is going to spawn a separate thread for each submitted async operation. There is a slightly better PooledThreadExecutor that reuses a fixed set of threads, avoiding spawning 520 threads, one per operation call.

Another big problem with our current async model is that we use sync/blocking HTTP clients, such as WinHTTP in sync mode or libcurl with curl_easy_handle, so the SDK cannot send out the HTTP request and let the thread execute other code while waiting for the response.

Therefore, when you submit 520 async requests, the SDK is going to spawn 520 threads and 520 curl_easy_handles, each creating its own HTTP connection (including the TLS session).

Is there any mitigation

I'd suggest using PooledThreadExecutor; it won't improve overall throughput, but it will reduce the number of threads being spawned.
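
For illustration, a minimal sketch of what that could look like on the configuration side (the pool size of 32 and the "lambda-invoker" allocation tag are just placeholders, not recommendations):

    #include <aws/core/Aws.h>
    #include <aws/core/client/ClientConfiguration.h>
    #include <aws/core/utils/threading/Executor.h>
    #include <aws/lambda/LambdaClient.h>

    Aws::Client::ClientConfiguration clientConfig;
    clientConfig.maxConnections = 520;
    // Reuse a fixed pool of worker threads for all InvokeAsync calls
    // instead of spawning one thread per submitted request.
    clientConfig.executor = Aws::MakeShared<Aws::Utils::Threading::PooledThreadExecutor>(
        "lambda-invoker", 32);
    Aws::Lambda::LambdaClient client(clientConfig);

The requests still block a pool thread each while in flight, which is why this helps with thread count but not with overall throughput.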

Long-term plan

We plan to improve/refactor our async model to use proper asynchronous request handling backed by an async HTTP client, such as curl_multi_handle or the AWS CRT HTTP client in async mode. Right now this work lives on this branch: https://github.com/aws/aws-sdk-cpp/tree/sr/curlMulti2 Unfortunately, we can't give any ETA here.

I'll mark this issue as a feature request. Please let us know if you have any other questions about the SDK.

Best regards, Sergey