Between versions 1.11.159 and 1.11.305, the GetObjectAsync method of S3CrtClient became very slow.

Describe the bug

After upgrading the AWS SDK for C++, performance issues occurred. To confirm the behavior, I created a simple program that PUTs 500 objects of 32MB each using S3CrtClient's PutObjectAsync and then GETs them using GetObjectAsync.

As a result, with version 1.11.159, the GET operation for the 500 objects completed in approximately 11 seconds. However, with version 1.11.305, even after 2 minutes, the operation did not finish.

Checking the CPU usage showed that the CPU was barely being utilized. I therefore suspect that one or more of the changes between versions 1.11.159 and 1.11.305 have degraded GetObjectAsync.

Expected Behavior

The performance of GetObjectAsync should not degrade between SDK version updates.

Current Behavior

When retrieving a large number of objects using GetObjectAsync, the CPU is not utilized efficiently, resulting in significantly longer execution times.

Reproduction Steps

This occurs when PUTting 500 objects of 32MB each using S3CrtClient's PutObjectAsync and then GETting them using GetObjectAsync.

Possible Solution

No response

Additional Information/Context

No response

AWS CPP SDK version used

1.11.305

Compiler and Version used

11.4.1 20230605 (Red Hat 11.4.1-2)

Operating System and version

RHEL9.3

Can you please provide more detailed reproduction steps and include the code sample you are running for your tests?

Thank you for your comment. The test code used is as follows. Replace "YOUR_REGION_NAME" and "YOUR_BUCKET_NAME" with the values appropriate for your environment.

#include <aws/core/Aws.h>
#include <aws/core/utils/HashingUtils.h>
#include <aws/core/utils/stream/PreallocatedStreamBuf.h>
#include <aws/s3-crt/S3CrtClient.h>
#include <aws/s3-crt/model/DeleteObjectsRequest.h>
#include <aws/s3-crt/model/GetObjectRequest.h>
#include <aws/s3-crt/model/PutObjectRequest.h>

#include <chrono>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <iostream>
#include <memory>
#include <thread>

static const char ALLOCATION_TAG[] = "s3-crt-demo";

#define TARGETGBPS (10)
#define PARTSIZE (999)
#define EVENTLOOPTHREAD (8)
#define DOWNLOADNUM (500)
#define OBJSIZE (32)

typedef struct
{
  bool isComplete;
  Aws::S3Crt::Model::PutObjectOutcome outcome;
  Aws::S3Crt::PutObjectResponseReceivedHandler handler;
} putObjectContext;

typedef struct
{
  bool isComplete;
  Aws::S3Crt::Model::GetObjectOutcome outcome;
  Aws::S3Crt::GetObjectResponseReceivedHandler handler;
} getObjectContext;

class OriginalIOStream : public Aws::IOStream
{
 public:
  using Base = Aws::IOStream;
  OriginalIOStream(std::streambuf *buf) : Base(buf) {}

  virtual ~OriginalIOStream() = default;
};

putObjectContext *PutAsync(const Aws::S3Crt::S3CrtClient &s3CrtClient,
                           const Aws::S3Crt::Model::PutObjectRequest &putRequest)
{
  putObjectContext *cntx = Aws::New<putObjectContext>(ALLOCATION_TAG);
  cntx->isComplete = false;

  cntx->handler = Aws::S3Crt::PutObjectResponseReceivedHandler{
      [=](const Aws::S3Crt::S3CrtClient *,
          const Aws::S3Crt::Model::PutObjectRequest &,
          Aws::S3Crt::Model::PutObjectOutcome outcome,
          const std::shared_ptr<const Aws::Client::AsyncCallerContext> &) {
        cntx->outcome = std::move(outcome);
        cntx->isComplete = true;
      }};

  s3CrtClient.PutObjectAsync(putRequest, cntx->handler, nullptr);

  return cntx;
}

getObjectContext *GetAsync(const Aws::S3Crt::S3CrtClient &s3CrtClient,
                           const Aws::S3Crt::Model::GetObjectRequest &getRequest)
{
  getObjectContext *cntx = Aws::New<getObjectContext>(ALLOCATION_TAG);
  cntx->isComplete = false;

  cntx->handler = Aws::S3Crt::GetObjectResponseReceivedHandler{
      [=](const Aws::S3Crt::S3CrtClient *,
          const Aws::S3Crt::Model::GetObjectRequest &,
          Aws::S3Crt::Model::GetObjectOutcome outcome,
          const std::shared_ptr<const Aws::Client::AsyncCallerContext> &) {
        cntx->outcome = std::move(outcome);
        cntx->isComplete = true;
      }};

  s3CrtClient.GetObjectAsync(getRequest, cntx->handler, nullptr);

  return cntx;
}

void SetBufferToGetObject(Aws::S3Crt::Model::GetObjectRequest &getRequest,
                          unsigned char *buffer,
                          long long int size)
{
  getRequest.SetResponseStreamFactory([=]() {
    return Aws::New<OriginalIOStream>(
        ALLOCATION_TAG,
        Aws::New<Aws::Utils::Stream::PreallocatedStreamBuf>(ALLOCATION_TAG, buffer, size));
  });
  return;
}

int main(void)
{
  Aws::SDKOptions options;
  options.ioOptions.clientBootstrap_create_fn = []() {
    Aws::Crt::Io::EventLoopGroup eventLoopGroup(EVENTLOOPTHREAD /* threadCount */);
    Aws::Crt::Io::DefaultHostResolver defaultHostResolver(eventLoopGroup,
                                                          8 /* maxHosts */,
                                                          300 /* maxTTL */);
    auto clientBootstrap = Aws::MakeShared<Aws::Crt::Io::ClientBootstrap>(ALLOCATION_TAG,
                                                                          eventLoopGroup,
                                                                          defaultHostResolver);
    clientBootstrap->EnableBlockingShutdown();
    return clientBootstrap;
  };

  Aws::InitAPI(options);
  {
    // TODO: Set to your account AWS Region.
    Aws::String region = Aws::Region::YOUR_REGION_NAME;

    std::cout << "Region : " << region << std::endl;

    const double throughput_target_gbps = TARGETGBPS;
    const uint64_t part_size = PARTSIZE * 1024 * 1024;

    Aws::S3Crt::ClientConfiguration config;
    config.region = region;
    config.throughputTargetGbps = throughput_target_gbps;
    config.partSize = part_size;
    // config.scheme = Aws::Http::Scheme::HTTP;

    Aws::S3Crt::S3CrtClient s3CrtClient(config);

    Aws::String bucketName = "YOUR_BUCKET_NAME";
    // Aws::String keyName = "put-get-object";
    // Aws::String fileName = "put-get-file";

    /*** PUT OBJECT ***/
    {
      /* Create the requests */
      std::shared_ptr<Aws::S3Crt::Model::PutObjectRequest> putRequest[DOWNLOADNUM];
      for (int i = 0; i < DOWNLOADNUM; i++)
      {
        putRequest[i] = Aws::MakeShared<Aws::S3Crt::Model::PutObjectRequest>(ALLOCATION_TAG);
        char key[40];
        char file[40];
        sprintf(key, "put-get-object_%d", i % DOWNLOADNUM);
        sprintf(file, "put-get-file_%d", i % DOWNLOADNUM);
        putRequest[i]->SetBucket(bucketName);
        putRequest[i]->SetKey(key);

        // std::shared_ptr<Aws::IOStream> bodyStream = Aws::MakeShared<Aws::FStream>(ALLOCATION_TAG,
        // file, std::ios_base::in | std::ios_base::binary); auto bodyStream =
        // Aws::MakeShared<Aws::FStream>(ALLOCATION_TAG);

        std::ifstream tmpifs(file, std::ios_base::in | std::ios_base::binary);
        auto buffer = new unsigned char[OBJSIZE * 1024 * 1024];
        tmpifs.read((char *)buffer, OBJSIZE * 1024 * 1024);
        std::shared_ptr<Aws::IOStream> bodyStream = Aws::MakeShared<OriginalIOStream>(
            ALLOCATION_TAG,
            Aws::New<Aws::Utils::Stream::PreallocatedStreamBuf>(ALLOCATION_TAG,
                                                                buffer,
                                                                OBJSIZE * 1024 * 1024));

        putRequest[i]->SetBody(bodyStream);

        putRequest[i]->SetChecksumCRC32(Aws::Utils::HashingUtils::Base64Encode(
            Aws::Utils::HashingUtils::CalculateCRC32(*(putRequest[i]->GetBody()))));
        putRequest[i]->SetChecksumAlgorithm(Aws::S3Crt::Model::ChecksumAlgorithm::CRC32);
      }

      putObjectContext *putContext[DOWNLOADNUM];

      std::cout << "Prepare PutObjectAsync done." << std::endl;
      std::this_thread::sleep_for(std::chrono::seconds(2));

      /* Execute */

      std::cout << "Start PutObjectAsync" << std::endl;
      std::chrono::system_clock::time_point start = std::chrono::system_clock::now();
      for (int i = 0; i < DOWNLOADNUM; i++)
      {
        putContext[i] = PutAsync(s3CrtClient, *putRequest[i]);
      }

      for (int i = 0; i < DOWNLOADNUM; i++)
      {
        while (putContext[i]->isComplete == false)
        {
          std::this_thread::sleep_for(std::chrono::microseconds(1000));
        }

        if (putContext[i]->outcome.IsSuccess())
        {
          std::cout << ".";
        }
        else
        {
          std::cout << "Failed to putObject" << std::endl;
          std::cout << putContext[i]->outcome.GetError() << std::endl;
          return 1;  // non-zero exit code signals failure (false would convert to 0)
        }
      }
      std::chrono::system_clock::time_point end = std::chrono::system_clock::now();
      std::cout << "DONE" << std::endl;

      double time =
          std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() / 1000000.0;
      std::cout << "PutObject Time : " << time << std::endl;
    }

    std::this_thread::sleep_for(std::chrono::seconds(2));

    /*** GET OBJECT ***/
    {
      /* Create the buffers and requests */
      unsigned char *buffer;

      std::shared_ptr<Aws::S3Crt::Model::GetObjectRequest> getRequest[DOWNLOADNUM];
      for (int i = 0; i < DOWNLOADNUM; i++)
      {
        getRequest[i] = Aws::MakeShared<Aws::S3Crt::Model::GetObjectRequest>(ALLOCATION_TAG);
        char key[40];
        sprintf(key, "put-get-object_%d", i % DOWNLOADNUM);

        buffer = new unsigned char[OBJSIZE * 1024 * 1024];
        /* Fill the buffer with junk beforehand */
        memset(buffer, 0x5a, OBJSIZE * 1024 * 1024);
        memset(buffer, 0xa5, OBJSIZE * 1024 * 1024);

        getRequest[i]->SetBucket(bucketName);
        getRequest[i]->SetKey(key);

        SetBufferToGetObject(*getRequest[i], buffer, OBJSIZE * 1024 * 1024);
      }

      getObjectContext *getContext[DOWNLOADNUM];

      std::cout << "Prepare GetObjectAsync done." << std::endl;
      std::this_thread::sleep_for(std::chrono::seconds(2));

      /* Execute */

      std::cout << "Start GetObjectAsync" << std::endl;
      std::chrono::system_clock::time_point start = std::chrono::system_clock::now();
      for (int i = 0; i < DOWNLOADNUM; i++)
      {
        getContext[i] = GetAsync(s3CrtClient, *getRequest[i]);
      }

      for (int i = 0; i < DOWNLOADNUM; i++)
      {
        while (getContext[i]->isComplete == false)
        {
          std::this_thread::sleep_for(std::chrono::microseconds(1000));
        }

        if (getContext[i]->outcome.IsSuccess())
        {
          std::cout << ".";
        }
        else
        {
          std::cout << "Failed to getObject" << std::endl;
          return 1;  // non-zero exit code signals failure (false would convert to 0)
        }
      }
      std::chrono::system_clock::time_point end = std::chrono::system_clock::now();
      std::cout << "DONE" << std::endl;

      double time =
          std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() / 1000000.0;
      std::cout << "GetObject Time : " << time << std::endl;
    }

    /*** DELETE OBJECTS ***/
    {
      Aws::S3Crt::Model::DeleteObjectsRequest delRequest;
      Aws::S3Crt::Model::Delete del;

      /* Create the request */
      for (int i = 0; i < DOWNLOADNUM; i++)
      {
        Aws::S3Crt::Model::ObjectIdentifier objectId;
        char key[40];
        sprintf(key, "put-get-object_%d", i % DOWNLOADNUM);
        objectId.SetKey(key);
        del.AddObjects(objectId);
      }
      delRequest.SetBucket(bucketName);
      delRequest.SetDelete(del);

      /* Execute (should be fast, so run synchronously) */
      std::cout << "Start DeleteObjects" << std::endl;
      std::chrono::system_clock::time_point start = std::chrono::system_clock::now();

      auto delOutcome = s3CrtClient.DeleteObjects(delRequest);
      if (!delOutcome.IsSuccess())
      {
        std::cout << "Failed to deleteObjects" << std::endl;
        return 1;  // non-zero exit code signals failure (false would convert to 0)
      }

      std::chrono::system_clock::time_point end = std::chrono::system_clock::now();
      std::cout << "DONE" << std::endl;

      double time =
          std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() / 1000000.0;
      std::cout << "DeleteObjects Time : " << time << std::endl;
    }
  }
  Aws::ShutdownAPI(options);
  return 0;
}

This is an issue with the CRT memory limiter feature, which was added around re:Invent and which the C++ SDK picked up sometime after that. I created https://github.com/awslabs/aws-c-s3/issues/425 to follow up on it on the CRT side.

In short, you are configuring the part size to be 999 MB, but all the objects you are dealing with are 32 MB in size. CRT does not know how big the result of a GET will be, so it uses heuristics to predict how much memory will be needed to buffer the GETs. Part size is one input to that heuristic, and in this case a part size of 999 MB causes CRT to be overly cautious and schedule only a couple of requests at a time to avoid exhausting memory.

Lowering the part size should improve throughput. Also note that with a part size much higher than the object size, you will not see much benefit from CRT, as it will revert to making one request per object. We recommend keeping the part size in the 8 MB to 16 MB range.
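
The effect described above can be illustrated with a deliberately simplified model. This is not CRT's actual algorithm: the 2 GiB memory budget, the rule that every in-flight GET reserves one part-sized buffer, and the names `MaxInFlightRequests` and `kAssumedBudget` are all assumptions made up for this sketch, chosen only to show why a very large part size can starve concurrency.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kMiB = 1024ULL * 1024ULL;

// Assumed memory budget, for illustration only (not a real CRT default).
constexpr std::uint64_t kAssumedBudget = 2048 * kMiB;

// Hypothetical model: because the object size is unknown up front, each
// in-flight GET is assumed to reserve one buffer of partSizeBytes.
// The number of GETs that can run concurrently is then budget / part size.
constexpr std::uint64_t MaxInFlightRequests(std::uint64_t memoryBudgetBytes,
                                            std::uint64_t partSizeBytes) {
  return partSizeBytes == 0 ? 0 : memoryBudgetBytes / partSizeBytes;
}
```

Under these assumptions, `MaxInFlightRequests(kAssumedBudget, 999 * kMiB)` allows only 2 requests in flight, while a 32 MiB part size allows 64 and an 8 MiB part size allows 256.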

Thank you for your explanation. It used to work, but I now understand that setting the "part size" to 999MB in my code was the problem.

When I changed the "part size" to 32MB, the GET completed in the same time as before. Thank you!

> Lowering part size should improve throughput. Also note that with part size much higher than object size, you would not see a lot of benefit from crt as it will revert to making one request per object. We recommend keeping part size in the 8mb-16mb range

I thought that a "part size" smaller than the object size would be undesirable because it increases the number of requests. When I tried changing the "part size" to 16MB in my code, the PUT failed with an "INTERNAL_FAILURE" error, so simply changing the "part size" is not enough. I'll look into it. Thank you for this too.

Do you have any specific concerns with CRT using multiple connections when the part size is configured lower? CRT is optimized toward getting higher throughput, and the recommended way to achieve that with S3 is to parallelize requests across multiple connections. The downside of that approach is that S3 charges for each individual request. We currently do not have a way to turn off automatic splitting for GETs; that is a potential feature we can consider.

Do you have any logs for the INTERNAL_FAILURE (setting the log level to trace should dump the relevant info for CRT as well)? CRT sets the part size to 8 MB by default, and looking at your code I don't see anything that might cause an internal error.
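
The request-count trade-off mentioned above can be estimated with simple ceiling division. This is a rough sketch, not a description of what CRT does internally: it assumes one request per part and ignores any extra requests the client might issue, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kMiB = 1024ULL * 1024ULL;

// Requests needed to cover one object when it is split into parts of at
// most partSize bytes (ceiling division). Assumes one request per part.
constexpr std::uint64_t RequestsPerObject(std::uint64_t objectSize,
                                          std::uint64_t partSize) {
  return partSize == 0 ? 0 : (objectSize + partSize - 1) / partSize;
}

// Total requests for objectCount objects of identical size.
constexpr std::uint64_t TotalRequests(std::uint64_t objectCount,
                                      std::uint64_t objectSize,
                                      std::uint64_t partSize) {
  return objectCount * RequestsPerObject(objectSize, partSize);
}
```

For the 500 objects of 32 MiB used in this thread, the estimate is 2000 requests with 8 MiB parts, 1000 with 16 MiB parts, and 500 once the part size is 32 MiB or larger.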

After outputting and checking the logs, I found that the cause was that my program was pre-calculating the checksum. Code:

        putRequest[i]->SetChecksumCRC32(Aws::Utils::HashingUtils::Base64Encode(
            Aws::Utils::HashingUtils::CalculateCRC32(*(putRequest[i]->GetBody()))));

Log:

[ERROR] 2024-04-16 05:05:47.722 S3MetaRequest [139891380028160] Could not create auto-ranged-put meta request; checksum headers has been set for auto-ranged-put that will be split. Pre-calculated checksums are only supported for single part upload.
[ERROR] 2024-04-16 05:05:47.722 S3Client [139891380028160] id=0x1e516e0: Could not create new meta request.

There were no issues when the checksum specification was omitted. Thank you.
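
For reference, the kind of value that the snippet above passes to SetChecksumCRC32 can be sketched without the SDK. This is a minimal illustration, assuming the expected format is the 4-byte CRC-32 in big-endian byte order, Base64-encoded; the helper names (`Crc32`, `Base64Encode`, `Crc32ChecksumValue`) are hypothetical, and the bitwise CRC loop is written for clarity rather than speed. It also shows why such a value only fits a single-part upload: it is computed over the whole body at once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Bitwise CRC-32 with the reflected polynomial 0xEDB88320 (zlib/PNG variant).
inline std::uint32_t Crc32(const std::string &data) {
  std::uint32_t crc = 0xFFFFFFFFu;
  for (unsigned char byte : data) {
    crc ^= byte;
    for (int bit = 0; bit < 8; ++bit) {
      // Mask is all ones when the low bit is set, zero otherwise.
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xFFFFFFFFu;
}

// Standard Base64 with '=' padding.
inline std::string Base64Encode(const unsigned char *bytes, std::size_t len) {
  static const char kAlphabet[] =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  std::string out;
  for (std::size_t i = 0; i < len; i += 3) {
    std::uint32_t chunk = static_cast<std::uint32_t>(bytes[i]) << 16;
    if (i + 1 < len) chunk |= static_cast<std::uint32_t>(bytes[i + 1]) << 8;
    if (i + 2 < len) chunk |= bytes[i + 2];
    out += kAlphabet[(chunk >> 18) & 0x3F];
    out += kAlphabet[(chunk >> 12) & 0x3F];
    out += (i + 1 < len) ? kAlphabet[(chunk >> 6) & 0x3F] : '=';
    out += (i + 2 < len) ? kAlphabet[chunk & 0x3F] : '=';
  }
  return out;
}

// CRC-32 of the whole body, serialized big-endian, then Base64-encoded.
inline std::string Crc32ChecksumValue(const std::string &body) {
  const std::uint32_t crc = Crc32(body);
  const unsigned char bigEndian[4] = {
      static_cast<unsigned char>(crc >> 24), static_cast<unsigned char>(crc >> 16),
      static_cast<unsigned char>(crc >> 8), static_cast<unsigned char>(crc)};
  return Base64Encode(bigEndian, 4);
}
```

As a sanity check, `Crc32("123456789")` yields the standard check value 0xCBF43926, and `Crc32ChecksumValue("123456789")` returns "y/Q5Jg==".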

> Do you have any specific concerns with CRT using multiple connections when part size is configured lower. CRT is optimized toward getting higher throughput and the recommended way to achieve that with s3 is to parallelize requests across multiple connections. The downside of that approach is that s3 charges per each individual request. We currently do not have a way to turn off automated splitting for Gets and thats a potential feature we can consider.

If it improves performance, there are no concerns with CRT using multiple connections. However, as indicated below, the execution time is approximately doubled when skipping the pre-calculation of checksum, which is not acceptable.

Checksum	Part size	Object size	PUT time (500 objects, asynchronous)
Pre-calculated CRC32 checksum (as in the provided code)	32 MB	32 MB	11.596 s
Not specified (presumably calculated using MD5)	32 MB	32 MB	26.244 s
Not specified (presumably calculated using MD5)	8 MB	32 MB	27.467 s

Hey @MaZvnGpGm9, I spent some time digging into this, wrote a benchmark based on your code, and came up with some insights into what you are seeing. For reference, here is the benchmark code I was running (note that I was using google/benchmark to run the tests):

main.cpp:

#include <benchmark/benchmark.h>
#include <aws/core/Aws.h>
#include <aws/core/utils/HashingUtils.h>
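#include <cassert>
#include <condition_variable>
#include <fstream>
#include <iostream>
#include <memory>
#include <mutex>
#include <string>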
#include <aws/s3-crt/S3CrtClient.h>
#include <aws/s3-crt/model/PutObjectRequest.h>

using namespace Aws;
using namespace Aws::Utils;
using namespace Aws::S3Crt;
using namespace Aws::S3Crt::Model;

const char *ALLOCATION_TAG = "checksum_benchmark";
const char *BUCKET = "your_bucket";
const char *KEY = "your_key";
constexpr  int REQEUSTS_TO_MAKE = 10;

static SDKOptions s_options;

static void DoSetup(const benchmark::State &state)
{
    InitAPI(s_options);
}

static void DoTeardown(const benchmark::State &state)
{
    ShutdownAPI(s_options);
}

static std::shared_ptr<S3CrtClient> CreateClient()
{
    S3Crt::ClientConfiguration config;
    config.throughputTargetGbps = 10;
    config.partSize = 32 * 1024 * 1024;
    return Aws::MakeShared<S3CrtClient>(ALLOCATION_TAG, config);
}

// read a file from a dir named something like "32mb"
static std::shared_ptr<IOStream> CreateStream(size_t fileSize)
{
    return Aws::MakeShared<FStream>(ALLOCATION_TAG,
        "/path/to/test/file/" + std::to_string(fileSize) + "mb",
        std::ios_base::in);
}

static PutObjectRequest CreateRequest(size_t fileSize)
{
    auto stream = CreateStream(fileSize);
    auto request = PutObjectRequest().WithBucket(BUCKET).WithKey(KEY);
    request.SetBody(stream);
    return request;
}

static void StartAsyncRequest(const std::shared_ptr<S3CrtClient>& client,
    int& totalReqs,
    std::condition_variable& cv,
    std::mutex& requestMutex,
    const PutObjectRequest& request)
{
    client->PutObjectAsync(request,
    [&totalReqs, &requestMutex, &cv](const S3CrtClient*,
                                     const Model::PutObjectRequest&,
                                     const Model::PutObjectOutcome& response,
                                     const std::shared_ptr<const Aws::Client::AsyncCallerContext>&) -> void {
        std::unique_lock lock(requestMutex);
        assert(response.IsSuccess());
        if (!response.IsSuccess()) {
            std::cerr << "benchmark saw error: " << response.GetError().GetMessage() << "\n";
        } else {
            // Copy into a non-const object to keep the compiler from optimizing it away
            auto mutResp = response;
            benchmark::DoNotOptimize(mutResp);
        }
        totalReqs--;
        cv.notify_all();
    });
}

static void BM_MD5Checksum(benchmark::State &state) {
    auto stream = CreateStream(state.range(0));
    for (auto _: state) {
        auto hash = HashingUtils::CalculateMD5(*stream);
        benchmark::DoNotOptimize(hash);
    }
}

static void BM_CRC32Checksum(benchmark::State &state) {
    auto stream = CreateStream(state.range(0));
    for (auto _: state) {
        auto hash = HashingUtils::CalculateCRC32(*stream);
        benchmark::DoNotOptimize(hash);
    }
}

static void BM_S3PutObjectWithoutPrecalcChecksum(benchmark::State &state) {
    const auto client = CreateClient();
    for (auto _: state) {
        auto request = CreateRequest(state.range(0));
        auto response = client->PutObject(request);
        assert(response.IsSuccess());
        if (!response.IsSuccess()) {
            std::cerr << "benchmark saw error: " << response.GetError().GetMessage() << "\n";
        }
        benchmark::DoNotOptimize(response);
    }
}

static void BM_S3PutObjectWithPrecalcChecksum(benchmark::State &state) {
    const auto client = CreateClient();
    for (auto _: state) {
        auto request = CreateRequest(state.range(0));
        request.SetChecksumCRC32(HashingUtils::Base64Encode(HashingUtils::CalculateCRC32(*(request.GetBody()))));
        auto response = client->PutObject(request);
        assert(response.IsSuccess());
        if (!response.IsSuccess()) {
            std::cerr << "benchmark saw error: " << response.GetError().GetMessage() << "\n";
        }
        benchmark::DoNotOptimize(response);
    }
}

static void BM_S3PutObjectAsyncWithoutPrecalcChecksum(benchmark::State &state) {
    const auto client = CreateClient();
    for (auto _: state) {
        int totalReqs = REQEUSTS_TO_MAKE;
        std::condition_variable cv;
        std::mutex requestMutex;
        for (int i = 0; i < totalReqs; ++i) {
            auto request = CreateRequest(state.range(0));
            StartAsyncRequest(client, totalReqs, cv, requestMutex, request);
        }
        std::unique_lock lock(requestMutex);
        cv.wait(lock, [&totalReqs]() -> bool {
            return  totalReqs == 0;
        });
    }
}

static void BM_S3PutObjectAsyncWithPrecalcChecksum(benchmark::State &state) {
    const auto client = CreateClient();
    for (auto _: state) {
        int totalReqs = REQEUSTS_TO_MAKE;
        std::condition_variable cv;
        std::mutex requestMutex;
        for (int i = 0; i < totalReqs; ++i) {
            auto request = CreateRequest(state.range(0));
            request.SetChecksumCRC32(HashingUtils::Base64Encode(HashingUtils::CalculateCRC32(*(request.GetBody()))));
            StartAsyncRequest(client, totalReqs, cv, requestMutex, request);
        }
        std::unique_lock lock(requestMutex);
        cv.wait(lock, [&totalReqs]() -> bool {
            return  totalReqs == 0;
        });
    }
}

BENCHMARK(BM_MD5Checksum)
        ->Setup(DoSetup)
        ->ArgName("File Size in MB")
        ->Arg(32)
        ->Iterations(10)
        ->Unit(benchmark::kMillisecond)
        ->Teardown(DoTeardown)
        ->MeasureProcessCPUTime()
        ->UseRealTime();

BENCHMARK(BM_CRC32Checksum)
        ->Setup(DoSetup)
        ->ArgName("File Size in MB")
        ->Arg(32)
        ->Iterations(10)
        ->Unit(benchmark::kMillisecond)
        ->Teardown(DoTeardown)
        ->MeasureProcessCPUTime()
        ->UseRealTime();

BENCHMARK(BM_S3PutObjectWithoutPrecalcChecksum)
        ->Setup(DoSetup)
        ->ArgName("File Size in MB")
        ->Arg(32)
        ->Iterations(10)
        ->Unit(benchmark::kSecond)
        ->Teardown(DoTeardown)
        ->MeasureProcessCPUTime()
        ->UseRealTime();

BENCHMARK(BM_S3PutObjectWithPrecalcChecksum)
        ->Setup(DoSetup)
        ->ArgName("File Size in MB")
        ->Arg(32)
        ->Iterations(10)
        ->Unit(benchmark::kSecond)
        ->Teardown(DoTeardown)
        ->MeasureProcessCPUTime()
        ->UseRealTime();

BENCHMARK(BM_S3PutObjectAsyncWithoutPrecalcChecksum)
        ->Setup(DoSetup)
        ->ArgName("File Size in MB")
        ->Arg(32)
        ->Iterations(5)
        ->Unit(benchmark::kSecond)
        ->Teardown(DoTeardown)
        ->MeasureProcessCPUTime()
        ->UseRealTime();

BENCHMARK(BM_S3PutObjectAsyncWithPrecalcChecksum)
        ->Setup(DoSetup)
        ->ArgName("File Size in MB")
        ->Arg(32)
        ->Iterations(5)
        ->Unit(benchmark::kSecond)
        ->Teardown(DoTeardown)
        ->MeasureProcessCPUTime()
        ->UseRealTime();

BENCHMARK_MAIN();

CMakeLists.txt:

cmake_minimum_required(VERSION 3.13)
project(sdk_benchmark)
set(CMAKE_CXX_STANDARD 20)

include(FetchContent)

FetchContent_Declare(gbench
    GIT_REPOSITORY https://github.com/google/benchmark
    GIT_TAG        v1.8.3
)
FetchContent_MakeAvailable(gbench)

find_package(AWSSDK REQUIRED COMPONENTS core s3-crt)

add_executable(${PROJECT_NAME} "main.cpp")
target_link_libraries(${PROJECT_NAME} benchmark::benchmark ${AWSSDK_LINK_LIBRARIES})

Note that I only ran 10 requests in parallel rather than 500; with 500, testing was taking far too long. What kind of hardware are you testing on? I'm running on a MacBook Pro, where the benchmark context looks like this:

Run on (10 X 24 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x10)
Load Average: 3.34, 4.07, 3.57

The results of the benchmark are as follows.

Head:

Benchmark	Time	CPU	Iterations
BM_MD5Checksum/File Size in MB:32/iterations:10/process_time/real_time	70.6 ms	70.1 ms	10
BM_CRC32Checksum/File Size in MB:32/iterations:10/process_time/real_time	11.0 ms	10.9 ms	10
BM_S3PutObjectWithoutPrecalcChecksum/File Size in MB:32/iterations:10/process_time/real_time	2.23 s	0.269 s	10
BM_S3PutObjectWithPrecalcChecksum/File Size in MB:32/iterations:10/process_time/real_time	2.13 s	0.218 s	10
BM_S3PutObjectAsyncWithoutPrecalcChecksum/File Size in MB:32/iterations:5/process_time/real_time	27.7 s	3.30 s	5
BM_S3PutObjectAsyncWithPrecalcChecksum/File Size in MB:32/iterations:5/process_time/real_time	24.0 s	2.44 s	5

1.11.159:

Benchmark	Time	CPU	Iterations
BM_MD5Checksum/File Size in MB:32/iterations:10/process_time/real_time	69.1 ms	68.9 ms	10
BM_CRC32Checksum/File Size in MB:32/iterations:10/process_time/real_time	24.1 ms	24.1 ms	10
BM_S3PutObjectWithoutPrecalcChecksum/File Size in MB:32/iterations:10/process_time/real_time	2.23 s	0.261 s	10
BM_S3PutObjectWithPrecalcChecksum/File Size in MB:32/iterations:10/process_time/real_time	2.25 s	0.301 s	10
BM_S3PutObjectAsyncWithoutPrecalcChecksum/File Size in MB:32/iterations:5/process_time/real_time	23.7 s	2.97 s	5
BM_S3PutObjectAsyncWithPrecalcChecksum/File Size in MB:32/iterations:5/process_time/real_time	25.1 s	3.33 s	5

From this we can draw several conclusions:

There is actually a performance gain between 1.11.159 and Head for pre-calculated checksums: on Head, a PUT with a pre-calculated checksum is ~30 percent faster (CPU time) than one without any checksum specified. This is due to a change we made in the SDK that fixed pre-calculated checksums as a whole. Before that change we were calculating the checksum incorrectly, or perhaps twice. If you previously saw the same timings with a pre-calculated checksum as with no checksum specified (default MD5), that was because of this bug; the discrepancy should have been there all along.
MD5 is really slow, and you shouldn't be using it if you care about performance. From this benchmark we can see that just calculating MD5 is roughly 6 times as expensive as CRC32 on Head (70.6 ms vs. 11.0 ms). This is what contributes to the discrepancy you describe when you say "the execution time is approximately doubled when skipping the pre-calculation of checksum": on higher-end machines with more throughput, the checksum becomes the bottleneck. If you were to change the default checksum to CRC32, I would expect the timings to be the same. If you would like a way to configure the default checksum, please create a feature request and we can prioritize it. Right now you can set it per request by calling request.SetChecksumAlgorithm(ChecksumAlgorithm::CRC32); on your request.
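
The ratios behind these conclusions can be recomputed directly from the two result tables above. The helper names below are hypothetical; the only inputs are the published timings.

```cpp
#include <cassert>
#include <cmath>

// How many times as long `slow` takes compared with `fast`.
constexpr double SlowdownFactor(double slow, double fast) {
  return slow / fast;
}

// Relative reduction of `after` compared with `before`, in percent.
constexpr double PercentReduction(double before, double after) {
  return (before - after) / before * 100.0;
}

// Timings copied from the tables above.
constexpr double kHeadMd5Ms = 70.6;         // BM_MD5Checksum, Head
constexpr double kHeadCrc32Ms = 11.0;       // BM_CRC32Checksum, Head
constexpr double kOldMd5Ms = 69.1;          // BM_MD5Checksum, 1.11.159
constexpr double kOldCrc32Ms = 24.1;        // BM_CRC32Checksum, 1.11.159
constexpr double kHeadAsyncNoPrecalcCpuS = 3.30;  // async PUT CPU time, Head
constexpr double kHeadAsyncPrecalcCpuS = 2.44;    // async PUT CPU time, Head
```

With these numbers, MD5 takes about 6.4 times as long as CRC32 on Head and about 2.9 times as long on 1.11.159, and the asynchronous PUT with a pre-calculated checksum uses about 26 percent less CPU time on Head than the one without.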

Let me know what you think and whether you can replicate these results with the benchmark I have provided. Of course, if you want to update the benchmark to reflect something else, please go ahead, and let me know what hardware you are running on.

Greetings! It looks like this issue hasn’t been active in longer than a week. We encourage you to check if this is still an issue in the latest release. Because it has been longer than a week since the last update on this, and in the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or add an upvote to prevent automatic closure, or if the issue is already closed, please feel free to open a new one.

aws / aws-sdk-cpp