aws / aws-sdk-cpp

AWS SDK for C++
Apache License 2.0
1.97k stars 1.06k forks source link

AWSClient segfaults after a data race #3084

Closed k0zmo closed 2 months ago

k0zmo commented 2 months ago

Describe the bug

Following simple code that starts a few async calls and then interrupts them segfaults (randomly):

#include <aws/core/Aws.h>
#include <aws/core/auth/AWSAuthSigner.h>
#include <aws/core/auth/AWSCredentialsProvider.h>
#include <aws/core/utils/Outcome.h>
#include <aws/s3/S3Client.h>
#include <aws/s3/model/GetObjectRequest.h>
#include <aws/s3/model/HeadObjectRequest.h>

#include <algorithm>

const auto bucket = "test-bucket";
const auto key = "bigfile";

int main()
{
    Aws::SDKOptions options;
    Aws::InitAPI(options);

    auto prov = Aws::MakeShared<Aws::Auth::EnvironmentAWSCredentialsProvider>("EnvironmentAWSCredentialsProvider");
    Aws::Auth::AWSCredentials creds = prov->GetAWSCredentials();
    Aws::Client::ClientConfiguration config;

    auto client = Aws::MakeShared<Aws::S3::S3Client>( "S3Client", prov, config, Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy::Never, true );

    Aws::S3::Model::HeadObjectRequest hor;
    hor.WithBucket(bucket).WithKey(key);
    auto outcome = client->HeadObject(hor);
    if (!outcome.IsSuccess())
    {
        return -1;
    }

    uint64_t remainingFileSize = outcome.GetResult().GetContentLength();
    uint64_t readPosition = 0;
    const auto readAheadQueueSize = 15;
    const auto readSectorSize = 8Ull * 1024Ull * 1024Ull;

    std::deque<Aws::S3::Model::GetObjectOutcomeCallable> queuedRequests;

    while (queuedRequests.size() < readAheadQueueSize)
    {
        uint64_t readSize = std::min(readSectorSize, remainingFileSize);

        std::ostringstream readRange;
        readRange << "bytes=" << readPosition << "-" << (readPosition + readSize - 1); // Inclusive range

        Aws::S3::Model::GetObjectRequest objectRequest;
        objectRequest.WithBucket(bucket).WithKey(key).WithRange(readRange.str());

        queuedRequests.push_back(client->GetObjectCallable(objectRequest));

        remainingFileSize -= readSize;
        readPosition += readSize;
    }

    client.reset();

    Aws::ShutdownAPI(options);
}

There are two reasons for the segfault. One is extended lifetime of executor - ClientConfiguration creates a shared_ptr to Executor, passes it onto the client but because its lifetime is longer than a client, it makes the Executor to live longer than the client.

However, there's an implicit assumption when destroying the client that the executor will run to its completion (at least for already started packaged_tasks) as part of its destructor. This obviously is not invoked as Executor outlives the client. The most obvious fix is to add client.executor.reset() after creating an instance of client so it is the only owner of Executor.

The problem stems from using shared_ptr and simultaneously assuming it is the only owner - which cannot be guaranteed.

Secondly, there's another data race (also leading to segfault pretty consistently) in ClientWithAsyncTemplateMethods::ShutdownSdkClient which is called as part of S3Client::~S3Client(). There, we first release the m_endpointProvider and then m_executor (but only if it's the only owner). This is wrong as m_endpointProvider can still be used by threads started by the Executor - such as: image The fix here is simply to move the m_endpointProvider.reset() call after m_executor is released:

            pClient->m_executor.reset();
            pClient->m_clientConfiguration.executor.reset();
            pClient->m_clientConfiguration.retryStrategy.reset();
            pClient->m_endpointProvider.reset();

Running original version with thread sanitizer yields more than 100 warnings while the above changes gives none for attached sample code

Expected Behavior

No segfault nor datarace

Current Behavior

Above code segfault in various places, most likely on dereferencing already released EndpointProvider but also deep inside the AttemptExhaustively function on dereferencing RetryStrategy

Reproduction Steps

Run above code few times and observe segfaults

Possible Solution

No response

Additional Information/Context

No response

AWS CPP SDK version used

AWS SDK CPP, version 1.11.352

Compiler and Version used

gcc 11.4.0, Visual Studio 17.11, clang 14.0

Operating System and version

Ubuntu 22.04 and Windows 11

jmklix commented 2 months ago

Can you try adding brackets around the sdk code so that it is all in the same scope and is cleaned up correctly before Aws::ShutdownAPI is called? This is mentioned in our basic use here:

#include <aws/core/Aws.h>
int main(int argc, char** argv)
{
   Aws::SDKOptions options;
   Aws::InitAPI(options);
   {
      // make your SDK calls here.
   }
   Aws::ShutdownAPI(options);
   return 0;
}
SergeyRyabinin commented 2 months ago

Hi @k0zmo ,

thank you for bringing and describing in details this issue. Additional thank you for pointing to the ClientConfiguration object storing extra reference to the Executor.

I agree with @jmklix in referring to our basic-use instruction: this CPP SDK is designed in a such a way that it requires extra scope, and extra caution in tracking async running tasks from the user.

However, this is a recurring topic, so I submitted a PR https://github.com/aws/aws-sdk-cpp/pull/3087 having your suggestion to re-arrange the destruction order, as well as, in addition, some refactoring to use factories by default within the ClientConfiguration, and some other minor changes to track client "is initialized" checks. I hope we are going to merge this PR soon, while it is not going to solve all "Init-Shutdown-memory-allocator-async-tasks-tracking" kind of issues, it is still an improvement.

Best regards, Sergey

k0zmo commented 2 months ago

@jmklix Limiting a scope for AWSClient like you shown doesn't make difference. The problem I reported comes from an entangled lifetime between ClientConfiguration and AWSClient, not AWSClient and the global SDK state.

github-actions[bot] commented 2 months ago

This issue is now closed. Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one.