aws / aws-sdk-cpp

AWS SDK for C++
Apache License 2.0
1.91k stars 1.04k forks source link

Destructor S3CrtClient::~S3CrtClient() hangs, as of release 1.11.195 #2969

Closed hunjmes closed 1 month ago

hunjmes commented 1 month ago

Describe the bug

Symptoms are superficially similar to https://github.com/aws/aws-sdk-cpp/issues/2769 , except that issue involved calling Aws::ShutdownAPI(), while an S3CrtClient was still alive.

This bug involves the S3CrtClient::~S3CrtClient() destructor, itself. We see a hang/deadlock inside:

#0  0x00007fcfffe93377 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fcffdc63800 in __gthread_cond_wait (__mutex=<optimized out>, __cond=0x7fcff3f246c8)
    at /local/p4clients/pkgbuild-v3X3O/workspace/build/LibGCC/LibGCC-gcc.238862.0/AL2_x86_64/DEV.STD.PTHREAD/build/private/gcc-src/build/private/src/gcc-11.4.0-build2/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865
#2  std::__condvar::wait (__m=..., this=0x7fcff3f246c8)
    at /local/p4clients/pkgbuild-v3X3O/workspace/build/LibGCC/LibGCC-gcc.238862.0/AL2_x86_64/DEV.STD.PTHREAD/build/private/gcc-src/build/private/src/gcc-11.4.0-build2/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/std_mutex.h:155
#3  std::condition_variable::wait (this=this@entry=0x7fcff3f246c8, __lock=...)
    at /local/p4clients/pkgbuild-v3X3O/workspace/build/LibGCC/LibGCC-gcc.238862.0/AL2_x86_64/DEV.STD.PTHREAD/build/private/gcc-src/build/private/src/gcc-11.4.0/libstdc++-v3/src/c++11/condition_variable.cc:41
#4  0x00007fd001f4d79b in std::condition_variable::wait<Aws::Utils::Threading::Semaphore::WaitOne()::<lambda()> > (__p=..., __lock=..., this=0x7fcff3f246c8)
    at /opt/brazil-pkg-cache/packages/GCC/GCC-11.x.1214.0/AL2_x86_64/DEV.STD.PTHREAD/build/gcc-11.4.0/include/c++/11.4.0/condition_variable:103
#5  Aws::Utils::Threading::Semaphore::WaitOne (this=0x7fcff3f24690)
    at /local/p4clients/pkgbuild-3Vovk/workspace/build/Aws-sdk-cpp-core/Aws-sdk-cpp-core-1.11.x_StdMemoryManagement.96405.0/AL2_x86_64/DEV.STD.PTHREAD/build/private/aws-sdk-cpp/src/aws-cpp-sdk-core/source/utils/threading/Semaphore.cpp:21
#6  0x00007fd001a0b76f in Aws::S3Crt::S3CrtClient::~S3CrtClient (this=0x7ffd3bf79cb0, __in_chrg=<optimized out>)
    at /opt/brazil-pkg-cache/packages/GCC/GCC-11.x.1214.0/AL2_x86_64/DEV.STD.PTHREAD/build/gcc-11.4.0/include/c++/11.4.0/bits/shared_ptr_base.h:984

Expected Behavior

The S3CrtClient's destructor should not hang. (This was the behavior, as of release 1.11.175.)

Current Behavior

The S3CrtClient's destructor hangs, as of release 1.11.195.

Reproduction Steps

We don't have a formal repro yet. I am trying to nail down the root cause of the hang -- it might be unexpected/incorrect usage of the AWS API, by our client application -- concurrently with filing this bug.

Possible Solution

No response

Additional Information/Context

No response

AWS CPP SDK version used

1.11.195

Compiler and Version used

gcc 11

Operating System and version

Linux 5.10.216-182.855.amzn2int.x86_64

hunjmes commented 1 month ago

Also hangs in release 1.11.328 .

hunjmes commented 1 month ago

Looks like function CrtClientShutdownCallback() is not being called, to Release() the S3CrtClient's m_clientShutdownSem semaphore field.

hunjmes commented 1 month ago

Problem seems to be that if the caller creates and destroys an S3CrtClient, using RAII pattern, then the constructor creates a semaphore field:

  m_clientShutdownSem = Aws::MakeShared<Threading::Semaphore>(ALLOCATION_TAG, 0, 1);

-- and passes that semaphore to the wrapped Crt Client:

  m_wrappedData.clientShutdownSem = m_clientShutdownSem;
  s3CrtConfig.shutdown_callback = CrtClientShutdownCallback;
  s3CrtConfig.shutdown_callback_user_data = static_cast<void*>(&m_wrappedData);
  m_s3CrtClient = aws_s3_client_new(Aws::get_aws_allocator(), &s3CrtConfig);

Then, inside the destructor:

  aws_s3_client_release(m_s3CrtClient);
  if(m_clientShutdownSem)
  {
    m_clientShutdownSem->WaitOne(); // Wait aws_s3_client shutdown
  }

However, the call to aws_s3_client_release(...) doesn't seem to be releasing the Crt client.

I wonder if there's an implicit assumption, at the Crt layer, that there is always only 1 S3 (crt) client alive, at a time? What's supposed to happen, if the caller does:

{
  ClientConfiguration config;
  S3CrtClient client1(config);
  S3CrtClient client2(config);
...
}

?

hunjmes commented 1 month ago

My hanging/failing test is a Google-test "crash test," and the hanging process is the process fork()ed by the Google-test framework. My other unit tests, that don't use the "crash test" framework, do not appear to hang.

I wonder if some "Crt" state that used to be tied to a given S3CrtClient object has now been made static?

DmitriyMusatkin commented 1 month ago

A quick note is that forking can lead to somewhat unexpected results currently when used in conjunction with CRT. CRT does rely on some global state that can end up being mangled during fork. This is something on the backlog for the team to improve, but we dont have a concrete ETA for that. As a workaround, you could try to avoid forking while CPP SDK is in initialized state, i.e. explicitly calling ShutdownAPI before forking the process.

hunjmes commented 1 month ago

Confirmed that the proposed workaround, to call ShutdownAPI before forking, works. Thanks!

jmklix commented 1 month ago

@hunjmes Did you have any other questions about this sdk?