aws / aws-sdk-cpp

AWS SDK for C++
Apache License 2.0

Seeing curl_easy_perform stuck at aws-sdk 1.7.336 #1861

sihanwang41 closed 2 years ago

sihanwang41 commented 2 years ago

Describe the issue

We are using TensorFlow 2.6, which by default uses aws-sdk-cpp 1.7.336.

The issue is intermittent, but it occurs quite often on some of the hosts in one large cluster. Setting httpRequestTimeoutMs to 10 s and retrying up to 10 times helps work around the issue.

Hundreds of hosts (500-1000) query the same object at nearly the same time.

Thread 123 (Thread 0x7f954c3cc700 (LWP 321)):
0 0x00007f9b01954cb9 in poll () from ./libc.so.6
1 0x0000557e95c5bec2 in Curl_poll () at /usr/include/c++/8/ext/new_allocator.h:86
2 0x0000557e95c56e89 in multi_wait.part () at /usr/include/c++/8/ext/new_allocator.h:86
3 0x0000557e95c57079 in curl_multi_poll () at /usr/include/c++/8/ext/new_allocator.h:86
4 0x0000557e95c4b3b3 in curl_easy_perform () at /usr/include/c++/8/ext/new_allocator.h:86
5 0x0000557e95a7ab4b in Aws::Http::CurlHttpClient::MakeRequestInternal(Aws::Http::HttpRequest&, std::shared_ptr&, Aws::Utils::RateLimits::RateLimiterInterface, Aws::Utils::RateLimits::RateLimiterInterface) const () at /usr/include/c++/8/ext/new_allocator.h:86
6 0x0000557e95a7cc59 in Aws::Http::CurlHttpClient::MakeRequest(std::shared_ptr const&, Aws::Utils::RateLimits::RateLimiterInterface, Aws::Utils::RateLimits::RateLimiterInterface) const () at /usr/include/c++/8/ext/new_allocator.h:86
7 0x0000557e95bfcda6 in Aws::Client::AWSClient::AttemptOneRequest(std::shared_ptr const&, Aws::AmazonWebServiceRequest const&, char const*) const () at /usr/include/c++/8/ext/new_allocator.h:86
8 0x0000557e95bfd3f4 in Aws::Client::AWSClient::AttemptExhaustively(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*) const () at /usr/include/c++/8/ext/new_allocator.h:86
9 0x0000557e95bfe245 in Aws::Client::AWSClient::MakeRequestWithUnparsedResponse(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*) const () at /usr/include/c++/8/ext/new_allocator.h:86
10 0x0000557e95ad9b4f in Aws::S3::S3Client::GetObject(Aws::S3::Model::GetObjectRequest const&) const () at /usr/include/c++/8/ext/new_allocator.h:86
11 0x0000557e95a573dd in tensorflow::(anonymous namespace)::S3RandomAccessFile::ReadS3Client (scratch=0x7f92b8c01580 "", result=0x7f954c3b4b80, n=, offset=, this=0x7f8c07542d10) at /usr/include/c++/8/bits/shared_ptr_base.h:1018
12 tensorflow::(anonymous namespace)::S3RandomAccessFile::Read(unsigned long, unsigned long, std::basic_string_view<char, std::char_traits >, char) const () at external/org_tensorflow/tensorflow/core/platform/s3/s3_file_system.cc:255
13 0x0000557e95a17d0f in tensorflow::retrying_internals::RetryingRandomAccessFile::Read(unsigned long, unsigned long, std::basic_string_view<char, std::char_traits >, char) const::{lambda()#1}::operator()() const (__closure=) at /usr/include/c++/8/bits/unique_ptr.h:345
14 std::_Function_handler<tensorflow::Status (), tensorflow::retrying_internals::RetryingRandomAccessFile::Read(unsigned long, unsigned long, std::basic_string_view<char, std::char_traits >, char) const::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /usr/include/c++/8/bits/std_function.h:283
15 0x0000557e95a41e7a in std::function<tensorflow::Status ()>::operator()() const (this=0x7f954c3b4a10) at /usr/include/c++/8/bits/std_function.h:682
16 tensorflow::RetryingUtils::CallWithRetries(std::function<tensorflow::Status ()> const&, std::function<void (long)> const&, tensorflow::RetryConfig const&) (f=..., sleep_usec=..., config=...) at external/org_tensorflow/tensorflow/core/platform/retrying_utils.cc:54
17 0x0000557e95a42512 in tensorflow::RetryingUtils::CallWithRetries(std::function<tensorflow::Status ()> const&, tensorflow::RetryConfig const&) (f=..., config=...) at /usr/include/c++/8/new:169
18 0x0000557e95a18c89 in tensorflow::retrying_internals::RetryingRandomAccessFile::Read (this=, offset=955128096, n=83425632, result=0x7f954c3b4b80, scratch=0x7f92b8c01580 "") at /usr/include/c++/8/bits/std_function.h:87
19 0x0000557e91773cbd in tensorflow::BundleReader::GetValue (this=this@entry=0x7f954c3b5570, entry=..., val=val@entry=0x7f8c08633820) at bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/core/protobuf/tensor_bundle.pb.h:641
20 0x0000557e9177dc9d in tensorflow::BundleReader::Lookup(std::basic_string_view<char, std::char_traits >, tensorflow::Tensor*) () at external/org_tensorflow/tensorflow/core/util/tensor_bundle/tensor_bundle.cc:947
21 0x0000557e8dc587f1 in tensorflow::(anonymous namespace)::RestoreOp::run(tensorflow::BundleReader*) () at external/eigen_archive/unsupported/Eigen/CXX11/src/Tensor/TensorMorphing.h:653

Steps to Reproduce

No response

Current behavior

No response

AWS CPP SDK version used

1.7.336

Compiler and version used

6.5.0

Operating System and version

Ubuntu 18.04

KaibaLopez commented 2 years ago

Hi @sihanwang41, it's kind of hard to help when you are using such an old version of the SDK, and in conjunction with a third-party library. But yes, increasing the request timeout and the number of retries would be the proposed workaround for this.
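
For readers who want to see the shape of the retry half of that workaround, here is a minimal sketch in plain C++. It is not SDK or TensorFlow code: `call_with_retries` and its parameters are hypothetical names made up for illustration, and `attempt` stands in for a request that already enforces its own per-request timeout (for example via `httpRequestTimeoutMs`). In practice the SDK's own retry strategy would normally do this job; the sketch only shows the idea of bounded retries with a growing pause between them.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not part of aws-sdk-cpp or TensorFlow.
// Calls `attempt` until it returns true or `max_attempts` calls have been made.
// The pause between attempts starts at `base_delay` and doubles each time,
// capped at `max_delay`.
// Returns the number of attempts used on success, or 0 if every attempt failed.
int call_with_retries(const std::function<bool()>& attempt,
                      int max_attempts,
                      std::chrono::milliseconds base_delay,
                      std::chrono::milliseconds max_delay) {
    std::chrono::milliseconds delay = base_delay;
    for (int i = 1; i <= max_attempts; ++i) {
        if (attempt()) {
            return i;  // success on the i-th attempt
        }
        if (i == max_attempts) {
            break;  // no point sleeping after the final failure
        }
        std::this_thread::sleep_for(delay);
        delay = std::min<std::chrono::milliseconds>(delay * 2, max_delay);
    }
    return 0;
}

// Example call (assumed values): up to 10 attempts, pauses of 100 ms doubling to at most 5 s.
//   int used = call_with_retries(do_request, 10,
//                                std::chrono::milliseconds(100),
//                                std::chrono::milliseconds(5000));
```

With hundreds of hosts hitting the same object at once, a fixed pause would make them all retry in lockstep, which is why the sketch grows the delay; adding random jitter to each pause would be a further refinement and is left out here.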