Unbalanced InitializeGlobalStatics/CleanupGlobalStatics() calls in AWSClient and race in implementation

crusader-mike commented 6 years ago

Platform: Windows SDK version: 1.3.31

I am chasing a very rare crash (deep in AWS SDK) that happens in GetObject() call. After spending more than a day trying to reproduce it I finally caught one incidence in a debugger. Unfortunately nuget package doesn't contain .pdb files and symbols package was not refreshed since 2016.

If you have them laying around -- can you provide me with pdb files for x64 v140 Debug 1.3.31 build of -core and -s3 dlls, please?

Also, it would be really helpful if .symbols nuget packages were released along with redist packages (or maybe even included into them).

Code is very simple and I suspect a bug in SDK itself:

Aws::S3::Model::GetObjectRequest q;
q.SetKey( stream_info.obj_key );
static_assert(sizeof(unsigned long long) >= sizeof(streamoff), "");
q.SetRange(common::str_printf("bytes=%llu-%llu", static_cast<unsigned long long>(read_offset), static_cast<unsigned long long>(read_offset + chunk_size - 1)));
auto r = s3c.GetObject(q);                  // <-- this is where it happens

Stack:

    aws-cpp-sdk-core.dll!000007fed638169e() Unknown
    aws-cpp-sdk-core.dll!000007fed6382a0f() Unknown
    aws-cpp-sdk-core.dll!000007fed639a9a8() Unknown
    aws-cpp-sdk-core.dll!000007fed639a516() Unknown
    aws-cpp-sdk-s3.dll!000007fed66f45b1()   Unknown
    aws-cpp-sdk-core.dll!000007fed639503f() Unknown
    aws-cpp-sdk-core.dll!000007fed6393c7b() Unknown
    aws-cpp-sdk-core.dll!000007fed63958d6() Unknown
    aws-cpp-sdk-s3.dll!000007fed66e73e5()   Unknown
>   blah-blah-blah::download_file_(...)

JonathanHenson commented 6 years ago

Dumb question, are you calling InitApi() before you do any of this?

crusader-mike commented 6 years ago

:-) Yes. It is possible that this logic is invoked outside of "aws init" guard object scope. But I've checked -- there are 8 guard objects active at this moment -- SDK is definitely initialized.

There is at least a dozen of parallel threads retrieving other objects using the same download_file_() function right now. Could be some weird memory corruption, but it is hard to investigate anything without debug symbols.

one more note: crash is access violation at 0x00000020 address -- i.e. something somewhere is NULL when it shouldn't be. Offending code:

000007FED6381660  mov         qword ptr [rsp+8],rcx  
000007FED6381665  push        rdi  
000007FED6381666  sub         rsp,0B0h  
000007FED638166D  mov         rdi,rsp  
000007FED6381670  mov         ecx,2Ch  
000007FED6381675  mov         eax,0CCCCCCCCh  
000007FED638167A  rep stos    dword ptr [rdi]  
000007FED638167C  mov         rcx,qword ptr [rsp+0C0h]  
000007FED6381684  mov         rax,qword ptr [7FED662E5F0h]  
000007FED638168B  xor         rax,rsp  
000007FED638168E  mov         qword ptr [rsp+0A0h],rax  
000007FED6381696  mov         rax,qword ptr [rsp+0C0h]  
000007FED638169E  cmp         qword ptr [rax+8],0                      <--- here, rax is 24
000007FED63816A3  jne         000007FED63816A6  
000007FED63816A5  int         3  
000007FED63816A6  mov         rax,qword ptr [rsp+0C0h]  
000007FED63816AE  cmp         qword ptr [rax+10h],0  
000007FED63816B3  jne         000007FED63816B6  
000007FED63816B5  int         3  
000007FED63816B6  mov         rax,qword ptr [rsp+0C0h]  
000007FED63816BE  mov         eax,dword ptr [rax]  
000007FED63816C0  and         eax,100h  
000007FED63816C5  test        eax,eax

JonathanHenson commented 6 years ago

Also, nuget defaults don’t upgrade everything when you pull in a new dependency, can you check that all of the versions match?

marcomagdy commented 6 years ago

We don't keep pdb files around from earlier versions unfortunately. Also, Microsoft's debugger does some annoying timestamp checking to ensure the pdb file was created for the corresponding dll. So even if the code is the same it won't let you use the pdb.

Your best bet is to build from source. Also, aren't you using a custom streambuf? Did you rule out your implementation?

crusader-mike commented 6 years ago

@JonathanHenson

Also, nuget defaults don’t upgrade everything when you pull in a new dependency, can you check that all of the versions match?

Everything seems in order:

AWSSDKCPP-Core                      {1.3.31}
AWSSDKCPP-Core.redist               {1.3.31}
AWSSDKCPP-S3                        {1.3.20060301.31}
AWSSDKCPP-S3.redist                 {1.3.20060301.31}

@marcomagdy

Your best bet is to build from source.

Sigh... I'll have to figure out how to do it -- we were always using nuget packages on windows. Too bad this problem popped up like literally one day before major release. :-\

Even .map file can be useful here -- I'd be able to find in which function this happens using it's address.

Also, aren't you using a custom streambuf? Did you rule out your implementation?

Custom streambuf is used only for upload -- for download I opted for piece-by-piece download using SetRange() with default stream (std::stringstream, afair).

marcomagdy commented 6 years ago

@crusader-mike what credentials provider are you using?

crusader-mike commented 6 years ago

@marcomagdy I pass credentials directly into S3Client ctor:

Aws::S3::S3Client(Aws::Auth::AWSCredentials(s3_access_key_id, s3_secret_key),
                  awsClientConfiguration(s3_endpoint, s3_auth_region, opts_.s3_request_timeout_ms, opts_.s3_connect_timeout_ms, opts_.s3_max_connections));

crusader-mike commented 6 years ago

Huh... Upgrading SDK to 1.4.1 caused the problem to go away, it seems -- I can no longer reproduce it. What is more interesting -- CPU usage went down ~2 times and network utilization went up ~5 times (debug build).

On the bad side -- our QA will chew my head off. They've just finished multi-week regression tests.

crusader-mike commented 6 years ago

Actually, taking it back... With latest SDK it shows up in another place and is even harder to reproduce.

I spotted something though -- AWSClient ctor/dtor call InitializeGlobalStatics/CleanupGlobalStatics. And everything seems fine... Until you realize that generated move ctor is not gonna do that! Guess what -- my code moves S3Client instances. I am not sure that this is the root cause, but I'll keep digging.

Also, InitializeGlobalStatics/CleanupGlobalStatics implementation looks very weird to me (and I've done a fair amount of lockless programming) -- I highly doubt it is correct.

marcomagdy commented 6 years ago

The compiler won't generate a move ctor if there's a user-defined dtor; and in this case, there is one defined. shameless plug https://stackoverflow.com/questions/4943958/conditions-for-automatic-generation-of-default-copy-move-ctor-and-copy-move-assi/38257488#38257488

Which means we have a different kind of bug. When the client is copied, we don't increment the refcount.

crusader-mike commented 6 years ago

... but it will generate copy ctor, right? which won't call InitializeGlobalStatics()...

marcomagdy commented 6 years ago

Yes, correct. That's probably the bug. See my updated comment above. You can avoid this problem now by avoiding copying the client.

crusader-mike commented 6 years ago

Working on getting rid of "S3Client move" logic in our code and testing. (Edit: nope, crash is still there) Pretty sure there are 2 bugs here -- InitializeGlobalStatics/CleanupGlobalStatics() implementation is not good, I feel there is a race there.

Some background -- I have a S3Client cache:

    using s3cache_t = map<tuple<string, string, string, string>, Aws::S3::S3Client>;    // { (endpoint, key, secret, region) -> S3Client }
    s3cache_t s3cache_;

    Aws::S3::S3Client& get_client(opt_string const& endpoint, opt_string const& access_key_id, opt_string const& secret_key, opt_string const& auth_region) override
    {
        lock_guard<mutex> lock{mutex_};     // it is ok to reuse this mutex for cache protection

        auto& s3_endpoint       = endpoint      ? *endpoint         : opts_.s3_endpoint     ;
        auto& s3_access_key_id  = access_key_id ? *access_key_id    : opts_.s3_access_key_id;
        auto& s3_secret_key     = secret_key    ? *secret_key       : opts_.s3_secret_key   ;
        auto& s3_auth_region    = auth_region   ? *auth_region      : opts_.s3_auth_region  ;

        auto key = make_tuple(s3_endpoint, s3_access_key_id, s3_secret_key, s3_auth_region);
        auto it = s3cache_.lower_bound(key);
        if (it == s3cache_.end() || it->first != key)
            it = s3cache_.emplace_hint(it, key, Aws::S3::S3Client(Aws::Auth::AWSCredentials(s3_access_key_id, s3_secret_key),
                                                                  awsClientConfiguration(s3_endpoint, s3_auth_region, opts_.s3_request_timeout_ms, opts_.s3_connect_timeout_ms, opts_.s3_max_connections)));
        return it->second;
    }

As you see -- after all is done, I could end up with "global statics" not being initialized even though I have live instances of S3Client in cache. I think my crashes happen during destruction of this cache.

crusader-mike commented 6 years ago

@marcomagdy What cmake cmdline I should use to build the same binary as one you publish in nuget? I tried different combinations -- every time I end up with dll size that differs from dll in nuget package. (my download speed ends up being 4 times slower too)

crusader-mike commented 6 years ago

Caught it -- same problem as #781

>   aws-cpp-sdk-core.dll!Aws::External::tinyxml2::StrPair::GetStr() Line 199    C++
    aws-cpp-sdk-core.dll!Aws::External::tinyxml2::XMLNode::Value() Line 635 C++
    aws-cpp-sdk-core.dll!Aws::Utils::Xml::XmlNode::GetName() Line 59    C++
    aws-cpp-sdk-core.dll!Aws::Client::XmlErrorMarshaller::ContainsError(const Aws::Http::HttpResponse & httpResponse) Line 124  C++
    aws-cpp-sdk-s3.dll!Aws::S3::S3Client::DoesResponseGenerateError(const std::shared_ptr<Aws::Http::HttpResponse> & response) Line 2801    C++
    aws-cpp-sdk-core.dll!Aws::Client::AWSClient::AttemptOneRequest(const Aws::Http::URI & uri, const Aws::AmazonWebServiceRequest & request, Aws::Http::HttpMethod method, const char * signerName) Line 295    C++
    aws-cpp-sdk-core.dll!Aws::Client::AWSClient::AttemptExhaustively(const Aws::Http::URI & uri, const Aws::AmazonWebServiceRequest & request, Aws::Http::HttpMethod method, const char * signerName) Line 174  C++
    aws-cpp-sdk-core.dll!Aws::Client::AWSClient::MakeRequestWithUnparsedResponse(const Aws::Http::URI & uri, const Aws::AmazonWebServiceRequest & request, Aws::Http::HttpMethod method, const char * signerName) Line 341  C++
    aws-cpp-sdk-s3.dll!Aws::S3::S3Client::GetObject(const Aws::S3::Model::GetObjectRequest & request) Line 1441 C++

crusader-mike commented 6 years ago

To summarize:

AWSClient ctor/dtor call InitializeGlobalStatics/CleanupGlobalStatics(), but (if you copy/move S3Client object) generated copy ctor -- doesn't.
I believe InitializeGlobalStatics/CleanupGlobalStatics() is not good and has races in it. My suggestion is to stay away from lockless programming -- it is exceptionally tricky (and AWS SDK switches off exceptions :) )
I would appreciate if someone tell me cmake line used by official nuget build -- to be able to build same binaries locally (with debug symbols).

Thank you.

marcomagdy commented 6 years ago

Fixed in 1.4.55 see 832c839 for more details.

crusader-mike commented 6 years ago

This fix isn't good -- there are still race conditions here. For example, imagine a case of two threads calling AWSClient ctor and the first one (one that observed s_refCount++ == 0 being true) is preempted (or suspended) for a long time. Second thread will happily continue and eventually crash because global state isn't initialized (even though s_refCount is > 0).

Similarly, there is a race between dtor and ctor.

Edit: I can't reopen this issue (and too lazy to file a new one), gotta leave this on your conscience :-)

marcomagdy commented 6 years ago

Whoops, that’s embarrassing. We’ll get that fixed. Thanks for spotting it.

marcomagdy commented 6 years ago

@crusader-mike can you take a look at https://github.com/aws/aws-sdk-cpp/pull/879 see if I missed a use case?

crusader-mike commented 6 years ago

Nope. You can't use atomic variable here to properly protect global state (unless your entire state is atomic variable). Think about it -- for other threads at any point in time there should be only two observable states: "s_refCount == 0 and global state is completely unitialized" and "s_refCount != 0 and global state is completely initialized". You are guaranteed to break this invariant by modifying s_refCount and global state separately (from other thread perspective). You have to use mutex here to "merge" these modifications into one (in eyes of observers).

P.S. I am probably repeating myself -- but stay away from lockless programming unless you are ready to sink significant time into it. Even if you get to the point when you can comfortably write such code -- necessary knowledge will take so much space in your brain that it will probably push out most of other C++-related stuff :)

marcomagdy commented 6 years ago

I think I'll take @singku's suggestion and initialize this container in Aws::init. This whole ordeal was about trying to optimize that container so that it doesn't grow unbounded. But that reasoning is not strong enough to complicate the constructor's logic (and definitely doesn't warrant a mutex).

stay away from lockless programming unless you are ready to sink significant time into it.

I appreciate your concern. But if I'm going to stay away from hard programming paradigms, I might as well go program in JavaScript or Ruby 😄. But in all seriousness, I usually have my race-conditions covered well, and we have good engineers on the team that catch the things I miss. But obviously this case is an exception. I dropped the ball, but I'm glad you caught it.

crusader-mike commented 6 years ago

It isn't just a concern -- I think every time I stumbled upon an atomic variable in SDK code there was a race condition nearby.

You can achieve similar effect using static variables or std::call_once -- though afaik they are implemented using mutex.

Yes, moving this into AwsInit sounds like a good idea.

atomic variables isn't necessarily better than mutex -- each atomic variable has a "mutex" associated with it (google MESI protocol) and price of using it grows with number of processors in your system and gets especially bad on modern servers (due to NUMA). A lot of wait-free algorithms turned out to be slower than straightforward mutex-protected code simply because with mutex you pay this price only twice (lock + unlock) but with atomic variables -- on every access.

If you are interested in lockless programming I suggest reading/watching these:

there were two presentations Herb Sutter did about C++11 memory model -- it should not be hard to find them in google. Don't trust everything he says, though -- I think there were few mistakes made :)
Memory Consistency Models for Shared-Memory Multiprocessors -- afaik this is a foundation on which current Java/C++ memory model is built on. It is pointless to learn everything in this paper, but it makes sense to familiarize yourself with it.
ABA problem
Futexes are tricky
Wait-Free Synchronization
A methodology for implementing highly concurrent data objects

JonathanHenson commented 6 years ago

IIRC this code was written before aws init existed.

WRT luckless stuff, I agree.

Another aspect here in hindsight, lockless algorithms aren’t great for long term code viability. Even if they are correct when written, it increases cognitive load for all future developers and will inevitably be broken by future changes.

crusader-mike commented 6 years ago

luckless™ code is amazing for generating cognitive load, lol. And sometimes it takes people years to realize that "it broke".

marcomagdy commented 6 years ago

simply because with mutex you pay this price only twice (lock + unlock) but with atomic variables -- on every access.

That is not true. An atomic read of register-wide variables, like pointers and integers (at least on x86) is simply a memory load. Locking a mutex is a system call.

crusader-mike commented 6 years ago

My low-lvl x86 is a bit rusty, but I am sure there is a fence on read -- even if you don't observe in generated asm because x86 is naturally consistent (and acquire-read is just a read), it still exists on higher level (compiler can't reorder certain things around it).

Also, afaik, on modern systems locking mutex is system call (or better to say -- switch to kernel mode) only if you have to wait. Grabbing mutex that isn't locked is inexpensive. In pre-futex times on Linux mutex was expensive, yes.

JonathanHenson commented 6 years ago

Unfortunately I’ve had to read a ton of clang and llvm source code lately and pthread_mutex using the fast mutex init is only a system call under contention. The same is true for SRWLocks on windows... std::mutex simply calls into those apis.

The days of critical sections and full mutex objects on windows are, thankfully, behind us..

JonathanHenson commented 6 years ago

Also, fwiw, you’re screwed on ARM, SPARC, and PowerPC regardless of which you use.

crusader-mike commented 6 years ago

@JonathanHenson Are you referring to this or the fact that on these platforms memory model is more relaxed than C++11 model (which means you can't use C++ to squeeze the maximum performance-wise)?

marcomagdy commented 6 years ago

Fixed in v1.4.57

crusader-mike commented 6 years ago

Small note: since you've left ~AWSClient() in place it probably makes sense to add move ctor and op=? E.g. smth like AWSClient(AWSClient&&) = default;. Not sure if this will force you to do the same for copy ctor/op=.

marcomagdy commented 6 years ago

I have to leave the dtor so it can be marked as virtual. I did have a defaulted copy/move ctors and assignment operators, but the code failed to build on VS2013, which unfortunately we still support. It is by far the worst C++11 compliant compiler.

crusader-mike commented 6 years ago

Yeah, VS2013 was rough around the edges... Can you work around that like this:

struct goddamitbill { virtual ~goddamitbill() = default; };

class AwsClient : public goddamitbill { ... };

?

aws / aws-sdk-cpp

Unbalanced InitializeGlobalStatics/CleanupGlobalStatics() calls in AWSClient and race in implementation #815