Closed crusader-mike closed 6 years ago
Dumb question, are you calling InitApi() before you do any of this?
:-) Yes. It is possible that this logic is invoked outside of "aws init" guard object scope. But I've checked -- there are 8 guard objects active at this moment -- SDK is definitely initialized.
There is at least a dozen of parallel threads retrieving other objects using the same download_file_()
function right now. Could be some weird memory corruption, but it is hard to investigate anything without debug symbols.
one more note: crash is access violation at 0x00000020 address -- i.e. something somewhere is NULL when it shouldn't be. Offending code:
000007FED6381660 mov qword ptr [rsp+8],rcx
000007FED6381665 push rdi
000007FED6381666 sub rsp,0B0h
000007FED638166D mov rdi,rsp
000007FED6381670 mov ecx,2Ch
000007FED6381675 mov eax,0CCCCCCCCh
000007FED638167A rep stos dword ptr [rdi]
000007FED638167C mov rcx,qword ptr [rsp+0C0h]
000007FED6381684 mov rax,qword ptr [7FED662E5F0h]
000007FED638168B xor rax,rsp
000007FED638168E mov qword ptr [rsp+0A0h],rax
000007FED6381696 mov rax,qword ptr [rsp+0C0h]
000007FED638169E cmp qword ptr [rax+8],0 <--- here, rax is 24
000007FED63816A3 jne 000007FED63816A6
000007FED63816A5 int 3
000007FED63816A6 mov rax,qword ptr [rsp+0C0h]
000007FED63816AE cmp qword ptr [rax+10h],0
000007FED63816B3 jne 000007FED63816B6
000007FED63816B5 int 3
000007FED63816B6 mov rax,qword ptr [rsp+0C0h]
000007FED63816BE mov eax,dword ptr [rax]
000007FED63816C0 and eax,100h
000007FED63816C5 test eax,eax
Also, nuget defaults don’t upgrade everything when you pull in a new dependency, can you check that all of the versions match?
We don't keep pdb files around from earlier versions unfortunately. Also, Microsoft's debugger does some annoying timestamp checking to ensure the pdb file was created for the corresponding dll. So even if the code is the same it won't let you use the pdb.
Your best bet is to build from source. Also, aren't you using a custom streambuf? Did you rule out your implementation?
@JonathanHenson
Also, nuget defaults don’t upgrade everything when you pull in a new dependency, can you check that all of the versions match?
Everything seems in order:
AWSSDKCPP-Core {1.3.31}
AWSSDKCPP-Core.redist {1.3.31}
AWSSDKCPP-S3 {1.3.20060301.31}
AWSSDKCPP-S3.redist {1.3.20060301.31}
@marcomagdy
Your best bet is to build from source.
Sigh... I'll have to figure out how to do it -- we were always using nuget packages on windows. Too bad this problem popped up like literally one day before major release. :-\
Even .map file can be useful here -- I'd be able to find in which function this happens using it's address.
Also, aren't you using a custom streambuf? Did you rule out your implementation?
Custom streambuf is used only for upload -- for download I opted for piece-by-piece download using SetRange()
with default stream (std::stringstream, afair).
@crusader-mike what credentials provider are you using?
@marcomagdy I pass credentials directly into S3Client ctor:
Aws::S3::S3Client(Aws::Auth::AWSCredentials(s3_access_key_id, s3_secret_key),
awsClientConfiguration(s3_endpoint, s3_auth_region, opts_.s3_request_timeout_ms, opts_.s3_connect_timeout_ms, opts_.s3_max_connections));
Huh... Upgrading SDK to 1.4.1 caused the problem to go away, it seems -- I can no longer reproduce it. What is more interesting -- CPU usage went down ~2 times and network utilization went up ~5 times (debug build).
On the bad side -- our QA will chew my head off. They've just finished multi-week regression tests.
Actually, taking it back... With latest SDK it shows up in another place and is even harder to reproduce.
I spotted something though -- AWSClient ctor/dtor call InitializeGlobalStatics/CleanupGlobalStatics. And everything seems fine... Until you realize that generated move ctor is not gonna do that! Guess what -- my code moves S3Client instances. I am not sure that this is the root cause, but I'll keep digging.
Also, InitializeGlobalStatics/CleanupGlobalStatics implementation looks very weird to me (and I've done a fair amount of lockless programming) -- I highly doubt it is correct.
The compiler won't generate a move ctor if there's a user-defined dtor; and in this case, there is one defined. shameless plug https://stackoverflow.com/questions/4943958/conditions-for-automatic-generation-of-default-copy-move-ctor-and-copy-move-assi/38257488#38257488
Which means we have a different kind of bug. When the client is copied, we don't increment the refcount.
... but it will generate copy ctor, right? which won't call InitializeGlobalStatics()
...
Yes, correct. That's probably the bug. See my updated comment above. You can avoid this problem now by avoiding copying the client.
Working on getting rid of "S3Client move" logic in our code and testing. (Edit: nope, crash is still there) Pretty sure there are 2 bugs here -- InitializeGlobalStatics/CleanupGlobalStatics()
implementation is not good, I feel there is a race there.
Some background -- I have a S3Client cache:
using s3cache_t = map<tuple<string, string, string, string>, Aws::S3::S3Client>; // { (endpoint, key, secret, region) -> S3Client }
s3cache_t s3cache_;
Aws::S3::S3Client& get_client(opt_string const& endpoint, opt_string const& access_key_id, opt_string const& secret_key, opt_string const& auth_region) override
{
lock_guard<mutex> lock{mutex_}; // it is ok to reuse this mutex for cache protection
auto& s3_endpoint = endpoint ? *endpoint : opts_.s3_endpoint ;
auto& s3_access_key_id = access_key_id ? *access_key_id : opts_.s3_access_key_id;
auto& s3_secret_key = secret_key ? *secret_key : opts_.s3_secret_key ;
auto& s3_auth_region = auth_region ? *auth_region : opts_.s3_auth_region ;
auto key = make_tuple(s3_endpoint, s3_access_key_id, s3_secret_key, s3_auth_region);
auto it = s3cache_.lower_bound(key);
if (it == s3cache_.end() || it->first != key)
it = s3cache_.emplace_hint(it, key, Aws::S3::S3Client(Aws::Auth::AWSCredentials(s3_access_key_id, s3_secret_key),
awsClientConfiguration(s3_endpoint, s3_auth_region, opts_.s3_request_timeout_ms, opts_.s3_connect_timeout_ms, opts_.s3_max_connections)));
return it->second;
}
As you see -- after all is done, I could end up with "global statics" not being initialized even though I have live instances of S3Client in cache. I think my crashes happen during destruction of this cache.
@marcomagdy What cmake cmdline I should use to build the same binary as one you publish in nuget? I tried different combinations -- every time I end up with dll size that differs from dll in nuget package. (my download speed ends up being 4 times slower too)
Caught it -- same problem as #781
> aws-cpp-sdk-core.dll!Aws::External::tinyxml2::StrPair::GetStr() Line 199 C++
aws-cpp-sdk-core.dll!Aws::External::tinyxml2::XMLNode::Value() Line 635 C++
aws-cpp-sdk-core.dll!Aws::Utils::Xml::XmlNode::GetName() Line 59 C++
aws-cpp-sdk-core.dll!Aws::Client::XmlErrorMarshaller::ContainsError(const Aws::Http::HttpResponse & httpResponse) Line 124 C++
aws-cpp-sdk-s3.dll!Aws::S3::S3Client::DoesResponseGenerateError(const std::shared_ptr<Aws::Http::HttpResponse> & response) Line 2801 C++
aws-cpp-sdk-core.dll!Aws::Client::AWSClient::AttemptOneRequest(const Aws::Http::URI & uri, const Aws::AmazonWebServiceRequest & request, Aws::Http::HttpMethod method, const char * signerName) Line 295 C++
aws-cpp-sdk-core.dll!Aws::Client::AWSClient::AttemptExhaustively(const Aws::Http::URI & uri, const Aws::AmazonWebServiceRequest & request, Aws::Http::HttpMethod method, const char * signerName) Line 174 C++
aws-cpp-sdk-core.dll!Aws::Client::AWSClient::MakeRequestWithUnparsedResponse(const Aws::Http::URI & uri, const Aws::AmazonWebServiceRequest & request, Aws::Http::HttpMethod method, const char * signerName) Line 341 C++
aws-cpp-sdk-s3.dll!Aws::S3::S3Client::GetObject(const Aws::S3::Model::GetObjectRequest & request) Line 1441 C++
To summarize:
InitializeGlobalStatics/CleanupGlobalStatics()
, but (if you copy/move S3Client object) generated copy ctor -- doesn't.InitializeGlobalStatics/CleanupGlobalStatics()
is not good and has races in it. My suggestion is to stay away from lockless programming -- it is exceptionally tricky (and AWS SDK switches off exceptions :) )Thank you.
Fixed in 1.4.55 see 832c839 for more details.
This fix isn't good -- there are still race conditions here. For example, imagine a case of two threads calling AWSClient
ctor and the first one (one that observed s_refCount++ == 0
being true) is preempted (or suspended) for a long time. Second thread will happily continue and eventually crash because global state isn't initialized (even though s_refCount is > 0).
Similarly, there is a race between dtor and ctor.
Edit: I can't reopen this issue (and too lazy to file a new one), gotta leave this on your conscience :-)
Whoops, that’s embarrassing. We’ll get that fixed. Thanks for spotting it.
@crusader-mike can you take a look at https://github.com/aws/aws-sdk-cpp/pull/879 see if I missed a use case?
Nope. You can't use atomic variable here to properly protect global state (unless your entire state is atomic variable). Think about it -- for other threads at any point in time there should be only two observable states: "s_refCount == 0 and global state is completely unitialized" and "s_refCount != 0 and global state is completely initialized". You are guaranteed to break this invariant by modifying s_refCount and global state separately (from other thread perspective). You have to use mutex here to "merge" these modifications into one (in eyes of observers).
P.S. I am probably repeating myself -- but stay away from lockless programming unless you are ready to sink significant time into it. Even if you get to the point when you can comfortably write such code -- necessary knowledge will take so much space in your brain that it will probably push out most of other C++-related stuff :)
I think I'll take @singku's suggestion and initialize this container in Aws::init. This whole ordeal was about trying to optimize that container so that it doesn't grow unbounded. But that reasoning is not strong enough to complicate the constructor's logic (and definitely doesn't warrant a mutex).
stay away from lockless programming unless you are ready to sink significant time into it.
I appreciate your concern. But if I'm going to stay away from hard programming paradigms, I might as well go program in JavaScript or Ruby 😄. But in all seriousness, I usually have my race-conditions covered well, and we have good engineers on the team that catch the things I miss. But obviously this case is an exception. I dropped the ball, but I'm glad you caught it.
It isn't just a concern -- I think every time I stumbled upon an atomic variable in SDK code there was a race condition nearby.
You can achieve similar effect using static variables or std::call_once -- though afaik they are implemented using mutex.
Yes, moving this into AwsInit sounds like a good idea.
atomic variables isn't necessarily better than mutex -- each atomic variable has a "mutex" associated with it (google MESI protocol) and price of using it grows with number of processors in your system and gets especially bad on modern servers (due to NUMA). A lot of wait-free algorithms turned out to be slower than straightforward mutex-protected code simply because with mutex you pay this price only twice (lock + unlock) but with atomic variables -- on every access.
If you are interested in lockless programming I suggest reading/watching these:
IIRC this code was written before aws init existed.
WRT luckless stuff, I agree.
Another aspect here in hindsight, lockless algorithms aren’t great for long term code viability. Even if they are correct when written, it increases cognitive load for all future developers and will inevitably be broken by future changes.
luckless™ code is amazing for generating cognitive load, lol. And sometimes it takes people years to realize that "it broke".
simply because with mutex you pay this price only twice (lock + unlock) but with atomic variables -- on every access.
That is not true. An atomic read of register-wide variables, like pointers and integers (at least on x86) is simply a memory load. Locking a mutex is a system call.
My low-lvl x86 is a bit rusty, but I am sure there is a fence on read -- even if you don't observe in generated asm because x86 is naturally consistent (and acquire-read is just a read), it still exists on higher level (compiler can't reorder certain things around it).
Also, afaik, on modern systems locking mutex is system call (or better to say -- switch to kernel mode) only if you have to wait. Grabbing mutex that isn't locked is inexpensive. In pre-futex times on Linux mutex was expensive, yes.
Unfortunately I’ve had to read a ton of clang and llvm source code lately and pthread_mutex using the fast mutex init is only a system call under contention. The same is true for SRWLocks on windows... std::mutex simply calls into those apis.
The days of critical sections and full mutex objects on windows are, thankfully, behind us..
Also, fwiw, you’re screwed on ARM, SPARC, and PowerPC regardless of which you use.
@JonathanHenson Are you referring to this or the fact that on these platforms memory model is more relaxed than C++11 model (which means you can't use C++ to squeeze the maximum performance-wise)?
Fixed in v1.4.57
Small note: since you've left ~AWSClient()
in place it probably makes sense to add move ctor and op=? E.g. smth like AWSClient(AWSClient&&) = default;
. Not sure if this will force you to do the same for copy ctor/op=.
I have to leave the dtor so it can be marked as virtual. I did have a defaulted copy/move ctors and assignment operators, but the code failed to build on VS2013, which unfortunately we still support. It is by far the worst C++11 compliant compiler.
Yeah, VS2013 was rough around the edges... Can you work around that like this:
struct goddamitbill { virtual ~goddamitbill() = default; };
class AwsClient : public goddamitbill { ... };
?
Platform: Windows SDK version: 1.3.31
I am chasing a very rare crash (deep in AWS SDK) that happens in
GetObject()
call. After spending more than a day trying to reproduce it I finally caught one incidence in a debugger. Unfortunately nuget package doesn't contain .pdb files and symbols package was not refreshed since 2016.If you have them laying around -- can you provide me with pdb files for x64 v140 Debug 1.3.31 build of -core and -s3 dlls, please?
Also, it would be really helpful if .symbols nuget packages were released along with redist packages (or maybe even included into them).
Code is very simple and I suspect a bug in SDK itself:
Stack: