Azure / azure-sdk-for-cpp

This repository is for active development of the Azure SDK for C++. For consumers of the SDK we recommend visiting our versioned developer docs at https://azure.github.io/azure-sdk-for-cpp.

SDK hang in libcurl curl_easy_recv() #5379

Closed bzhou-sw closed 8 months ago

bzhou-sw commented 9 months ago

Describe the bug

We have a binary (running on Ubuntu) that uses azure-sdk-for-cpp to fetch NSG flow logs from Azure. Some users reported that it hangs. We sent SIGABRT to the process to produce a core dump and saw that it was inside libcurl's curl_easy_recv(), called from Azure::Core::Http::CurlConnection::ReadFromSocket.

Exception or Stack Trace

Program terminated with signal SIGABRT, Aborted.

warning: Section `.reg-xstate/5475' in core file too small.
#0  __libc_read (nbytes=5, buf=0x55663d0870c3, fd=81) at ../sysdeps/unix/sysv/linux/read.c:26
26      ../sysdeps/unix/sysv/linux/read.c: No such file or directory.
[Current thread is 1 (Thread 0x7ff392cc9400 (LWP 5475))]
(gdb) bt
#0  __libc_read (nbytes=5, buf=0x55663d0870c3, fd=81) at ../sysdeps/unix/sysv/linux/read.c:26
#1  __libc_read (fd=81, buf=0x55663d0870c3, nbytes=5) at ../sysdeps/unix/sysv/linux/read.c:24
#2  0x00007ff395ae5409 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.1
#3  0x00007ff395ae06ae in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.1
#4  0x00007ff395adf504 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.1
#5  0x00007ff395adfad7 in BIO_read () from /lib/x86_64-linux-gnu/libcrypto.so.1.1
#6  0x00007ff395744b91 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#7  0x00007ff395748e1e in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#8  0x00007ff3957466d0 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#9  0x00007ff39574dc45 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#10 0x00007ff395758a3f in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#11 0x00007ff395758b47 in SSL_read () from /lib/x86_64-linux-gnu/libssl.so.1.1
#12 0x00007ff395d77a19 in ?? () from /lib/x86_64-linux-gnu/libcurl.so.4
#13 0x00007ff395d2ae4b in ?? () from /lib/x86_64-linux-gnu/libcurl.so.4
#14 0x00007ff395d3fde0 in curl_easy_recv () from /lib/x86_64-linux-gnu/libcurl.so.4
#15 0x00007ff3961f63c4 in Azure::Core::Http::CurlConnection::ReadFromSocket(unsigned char*, unsigned long, Azure::Core::Context const&) () from /lib/x86_64-linux-gnu/libazure-core.so
#16 0x00007ff3961f614e in Azure::Core::Http::CurlSession::ParseChunkSize(Azure::Core::Context const&) () from /lib/x86_64-linux-gnu/libazure-core.so
#17 0x00007ff3961f65ae in Azure::Core::Http::CurlSession::OnRead(unsigned char*, unsigned long, Azure::Core::Context const&) () from /lib/x86_64-linux-gnu/libazure-core.so
#18 0x00007ff396224f6c in Azure::Core::IO::BodyStream::ReadToCount(unsigned char*, unsigned long, Azure::Core::Context const&) () from /lib/x86_64-linux-gnu/libazure-core.so
#19 0x00007ff3962251fa in Azure::Core::IO::BodyStream::ReadToEnd(Azure::Core::Context const&) () from /lib/x86_64-linux-gnu/libazure-core.so
#20 0x00007ff396220a0e in Azure::Core::Http::Policies::_internal::TransportPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Http::Policies::NextHttpPolicy, Azure::Core::Context const&) const () from /lib/x86_64-linux-gnu/libazure-core.so
#21 0x00007ff39621b298 in Azure::Core::Http::Policies::NextHttpPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Context const&) () from /lib/x86_64-linux-gnu/libazure-core.so
#22 0x00007ff39621920b in Azure::Core::Http::Policies::_internal::LogPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Http::Policies::NextHttpPolicy, Azure::Core::Context const&) const () from /lib/x86_64-linux-gnu/libazure-core.so
#23 0x00007ff39621b298 in Azure::Core::Http::Policies::NextHttpPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Context const&) () from /lib/x86_64-linux-gnu/libazure-core.so
#24 0x00007ff39621cd47 in Azure::Core::Http::Policies::_internal::RequestActivityPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Http::Policies::NextHttpPolicy, Azure::Core::Context const&) const () from /lib/x86_64-linux-gnu/libazure-core.so
#25 0x00007ff39621b298 in Azure::Core::Http::Policies::NextHttpPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Context const&) () from /lib/x86_64-linux-gnu/libazure-core.so
#26 0x00007ff3958cf9af in Azure::Storage::_internal::StoragePerRetryPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Http::Policies::NextHttpPolicy, Azure::Core::Context const&) const
    () from /lib/x86_64-linux-gnu/libazure-storage-common.so
#27 0x00007ff39621b298 in Azure::Core::Http::Policies::NextHttpPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Context const&) () from /lib/x86_64-linux-gnu/libazure-core.so
#28 0x00007ff3958d01ef in Azure::Storage::_internal::StorageSwitchToSecondaryPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Http::Policies::NextHttpPolicy, Azure::Core::Context const&) const () from /lib/x86_64-linux-gnu/libazure-storage-common.so
#29 0x00007ff39621b298 in Azure::Core::Http::Policies::NextHttpPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Context const&) () from /lib/x86_64-linux-gnu/libazure-core.so
#30 0x00007ff39621f853 in Azure::Core::Http::Policies::_internal::RetryPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Http::Policies::NextHttpPolicy, Azure::Core::Context const&) const () from /lib/x86_64-linux-gnu/libazure-core.so
#31 0x00007ff39621b298 in Azure::Core::Http::Policies::NextHttpPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Context const&) () from /lib/x86_64-linux-gnu/libazure-core.so
#32 0x00007ff396220596 in Azure::Core::Http::Policies::_internal::TelemetryPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Http::Policies::NextHttpPolicy, Azure::Core::Context const&) const () from /lib/x86_64-linux-gnu/libazure-core.so
#33 0x00007ff39621b298 in Azure::Core::Http::Policies::NextHttpPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Context const&) () from /lib/x86_64-linux-gnu/libazure-core.so
#34 0x00007ff39621b255 in Azure::Core::Http::Policies::NextHttpPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Context const&) () from /lib/x86_64-linux-gnu/libazure-core.so
#35 0x00007ff3962b7b64 in Azure::Storage::_internal::StorageServiceVersionPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Http::Policies::NextHttpPolicy, Azure::Core::Context const&) const () from /lib/x86_64-linux-gnu/libazure-storage-blobs.so
#36 0x00007ff3963698e7 in Azure::Storage::Blobs::_detail::BlockBlobClient::GetBlockList(Azure::Core::Http::_internal::HttpPipeline&, Azure::Core::Url const&, Azure::Storage::Blobs::_detail::BlockBlobClient::GetBlockBlobBlockListOptions const&, Azure::Core::Context const&) () from /lib/x86_64-linux-gnu/libazure-storage-blobs.so
#37 0x00007ff3962f5184 in Azure::Storage::Blobs::BlockBlobClient::GetBlockList(Azure::Storage::Blobs::GetBlockListOptions const&, Azure::Core::Context const&) const ()
   from /lib/x86_64-linux-gnu/libazure-storage-blobs.so
#38 0x000055663bcbaca3 in proc_leaf_blobs(Azure::Storage::Blobs::BlobContainerClient&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, char*, long, long, int) ()
#39 0x000055663bcbe8ab in azure_nsg_loop ()
#40 0x000055663bc8d737 in main ()

To Reproduce

With the same binary, we could not reproduce the issue in our lab, but multiple users have seen it occur at random on their systems. Could it be related to some network activity or characteristic?

Expected behavior

The SDK should not hang.

bzhou-sw commented 9 months ago

Code calling GetBlockList():

...
  auto blockBlobClient = containerClient.GetBlockBlobClient(blob_name);
  auto blkListResp = blockBlobClient.GetBlockList();
  size_t n_blocks = blkListResp.Value.CommittedBlocks.size();
...

LarryOsterman commented 9 months ago

@Jinming-Hu, @vinjiang This looks like the service didn't respond with the full data being requested. The call stack indicates that the SDK is blocked waiting on data to be sent from the service. Is that possible?
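
The innermost frames in the stack trace are a plain read() on file descriptor 81, which blocks for as long as the peer sends nothing. As a minimal sketch of how a wait like that can be bounded, the helper below polls a raw descriptor with a deadline before reading. `ReadWithTimeout` is a hypothetical name, not part of the Azure SDK, libcurl, or OpenSSL, and the sketch ignores the buffering that a TLS layer adds on top of the socket.

```cpp
#include <cerrno>
#include <chrono>
#include <cstddef>
#include <stdexcept>

#include <poll.h>
#include <unistd.h>

// Hypothetical helper (not an SDK or libcurl API): waits until `fd` is
// readable or `timeout` elapses, then performs a single read().
// Returns the number of bytes read (0 means the peer closed the connection).
// Throws std::runtime_error on timeout or on a poll/read error.
// Assumes `timeout` fits in an int number of milliseconds.
inline std::size_t ReadWithTimeout(
    int fd,
    unsigned char* buffer,
    std::size_t count,
    std::chrono::milliseconds timeout)
{
  using Clock = std::chrono::steady_clock;
  auto const deadline = Clock::now() + timeout;

  for (;;)
  {
    // Recompute the remaining budget so retries never extend the deadline.
    auto remaining = std::chrono::duration_cast<std::chrono::milliseconds>(
        deadline - Clock::now());
    if (remaining.count() < 0)
    {
      remaining = std::chrono::milliseconds(0);
    }

    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLIN;

    int const ready = ::poll(&pfd, 1, static_cast<int>(remaining.count()));
    if (ready == 0)
    {
      throw std::runtime_error("read timed out");
    }
    if (ready < 0)
    {
      if (errno == EINTR)
      {
        continue; // Interrupted by a signal: poll again with the time left.
      }
      throw std::runtime_error("poll failed");
    }

    ssize_t const n = ::read(fd, buffer, count);
    if (n >= 0)
    {
      return static_cast<std::size_t>(n);
    }
    if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK)
    {
      continue; // Spurious wakeup or interruption: wait again.
    }
    throw std::runtime_error("read failed");
  }
}
```

With a helper like this, a connection that stops delivering data surfaces as an exception after the timeout instead of blocking the thread indefinitely.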

Jinming-Hu commented 9 months ago

This looks like the service didn't respond with the full data being requested

@LarryOsterman I've never seen this happen before. Is there a default timeout if the customer doesn't specify one in the context? If there is a timeout, the request will eventually fail with an exception, and then either the retry policy will kick in or the customer can retry themselves.
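
The "customer can retry themselves" option can be sketched as a small generic wrapper. This is an illustration only: `RetryWithBackoff` is a hypothetical helper, not the SDK's RetryPolicy, and it only helps when the underlying call actually fails with an exception rather than blocking forever.

```cpp
#include <chrono>
#include <exception>
#include <thread>

// Hypothetical helper: calls `operation` up to `maxAttempts` times, sleeping
// between failed attempts and doubling the delay each time. If every attempt
// throws, the exception from the last attempt is rethrown to the caller.
template <typename Operation>
auto RetryWithBackoff(
    Operation&& operation,
    int maxAttempts,
    std::chrono::milliseconds initialDelay) -> decltype(operation())
{
  auto delay = initialDelay;
  for (int attempt = 1;; ++attempt)
  {
    try
    {
      return operation();
    }
    catch (std::exception const&)
    {
      if (attempt >= maxAttempts)
      {
        throw; // Out of attempts: surface the last error.
      }
    }
    std::this_thread::sleep_for(delay);
    delay *= 2; // Exponential backoff between attempts.
  }
}
```

A caller would pass a lambda that performs the request, for example one that wraps the GetBlockList() call shown earlier in the thread.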

My questions for @bzhou-sw:

  1. Did you also see it hang in functions other than GetBlockList()?
  2. Can you collect the timestamp of the request, the URL (account name, container name, blob name), and the client request ID, and provide this information to us within 48 hours the next time you have a repro? This would help with server-side troubleshooting.

Jinming-Hu commented 9 months ago

@bzhou-sw is it possible to share the dump file via email? You can find my email address on my GitHub profile.

bzhou-sw commented 9 months ago

@LarryOsterman We could not reproduce the issue with our test Azure account either, but multiple customers have hit it on their systems, randomly yet consistently (that is, if they restart the process, it eventually hangs again, after hours or a couple of days). We are using default parameters (i.e. we didn't set any retry policy).

To your questions:

  1. In the 3 core dumps we got from the same customer, the hang is always in GetBlockList().
  2. I don't have the customer's account info or access to their system. The issue can take hours or days to appear, so it is really hard to collect useful info, since they don't monitor their system in real time.

@Jinming-Hu I will contact you. Thanks.
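
Because the hang can take days to appear on unmonitored systems, one option is for the application to log the requested details before every call, so that the last line in the log identifies the request that hung. The sketch below is a hypothetical illustration: `RequestRecord` and `FormatRequestRecord` are not SDK types, and it assumes the application already knows the URL and the client request ID it sent (Azure Storage carries that ID in the x-ms-client-request-id header).

```cpp
#include <chrono>
#include <ctime>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical record of the details requested for server-side
// troubleshooting: timestamp, URL, and client request ID.
struct RequestRecord
{
  std::chrono::system_clock::time_point StartedAt;
  std::string Url;             // Contains account, container, and blob name.
  std::string ClientRequestId; // Value sent as x-ms-client-request-id.
};

// Formats the record as one log line with a UTC timestamp (second precision).
inline std::string FormatRequestRecord(RequestRecord const& record)
{
  std::time_t const seconds
      = std::chrono::system_clock::to_time_t(record.StartedAt);
  std::tm utc{};
  gmtime_r(&seconds, &utc); // POSIX, thread-safe variant of gmtime().

  std::ostringstream line;
  line << std::put_time(&utc, "%Y-%m-%dT%H:%M:%SZ") << " url=" << record.Url
       << " client-request-id=" << record.ClientRequestId;
  return line.str();
}
```

Writing and flushing this line immediately before the request means the details survive even if the process later has to be killed to obtain a core dump.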

github-actions[bot] commented 8 months ago

Hi @bzhou-sw. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text "/unresolve" to remove the "issue-addressed" label and continue the conversation.

Jinming-Hu commented 8 months ago

This issue is gone after upgrading to the latest code.

bzhou-sw commented 8 months ago

To be clear, the issue seems to be gone when we build with this commit (which is about 1 year old): Added checks to help diagnose intermittent globalCleanUp test failure (#4593)

I tried to upgrade to the latest version, but the build failed due to a uamqp dependency issue, and I don't know how to solve that.

I really hope the SDK team can support building without vcpkg, or at least give some instructions on how to solve the issue.

LarryOsterman commented 8 months ago

If you're acquiring storage from vcpkg, then vcpkg should manage dependencies for you, and you won't need any additional dependencies.

If you're acquiring storage by using a git submodule, then you need to manage dependencies for storage. That means you'll need:

  1. azure-core-cpp[curl, http, winhttp]: curl, vcpkg-cmake, vcpkg-cmake-config, wil
  2. azure-core-amqp-cpp: azure-c-shared-utility, azure-core-cpp, azure-macro-utils-c, umock-c, vcpkg-cmake, vcpkg-cmake-config

You may also need a dependency on opentelemetry-cpp, but that can be removed if needed.

ronniegeraghty commented 8 months ago

Hi @bzhou-sw, can you let me know why you are unable to, or would prefer not to, use vcpkg to acquire libraries from the Azure SDK for C++?

github-actions[bot] commented 8 months ago

Hi @bzhou-sw, since you haven’t asked that we /unresolve the issue, we’ll close this out. If you believe further discussion is needed, please add a comment /unresolve to reopen the issue.