Azure / azure-storage-cpp

Microsoft Azure Storage Client Library for C++
http://azure.github.io/azure-storage-cpp
Apache License 2.0

All SDK threads eventually block forever in protocol::parse_extended_error #364

Open kiril-kirov opened 4 years ago

kiril-kirov commented 4 years ago

Hi,

We stumbled upon a very strange situation where all threads (started by the SDK) eventually block in parse_extended_error. This happens during DELETE blob operations when trying to delete non-existing blobs. At the beginning an appropriate error is returned (that the blob does not exist), but after a while all threads block indefinitely while trying to parse the errors.

After a lot of digging, I finally have a minimal repro case, which is nothing special - it keeps sending 64 parallel DELETE operations for blobs that do not exist.
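
A hedged sketch of the kind of repro described above (the actual attached test is not reproduced here; the connection string, container name, and blob name are placeholders):

#include <was/blob.h>
#include <was/storage_account.h>
#include <vector>

int main()
{
    auto account = azure::storage::cloud_storage_account::parse(
        U("DefaultEndpointsProtocol=https;AccountName=...;AccountKey=..."));
    auto container = account.create_cloud_blob_client()
                            .get_container_reference(U("test-container"));

    for (;;)
    {
        // Schedule 64 parallel deletes of a blob that does not exist.
        std::vector<pplx::task<void>> tasks;
        for (int i = 0; i < 64; ++i)
        {
            auto blob = container.get_block_blob_reference(U("no-such-blob"));
            tasks.push_back(blob.delete_blob_async());
        }
        // Each delete fails with 404; eventually the waits stop returning.
        for (auto& t : tasks)
        {
            try { t.wait(); } catch (const azure::storage::storage_exception&) {}
        }
    }
}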


The Azure SDK that we use is 6.1.0, as 7+ requires GCC 5.1, while we need to use 4.9. Tested on Ubuntu. The test application works fine for some time (usually a few minutes) and at some point everything stops. Responses are never returned to the caller and waiting for the scheduled tasks practically never ends. This doesn't seem to happen in debug builds (or it happens too rarely), only in release builds.

The test app seems to have 42 threads, and 40 of them are blocked in the same way. The other two are the main thread and an io_service thread.


More details: here's a stack trace of a thread, the rest are the same (or at least very similar):

Thread 41 (Thread 0x7f3855491700 (LWP 84472)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f37b0002610) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f37b00025c0, cond=0x7f37b00025e8) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7f37b00025e8, mutex=0x7f37b00025c0) at pthread_cond_wait.c:638
#3  0x00007f38690eae20 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x000000000063f37b in std::condition_variable::wait<pplx::details::event_impl::wait(unsigned int)::<lambda()> > (__p=..., __lock=..., this=0x7f37b00025e8) at /usr/include/c++/4.9/condition_variable:98
#5  pplx::details::event_impl::wait (timeout=4294967295, this=0x7f37b00025c0) at /home/ubuntu/work/toolset/casablanca/2.10.14/Build/Linux/x86_64/include/pplx/pplxlinux.h:98
#6  pplx::details::event_impl::wait (this=0x7f37b00025c0) at /home/ubuntu/work/toolset/casablanca/2.10.14/Build/Linux/x86_64/include/pplx/pplxlinux.h:113
#7  pplx::details::_TaskCollectionImpl::_Wait (this=0x7f37b00025c0) at /home/ubuntu/work/toolset/casablanca/2.10.14/Build/Linux/x86_64/include/pplx/pplx.h:161
#8  pplx::details::_TaskCollectionImpl::_RunAndWait (this=0x7f37b00025c0) at /home/ubuntu/work/toolset/casablanca/2.10.14/Build/Linux/x86_64/include/pplx/pplx.h:158
#9  pplx::details::_Task_impl_base::_Wait (this=0x7f37b0002560) at /home/ubuntu/work/toolset/casablanca/2.10.14/Build/Linux/x86_64/include/pplx/pplxtasks.h:1777
#10 0x0000000000732adb in pplx::task<unsigned long>::get (this=0x7f38554908a0) at /home/build/toolset/casablanca/2.10.14/Build/Linux/x86_64/include/pplx/pplxtasks.h:3625
#11 azure::storage::core::xml::xml_reader::initialize (this=this@entry=0x7f38554909f0, stream=...) at /home/build/toolset/azure-storage-cpp/6.1.0/Source/Microsoft.WindowsAzure.Storage/src/xmlhelpers.cpp:100
#12 0x000000000073a17e in azure::storage::core::xml::xml_reader::xml_reader (stream=..., this=<optimized out>) at /home/build/toolset/azure-storage-cpp/6.1.0/Source/Microsoft.WindowsAzure.Storage/includes/wascore/xmlhelpers.h:104
#13 azure::storage::protocol::storage_error_reader::storage_error_reader (error_response=..., this=<optimized out>) at /home/build/toolset/azure-storage-cpp/6.1.0/Source/Microsoft.WindowsAzure.Storage/includes/wascore/protocol_xml.h:37
#14 azure::storage::protocol::parse_extended_error (response=...) at /home/build/toolset/azure-storage-cpp/6.1.0/Source/Microsoft.WindowsAzure.Storage/src/response_parsers.cpp:330
#15 0x000000000073c973 in azure::storage::request_result::parse_body (this=this@entry=0x7f3855490ba0, response=...) at /home/build/toolset/azure-storage-cpp/6.1.0/Source/Microsoft.WindowsAzure.Storage/src/request_result.cpp:75
#16 0x000000000073d03b in azure::storage::request_result::request_result (this=0x7f3855490ba0, start_time=..., target_location=<optimized out>, response=..., parse_body_as_error=<optimized out>) at /home/build/toolset/azure-storage-cpp/6.1.0/Source/Microsoft.WindowsAzure.Storage/src/request_result.cpp:37
#17 0x00000000007260e8 in azure::storage::core::executor_impl::<lambda()>::<lambda(pplx::task<web::http::http_response>)>::<lambda(pplx::task<web::http::http_response>)>::operator() (get_error_body_task=..., __closure=0x7f3828031d60) at /home/build/toolset/azure-storage-cpp/6.1.0/Source/Microsoft.WindowsAzure.Storage/src/executor.cpp:212
#18 std::_Function_handler<web::http::http_response(pplx::task<web::http::http_response>), azure::storage::core::executor_impl::execute_async(std::shared_ptr<azure::storage::core::storage_command_base>, const azure::storage::request_options&, azure::storage::operation_context)::<lambda()>::<lambda(pplx::task<web::http::http_response>)>::<lambda(pplx::task<web::http::http_response>)> >::_M_invoke(const std::_Any_data &, pplx::task<web::http::http_response>) (__functor=..., __args#0=...) at /usr/include/c++/4.9/functional:2025
#19 0x00000000007290c3 in std::function<web::http::http_response (pplx::task<web::http::http_response>)>::operator()(pplx::task<web::http::http_response>) const (__args#0=..., this=0x7f3855490ca0) at /usr/include/c++/4.9/functional:2439
#20 pplx::task<web::http::http_response>::_ContinuationTaskHandle<web::http::http_response, web::http::http_response, azure::storage::core::executor_impl::execute_async(std::shared_ptr<azure::storage::core::storage_command_base>, const azure::storage::request_options&, azure::storage::operation_context)::<lambda()>::<lambda(pplx::task<web::http::http_response>)>::<lambda(pplx::task<web::http::http_response>)>, std::integral_constant<bool, true>, pplx::details::_TypeSelectorNoAsync>::_LogWorkItemAndInvokeUserLambda<std::function<web::http::http_response(pplx::task<web::http::http_response>)>, pplx::task<web::http::http_response> > (this=<optimized out>, _value=..., _func=...) at /home/build/toolset/casablanca/2.10.14/Build/Linux/x86_64/include/pplx/pplxtasks.h:3935
#21 pplx::task<web::http::http_response>::_ContinuationTaskHandle<web::http::http_response, web::http::http_response, azure::storage::core::executor_impl::execute_async(std::shared_ptr<azure::storage::core::storage_command_base>, const azure::storage::request_options&, azure::storage::operation_context)::<lambda()>::<lambda(pplx::task<web::http::http_response>)>::<lambda(pplx::task<web::http::http_response>)>, std::integral_constant<bool, true>, pplx::details::_TypeSelectorNoAsync>::_Continue (this=0x1f78b10) at /home/build/toolset/casablanca/2.10.14/Build/Linux/x86_64/include/pplx/pplxtasks.h:4057
#22 pplx::task<web::http::http_response>::_ContinuationTaskHandle<web::http::http_response, web::http::http_response, azure::storage::core::executor_impl::execute_async(std::shared_ptr<azure::storage::core::storage_command_base>, const azure::storage::request_options&, azure::storage::operation_context)::<lambda()>::<lambda(pplx::task<web::http::http_response>)>::<lambda(pplx::task<web::http::http_response>)>, std::integral_constant<bool, true>, pplx::details::_TypeSelectorNoAsync>::_Perform (this=0x1f78b10) at /home/build/toolset/casablanca/2.10.14/Build/Linux/x86_64/include/pplx/pplxtasks.h:3938
#23 pplx::details::_PPLTaskHandle<web::http::http_response, pplx::task<web::http::http_response>::_ContinuationTaskHandle<web::http::http_response, web::http::http_response, azure::storage::core::executor_impl::execute_async(std::shared_ptr<azure::storage::core::storage_command_base>, const azure::storage::request_options&, azure::storage::operation_context)::<lambda()>::<lambda(pplx::task<web::http::http_response>)>::<lambda(pplx::task<web::http::http_response>)>, std::integral_constant<bool, true>, pplx::details::_TypeSelectorNoAsync>, pplx::details::_ContinuationTaskHandleBase>::invoke(void) const (this=0x1f78b10) at /home/build/toolset/casablanca/2.10.14/Build/Linux/x86_64/include/pplx/pplxtasks.h:1645
#24 0x00000000006363bf in pplx::details::_TaskProcHandle::_RunChoreBridge (_Parameter=0x1f78b10) at /home/ubuntu/work/toolset/casablanca/2.10.14/Build/Linux/x86_64/include/pplx/pplx.h:113
#25 0x00000000008323ce in boost::_bi::list1<boost::_bi::value<void*> >::operator()<void (*)(void*), boost::_bi::list0> (f=<synthetic pointer>: <optimized out>, a=<synthetic pointer>..., this=<synthetic pointer>) at /home/build/toolset/boost/1.65.1/Source/boost/bind/bind.hpp:259
#26 boost::_bi::bind_t<void, void (*)(void*), boost::_bi::list1<boost::_bi::value<void*> > >::operator() (this=<synthetic pointer>) at /home/build/toolset/boost/1.65.1/Source/boost/bind/bind.hpp:1294
#27 boost::asio::asio_handler_invoke<boost::_bi::bind_t<void, void (*)(void*), boost::_bi::list1<boost::_bi::value<void*> > > > (function=<synthetic pointer>...) at /home/build/toolset/boost/1.65.1/Source/boost/asio/handler_invoke_hook.hpp:69
#28 boost_asio_handler_invoke_helpers::invoke<boost::_bi::bind_t<void, void (*)(void*), boost::_bi::list1<boost::_bi::value<void*> > >, boost::_bi::bind_t<void, void (*)(void*), boost::_bi::list1<boost::_bi::value<void*> > > > (context=<synthetic pointer>..., function=<synthetic pointer>...) at /home/build/toolset/boost/1.65.1/Source/boost/asio/detail/handler_invoke_helpers.hpp:37
#29 boost::asio::detail::completion_handler<boost::_bi::bind_t<void, void (*)(void*), boost::_bi::list1<boost::_bi::value<void*> > > >::do_complete (owner=0x1ad6a20, base=<optimized out>) at /home/build/toolset/boost/1.65.1/Source/boost/asio/detail/completion_handler.hpp:68
#30 0x0000000000833c03 in boost::asio::detail::task_io_service_operation::complete (bytes_transferred=<optimized out>, ec=..., owner=..., this=0x7f3810012720) at /home/build/toolset/boost/1.65.1/Source/boost/asio/detail/task_io_service_operation.hpp:38
#31 boost::asio::detail::task_io_service::do_run_one (ec=..., this_thread=..., lock=..., this=<optimized out>) at /home/build/toolset/boost/1.65.1/Source/boost/asio/detail/impl/task_io_service.ipp:372
#32 boost::asio::detail::task_io_service::run (this=0x1ad6a20, ec=...) at /home/build/toolset/boost/1.65.1/Source/boost/asio/detail/impl/task_io_service.ipp:149
#33 0x0000000000832e26 in boost::asio::io_service::run (this=0xd57848 <(anonymous namespace)::initialize_shared_threadpool(unsigned long)::uninit_threadpool+8>) at /home/build/toolset/boost/1.65.1/Source/boost/asio/impl/io_service.ipp:59
#34 (anonymous namespace)::threadpool_impl::thread_start (arg=0xd57840 <(anonymous namespace)::initialize_shared_threadpool(unsigned long)::uninit_threadpool>) at /home/build/toolset/casablanca/2.10.14/Source/Release/src/pplx/threadpool.cpp:79
#35 (anonymous namespace)::threadpool_impl::<lambda()>::operator() (__closure=<optimized out>) at /home/build/toolset/casablanca/2.10.14/Source/Release/src/pplx/threadpool.cpp:64
#36 boost::asio::detail::posix_thread::func<(anonymous namespace)::threadpool_impl::add_thread()::<lambda()> >::run(void) (this=<optimized out>) at /home/build/toolset/boost/1.65.1/Source/boost/asio/detail/posix_thread.hpp:82
#37 0x000000000083318f in boost::asio::detail::boost_asio_detail_posix_thread_function (arg=0x1ada620) at /home/build/toolset/boost/1.65.1/Source/boost/asio/detail/impl/posix_thread.ipp:64
#38 0x00007f3869204609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#39 0x00007f3868de0103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I've attached the minimal repro case (the test application).

I thought about pasting a link to the binary (with debug symbols) and a core dump, but the core dump contains my credentials.

Any advice would be very helpful and appreciated.

Thanks, Kiril

Jinming-Hu commented 4 years ago

This looks very much like a known issue. We've fixed many hang issues since 6.1.0. Can you find a test environment where you can reproduce this issue, install a newer version of GCC, build the latest release, and see if the issue persists?

kiril-kirov commented 4 years ago

I can try that, although it will be challenging. Any idea about specific fixes that I could cherry-pick? Do you know if they are in this SDK specifically or in one of the dependencies (like cpprest)?

Even if it works, I'll still need to do something different from just updating the SDK version (because of GCC); that's why I'm asking about specific fixes.

Thanks a lot, Kiril


A bit more detail in the meantime: it looks like a storage_exception occurs and .what() does not contain anything (although it's optimized out, so I can't see the details in the exception). This leads to:

I'm trying to get something useful from the stream buffers, but GDB has some issues with so many templates.

Jinming-Hu commented 4 years ago

We had hang issues in both cpprestsdk and the storage SDK, and I believe most of them (if not all) have been fixed in the latest release. If upgrading is challenging, you can also try linking the 6.1.0 release against the latest cpprestsdk and see if that fixes this hang issue.

kiril-kirov commented 4 years ago

I tried using the latest version of cpprest and it hung again. I will try the latest version of the Azure SDK. I'll keep you updated. Thanks.

kiril-kirov commented 4 years ago

I compiled Casablanca and the Azure Storage SDK with GCC 9 on Ubuntu 20.04. The situation is the same - the stacks are a bit different, but they all end up in parse_extended_error again. Now it gets stuck even faster, although it's hard to tell for sure.

The test application is exactly the same, I just changed the build script.


Some additional information:

Dependencies:

 ldd ./test 
        linux-vdso.so.1 (0x00007fff9894e000)
        libazurestorage.so.7 => /home/ubuntu/work/toolset/azure-storage-cpp/7.5.0/Source/Microsoft.WindowsAzure.Storage/build.release/Binaries/libazurestorage.so.7 (0x00007f69f26d7000)
        libcpprest.so.2.10 => /home/ubuntu/work/toolset/casablanca/2.10.16/Source/build.release/Release/Binaries/libcpprest.so.2.10 (0x00007f69f233f000)
        libcrypto.so.1.1 => /lib/x86_64-linux-gnu/libcrypto.so.1.1 (0x00007f69f2050000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f69f202d000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f69f1e4c000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f69f1e2f000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f69f1c3d000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f69f2dba000)
        libboost_log.so.1.71.0 => /lib/x86_64-linux-gnu/libboost_log.so.1.71.0 (0x00007f69f1b43000)
        libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x00007f69f1b3a000)
        libxml2.so.2 => /lib/x86_64-linux-gnu/libxml2.so.2 (0x00007f69f1980000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f69f1831000)
        libssl.so.1.1 => /lib/x86_64-linux-gnu/libssl.so.1.1 (0x00007f69f179c000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f69f1780000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f69f177a000)
        libboost_filesystem.so.1.71.0 => /lib/x86_64-linux-gnu/libboost_filesystem.so.1.71.0 (0x00007f69f175c000)
        libboost_thread.so.1.71.0 => /lib/x86_64-linux-gnu/libboost_thread.so.1.71.0 (0x00007f69f1730000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f69f1723000)
        libicuuc.so.66 => /lib/x86_64-linux-gnu/libicuuc.so.66 (0x00007f69f153d000)
        liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00007f69f1516000)
        libicudata.so.66 => /lib/x86_64-linux-gnu/libicudata.so.66 (0x00007f69efa55000)

The new build script:

#!/bin/bash

ROOT_PATH="/home/ubuntu/work/toolset"

set -x

g++ -std=gnu++11 -fPIC -Wall -Werror -Wno-sign-compare -Wno-unused-local-typedefs -Werror=unused-variable -pthread -D_FILE_OFFSET_BITS=64 -g -O3 \
    -isystem /home/ubuntu/work/toolset/azure-storage-cpp/7.5.0/Source/Microsoft.WindowsAzure.Storage/includes \
    -isystem /home/ubuntu/work/toolset/casablanca/2.10.16/Source/Release/include \
    -c main.cpp

g++ -fPIC -pthread -rdynamic main.o \
    -L/home/ubuntu/work/toolset/azure-storage-cpp/7.5.0/Source/Microsoft.WindowsAzure.Storage/build.release/Binaries -lazurestorage \
    -L/home/ubuntu/work/toolset/casablanca/2.10.16/Source/build.release/Release/Binaries -lcpprest \
    -lboost_system -lcrypto -lssl -lcpprest -lboost_log -lboost_thread -lxml2 -luuid -lz \
    -ldl -lrt -lpthread -rdynamic \
    -o test

The new stacks are very similar to the ones in 6.1.0 (with 2.10.14), but a bit shorter:

(I patched the CMakeLists.txt to add debug symbols, i.e. just added -ggdb3.)

#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f72c80071f0) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f72c80071a0, cond=0x7f72c80071c8) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7f72c80071c8, mutex=0x7f72c80071a0) at pthread_cond_wait.c:638
#3  0x00007f7375f9be20 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x000056221752da03 in std::condition_variable::wait<pplx::details::event_impl::wait(unsigned int)::{lambda()#1}>(std::unique_lock<std::mutex>&, pplx::details::event_impl::wait(unsigned int)::{lambda()#1}) (__p=..., __lock=..., this=0x7f72c80071c8)
    at /home/ubuntu/work/toolset/casablanca/2.10.16/Source/Release/include/pplx/pplxlinux.h:87
#5  pplx::details::event_impl::wait (timeout=4294967295, this=0x7f72c80071a0) at /home/ubuntu/work/toolset/casablanca/2.10.16/Source/Release/include/pplx/pplxlinux.h:87
#6  pplx::details::event_impl::wait (this=0x7f72c80071a0) at /home/ubuntu/work/toolset/casablanca/2.10.16/Source/Release/include/pplx/pplxlinux.h:102
#7  pplx::details::_TaskCollectionImpl::_Wait (this=0x7f72c80071a0) at /home/ubuntu/work/toolset/casablanca/2.10.16/Source/Release/include/pplx/pplx.h:161
#8  pplx::details::_TaskCollectionImpl::_RunAndWait (this=0x7f72c80071a0) at /home/ubuntu/work/toolset/casablanca/2.10.16/Source/Release/include/pplx/pplx.h:158
#9  pplx::details::_Task_impl_base::_Wait (this=0x7f72c8007140) at /home/ubuntu/work/toolset/casablanca/2.10.16/Source/Release/include/pplx/pplxtasks.h:1777
#10 0x00007f7376a03252 in pplx::task<unsigned long>::get (this=0x7f73602a2580) at /usr/local/include/pplx/pplxtasks.h:3618
#11 azure::storage::core::xml::xml_reader::initialize (this=this@entry=0x7f73602a27e0, stream=...) at /home/ubuntu/work/toolset/azure-storage-cpp/7.5.0/Source/Microsoft.WindowsAzure.Storage/src/xmlhelpers.cpp:100
#12 0x00007f7376a17a22 in azure::storage::core::xml::xml_reader::xml_reader (stream=..., this=0x7f73602a27e0) at /usr/include/c++/9/ext/atomicity.h:96
#13 azure::storage::protocol::storage_error_reader::storage_error_reader (error_response=..., this=<optimized out>) at /home/ubuntu/work/toolset/azure-storage-cpp/7.5.0/Source/Microsoft.WindowsAzure.Storage/includes/wascore/protocol_xml.h:37
#14 azure::storage::protocol::parse_extended_error (response=...) at /home/ubuntu/work/toolset/azure-storage-cpp/7.5.0/Source/Microsoft.WindowsAzure.Storage/src/response_parsers.cpp:384
#15 0x00007f7376a1b150 in azure::storage::request_result::parse_body (this=this@entry=0x7f73602a2a20, response=...) at /home/ubuntu/work/toolset/azure-storage-cpp/7.5.0/Source/Microsoft.WindowsAzure.Storage/src/request_result.cpp:78
#16 0x00007f7376a1c06b in azure::storage::request_result::request_result (this=0x7f73602a2a20, start_time=..., target_location=<optimized out>, response=..., parse_body_as_error=<optimized out>)
    at /home/ubuntu/work/toolset/azure-storage-cpp/7.5.0/Source/Microsoft.WindowsAzure.Storage/src/request_result.cpp:38
#17 0x00007f73769d92e4 in azure::storage::core::executor_impl::<lambda()>::<lambda(pplx::task<web::http::http_response>)>::<lambda(pplx::task<web::http::http_response>)>::operator()(pplx::task<web::http::http_response>) const (__closure=0x562218e4bec0, 
    get_error_body_task=...) at /home/ubuntu/work/toolset/azure-storage-cpp/7.5.0/Source/Microsoft.WindowsAzure.Storage/src/executor.cpp:224
#18 0x00007f73769d9593 in std::_Function_handler<web::http::http_response(pplx::task<web::http::http_response>), azure::storage::core::executor_impl::execute_async(std::shared_ptr<azure::storage::core::storage_command_base>, const azure::storage::request_options&, azure::storage::operation_context)::<lambda()>::<lambda(pplx::task<web::http::http_response>)>::<lambda(pplx::task<web::http::http_response>)> >::_M_invoke(const std::_Any_data &, pplx::task<web::http::http_response> &&) (__functor=..., __args#0=...)
    at /usr/include/c++/9/bits/shared_ptr_base.h:1176
#19 0x00007f73769d972a in std::function<web::http::http_response (pplx::task<web::http::http_response>)>::operator()(pplx::task<web::http::http_response>) const (__args#0=..., this=0x7f73602a2c00) at /usr/include/c++/9/bits/std_function.h:683
#20 pplx::task<web::http::http_response>::_ContinuationTaskHandle<web::http::http_response, web::http::http_response, azure::storage::core::executor_impl::execute_async(std::shared_ptr<azure::storage::core::storage_command_base>, const azure::storage::request_options&, azure::storage::operation_context)::<lambda()>::<lambda(pplx::task<web::http::http_response>)>::<lambda(pplx::task<web::http::http_response>)>, std::integral_constant<bool, true>, pplx::details::_TypeSelectorNoAsync>::_LogWorkItemAndInvokeUserLambda<std::function<web::http::http_response(pplx::task<web::http::http_response>)>, pplx::task<web::http::http_response> > (this=0x56221924c5c0, _value=..., _func=...) at /usr/local/include/pplx/pplxtasks.h:3935
#21 pplx::task<web::http::http_response>::_ContinuationTaskHandle<web::http::http_response, web::http::http_response, azure::storage::core::executor_impl::execute_async(std::shared_ptr<azure::storage::core::storage_command_base>, const azure::storage::request_options&, azure::storage::operation_context)::<lambda()>::<lambda(pplx::task<web::http::http_response>)>::<lambda(pplx::task<web::http::http_response>)>, std::integral_constant<bool, true>, pplx::details::_TypeSelectorNoAsync>::_Continue (this=0x56221924c5c0)
    at /usr/local/include/pplx/pplxtasks.h:4057
#22 pplx::task<web::http::http_response>::_ContinuationTaskHandle<web::http::http_response, web::http::http_response, azure::storage::core::executor_impl::execute_async(std::shared_ptr<azure::storage::core::storage_command_base>, const azure::storage::request_options&, azure::storage::operation_context)::<lambda()>::<lambda(pplx::task<web::http::http_response>)>::<lambda(pplx::task<web::http::http_response>)>, std::integral_constant<bool, true>, pplx::details::_TypeSelectorNoAsync>::_Perform (this=0x56221924c5c0)
    at /usr/local/include/pplx/pplxtasks.h:3938
#23 pplx::details::_PPLTaskHandle<web::http::http_response, pplx::task<web::http::http_response>::_ContinuationTaskHandle<web::http::http_response, web::http::http_response, azure::storage::core::executor_impl::execute_async(std::shared_ptr<azure::storage::core::storage_command_base>, const azure::storage::request_options&, azure::storage::operation_context)::<lambda()>::<lambda(pplx::task<web::http::http_response>)>::<lambda(pplx::task<web::http::http_response>)>, std::integral_constant<bool, true>, pplx::details::_TypeSelectorNoAsync>, pplx::details::_ContinuationTaskHandleBase>::invoke(void) const (this=0x56221924c5c0) at /usr/local/include/pplx/pplxtasks.h:1645
#24 0x0000562217523cf4 in pplx::details::_TaskProcHandle::_RunChoreBridge (_Parameter=0x56221924c5c0) at /home/ubuntu/work/toolset/casablanca/2.10.16/Source/Release/include/pplx/pplx.h:113
#25 0x00007f7376651b06 in boost::asio::detail::completion_handler<boost::_bi::bind_t<void, void (*)(void*), boost::_bi::list1<boost::_bi::value<void*> > > >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)
    () from /home/ubuntu/work/toolset/casablanca/2.10.16/Source/build.release/Release/Binaries/libcpprest.so.2.10
#26 0x00007f73769d0214 in boost::asio::detail::scheduler_operation::complete (bytes_transferred=<optimized out>, ec=..., owner=0x5622189b61a0, this=<optimized out>) at /usr/include/boost/asio/detail/scheduler_operation.hpp:40
#27 boost::asio::detail::scheduler::do_run_one (ec=..., this_thread=..., lock=..., this=<optimized out>) at /usr/include/boost/asio/detail/impl/scheduler.ipp:447
#28 boost::asio::detail::scheduler::run (this=0x5622189b61a0, ec=...) at /usr/include/boost/asio/detail/impl/scheduler.ipp:200
#29 0x00007f7376652846 in boost::asio::detail::posix_thread::func<(anonymous namespace)::threadpool_impl::add_thread()::{lambda()#1}>::run() () from /home/ubuntu/work/toolset/casablanca/2.10.16/Source/build.release/Release/Binaries/libcpprest.so.2.10
#30 0x00007f73769ca8c4 in boost::asio::detail::boost_asio_detail_posix_thread_function (arg=0x5622189ba860) at /usr/include/boost/asio/detail/impl/posix_thread.ipp:74
#31 0x00007f73760b5609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#32 0x00007f7375dde103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I hope this helps.

Jinming-Hu commented 4 years ago

We're going to look into this.

Jinming-Hu commented 4 years ago

I can reproduce this with the latest release, and I can confirm this is a bug.

Jinming-Hu commented 4 years ago

The root cause of this hang issue is that we are trying to convert an async read operation into a sync operation here, which is very bad practice in the async programming model but still exists widely in the current codebase for historical reasons.

Normally, for each async operation we should pass in a continuation function via .then(), so that the continuation gets called when the async operation finishes. But by calling .wait() we actually consume a thread that blocks until the async operation finishes. If all threads in the thread pool (by default we have 40) happen to be waiting on async operations, there are no threads left to actually perform those operations, so they never finish, the waiting threads wait forever, and the whole program hangs.
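
A minimal sketch of the starvation pattern described above (illustrative only, not the SDK's actual code; read_async is a hypothetical operation whose continuation runs on the same shared pool):

#include <pplx/pplxtasks.h>

pplx::task<size_t> read_async(); // hypothetical async read, completed by a pool thread

size_t read_sync()
{
    // Anti-pattern: parks the calling pool thread until the task finishes.
    // With 40 pool threads and 40 concurrent callers blocked here, no thread
    // is left to complete read_async(), so every .get() waits forever.
    return read_async().get();
}

pplx::task<size_t> read_chained()
{
    // Preferred shape: chain a continuation instead of blocking a thread.
    return read_async().then([](size_t n) { return n; });
}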

Jinming-Hu commented 4 years ago

This bug is not easy to fix and is going to take quite some time. I'm afraid I cannot prioritize it while we're actively working on the Track2 C++ SDK (which is supposed to replace this Track1 SDK in the future).

But I can suggest some ways to reduce the likelihood:

  1. Put more threads into the thread pool (see the sketch after this list); ref https://github.com/Azure/azure-storage-cpp/issues/298
  2. Change your program logic to avoid 4xx HTTP errors as much as possible.
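
For option 1, a minimal sketch, assuming cpprestsdk 2.10+ built with its Boost.Asio thread pool, where the shared pool can be sized before first use (the count of 100 is an arbitrary example):

#include <pplx/threadpool.h>

int main()
{
    // Must run before the first storage call creates the shared pool with
    // its default 40 threads; cpprest throws if the pool already exists.
    crossplat::threadpool::initialize_with_threads(100);

    // ... construct clients and issue requests as usual ...
    return 0;
}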

Maybe you can also try our Track2 SDK, which comes with fewer dependencies, is supposed to be faster, and doesn't have these hang issues.

Jinming-Hu commented 4 years ago

This issue is kind of complex, and we want to see if we can figure out a more interactive way to address it. You can send me an email if you like so we can discuss it in more detail. My corporate email address is on my homepage.

kiril-kirov commented 4 years ago

It's unfortunate that the bug is that complex.

Regarding your advice:

  1. More threads in the pool - will this help? Maybe I don't completely understand the problem, but my concern is that this .get() here looks like it never finishes. Isn't that because the SDK expects an answer that is never returned by the backend? Or is it just because there's no free thread to actually process the returned answer and signal the waiting thread(s)? My point is: if it's some logical issue and the SDK waits for an answer it will never receive, then the application will eventually block again. I mean, one by one, all threads will be "eaten" by this.

  2. Avoid 4xx HTTP errors - that's going to be really hard for us at this point; I'm not sure it's doable at the moment. Unless there's a way to check whether such a blob exists before trying to delete it? Would that help, if it's possible?

  3. Unfortunately, changing the SDK (to Track2) at this point is too risky for us, especially with a version that's not completed yet. But that's great news - I didn't know Track2 was under development, and it sounds like it will be an amazing update once finished and tested. Can't wait for it!


Regarding 1: actually, if the issue is that there's no free thread to process the response (because all threads are blocked in get()), we can think of a way to have more threads than outstanding requests. Would that help - maybe that's what you meant? If so, this is probably the easiest way out for us at this point.

Yes, we can switch to email correspondence.

Jinming-Hu commented 4 years ago
  1. More threads in the pool - will this help? Maybe I don't completely understand the problem, but my concern is that this .get() here looks like it never finishes. Isn't that because the SDK expects an answer that is never returned by the backend? Or is it just because there's no free thread to actually process the returned answer and signal the waiting thread(s)? My point is: if it's some logical issue and the SDK waits for an answer it will never receive, then the application will eventually block again. I mean, one by one, all threads will be "eaten" by this.

It's because there is no free thread. One thread is waiting for the read operation to finish, while we need another thread to actually do the read operation. If all those 40 threads happen to be waiting, there won't be another thread to do the actual read.

Jinming-Hu commented 4 years ago

2. Avoid 4xx HTTP errors - that's going to be really hard for us at this point; I'm not sure it's doable at the moment. Unless there's a way to check whether such a blob exists before trying to delete it? Would that help, if it's possible?

One workaround is the list operation. But in extreme cases, you may need several HTTP requests to list one blob.
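
A hedged sketch of that list-based existence check against the Track1 API as I understand it (container and blob_name are assumptions; the iterator follows continuation tokens internally, which is where the extra requests can come from):

#include <was/blob.h>

bool blob_exists_via_list(azure::storage::cloud_blob_container& container,
                          const utility::string_t& blob_name)
{
    // Listing with the exact name as prefix returns 2xx whether or not the
    // blob exists, avoiding the bodyless 4xx error path described above.
    azure::storage::list_blob_item_iterator end;
    for (auto it = container.list_blobs(blob_name, /* use_flat_blob_listing */ true,
                                        azure::storage::blob_listing_details::none,
                                        /* max_results */ 1,
                                        azure::storage::blob_request_options(),
                                        azure::storage::operation_context());
         it != end; ++it)
    {
        if (it->is_blob() && it->as_blob().name() == blob_name)
            return true;
    }
    return false;
}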

kiril-kirov commented 4 years ago

One workaround is the list operation. But in extreme cases, you may need several HTTP requests to list one blob.

Thanks for the idea, it's still something.

Regarding

It's because there is no free thread. One thread is waiting for the read operation to finish, while we need another thread to actually do the read operation. If all those 40 threads happen to be waiting, there won't be another thread to do the actual read.

So, if we schedule at most 39 parallel requests, this will leave one free thread, correct? Will this help avoid this situation? Also, does this affect any type of request (not just DELETE) in case of an error? Because I tried GET requests (for non-existing blobs) and it didn't break. Or maybe I didn't wait long enough?

Continuing the discussion here, as this could be helpful for somebody else as well. Will switch to email if we need to go into more detail, of course.

Thanks for your assistance and fast replies! Appreciated!

Jinming-Hu commented 4 years ago

So, if we schedule at most 39 parallel requests, this will leave one free thread, correct? Will this help avoid this situation?

Theoretically, yes. You can also increase the number of threads in the thread pool.

Also, does this affect any type of request (not just DELETE) in case of an error? Because I tried GET requests (for non-existing blobs) and it didn't break. Or maybe I didn't wait long enough?

I think all kinds of HTTP methods are affected, as long as the HTTP response status code is 4xx or 5xx.

Jinming-Hu commented 4 years ago

Also, does this affect any type of request (not just DELETE) in case of an error? Because I tried GET requests (for non-existing blobs) and it didn't break. Or maybe I didn't wait long enough?

Which kind of GET request did you try with? Get blob properties or download blob?

kiril-kirov commented 4 years ago

Theoretically, yes. You can also increase the number of threads in the thread pool.

Yes, I was just wondering how many threads. So basically the idea is: at least one thread more than the max number of parallel requests.

Which kind of GET request did you try with? Get blob properties or download blob?

Downloading a blob. To be more specific: download_range_to_stream_async. It's in the test example, but it's commented out and unused. You can just replace the DeleteBlob with GetBlob in the test example and it should start using GETs. It also uses a customer-provided buffer, so it's a bit trickier than the DELETE operation.

By the way, I started the test app from above, just lowering the number of parallel requests to 39, and it keeps running at the moment. It's been running for several minutes (not that long), but it used to block much earlier.

Jinming-Hu commented 4 years ago

Theoretically, yes. You can also increase the number of threads in the thread pool.

Yes, I was just wondering how many threads. So basically the idea is: at least one thread more than the max number of parallel requests.

One more than max parallelism is enough.

Which kind of GET request did you try with? Get blob properties or download blob?

Downloading a blob. To be more specific: download_range_to_stream_async. It's in the test example, but it's commented out and unused. You can just replace the DeleteBlob with GetBlob in the test example and it should start using GETs. It also uses a customer-provided buffer, so it's a bit trickier than the DELETE operation.

An HTTP Get request with a body won't trigger this hang issue. Try get_properties, which is an HTTP GET request without a body.
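
A hedged sketch of that suggestion, assuming the Track1 call for get-blob-properties is cloud_blob::download_attributes() (container is a placeholder):

#include <was/blob.h>

void probe_missing_blob(azure::storage::cloud_blob_container& container)
{
    auto blob = container.get_block_blob_reference(U("does-not-exist"));
    try
    {
        // Get-blob-properties; a missing blob raises a storage_exception
        // whose error response carries no body to parse.
        blob.download_attributes();
    }
    catch (const azure::storage::storage_exception&)
    {
        // Expected for a missing blob.
    }
}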

kiril-kirov commented 4 years ago

Aha, that's really helpful! Thanks! I'll keep you posted. Let's keep this issue open for a bit longer.

kiril-kirov commented 4 years ago

We'll go with managing the max number of parallel requests according to the number of threads in the thread pool, meaning we'll try to keep one thread free to process responses.
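
A sketch of that approach: a plain C++11 counting semaphore capping in-flight requests at pool size minus one (the limit of 39 assumes the default 40-thread pool; request_throttle and the usage names are illustrative):

#include <condition_variable>
#include <cstddef>
#include <mutex>

class request_throttle
{
public:
    explicit request_throttle(std::size_t limit) : m_available(limit) {}

    // Call from an application thread (not a cpprest pool thread) before
    // scheduling a request; blocks while all slots are taken.
    void acquire()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return m_available > 0; });
        --m_available;
    }

    // Call from the request's continuation, so the slot is freed even when
    // the operation failed with a 4xx error.
    void release()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        ++m_available;
        m_cv.notify_one();
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::size_t m_available;
};

// Usage sketch:
// request_throttle throttle(39);
// throttle.acquire();
// blob.delete_blob_async().then([&throttle](pplx::task<void> t) {
//     throttle.release();
//     try { t.get(); } catch (const azure::storage::storage_exception&) {}
// });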


Regarding:

An HTTP Get request with a body won't trigger this hang issue. Try get_properties, which is an HTTP GET request without a body.

Maybe I'm missing something, but I just checked the REST API, and GET blob requests don't have a request body, do they? Or are we speaking about a response body instead (judging by the example you gave)?

Speaking of this, Put Blob also doesn't have a response body, so it could theoretically cause the same issue? Although it's less likely, as we don't usually expect tons of errors during PUT.

Jinming-Hu commented 4 years ago

Maybe I'm missing something, but I just checked the REST API, and GET blob requests don't have a request body, do they? Or are we speaking about a response body instead (judging by the example you gave)?

I meant response body.

Speaking of this, Put Blob also doesn't have a response body, so it could theoretically cause the same issue? Although it's less likely, as we don't usually expect tons of errors during PUT.

Theoretically, yes.

Jinming-Hu commented 4 years ago

By the way, this bug cannot be fixed in the near future because we're short-handed. As an open-source project, any contribution from the community is welcome.

kiril-kirov commented 4 years ago

Speaking of this, Put Blob also doesn't have a response body, so it could theoretically cause the same issue? Although it's less likely, as we don't usually expect tons of errors during PUT.

Theoretically, yes.

Can this happen during normal operation? Or is an error response a necessary condition?

By the way, this bug cannot be fixed in the near future because we're short-handed. As an open-source project, any contribution from the community is welcome.

Yes, I totally get your point.

Jinming-Hu commented 4 years ago

Speaking of this, Put Blob also doesn't have a response body, so it could theoretically cause the same issue? Although it's less likely, as we don't usually expect tons of errors during PUT.

Theoretically, yes.

Can this happen during normal operation? Or is an error response a necessary condition?

An HTTP error status code is a necessary condition.

kiril-kirov commented 4 years ago

One last question, I believe (I'm doing some experiments, but I wanted to confirm the answers with you): can this issue occur if there are 39 parallel DELETE blob requests (given the 40 threads in the thread pool) and some number of parallel GET blob requests at the same time (e.g. 32)?


I mean, we agreed this can't happen if there's at least one 'free' thread that is able to process responses. It won't happen if the responses have an HTTP body. So the 39 DELETEs could theoretically block in case of 4xx HTTP errors, as those responses don't have bodies. On the other hand, we don't expect GET blob to block (due to its body).

So, theoretically (and practically, given my test results), if we use the standard thread pool with 40 threads and:

  1. have at most 39 parallel DELETE blob requests and
  2. a random number of parallel GET (download) blob requests (e.g. 32) at the same time

then this is not supposed to happen, correct?

Jinming-Hu commented 4 years ago

One last question, I believe (I'm doing some experiments, but I wanted to confirm the answers with you): can this issue occur if there are 39 parallel DELETE blob requests (given the 40 threads in the thread pool) and some number of parallel GET blob requests at the same time (e.g. 32)?

It won't occur in this case.

I mean, we agreed this can't happen if there's at least one 'free' thread that is able to process responses. It won't happen if the responses have an HTTP body. So the 39 DELETEs could theoretically block in case of 4xx HTTP errors, as those responses don't have bodies. On the other hand, we don't expect GET blob to block (due to its body).

So, theoretically (and practically, given my test results), if we use the standard thread pool with 40 threads and:

  1. have at most 39 parallel DELETE blob requests and
  2. a random number of parallel GET (download) blob requests (e.g. 32) at the same time

then this is not supposed to happen, correct?

Exactly correct.

Jinming-Hu commented 4 years ago

@kiril-kirov By the way, since you have a lot of delete operations, Blob Batch might be helpful if they are all in the same storage account.

kiril-kirov commented 4 years ago

Thank you very much for your time, help and ideas!