aws / aws-sdk-cpp

AWS SDK for C++
Apache License 2.0

Occasional segfaults in event_loop teardown due to dead logger #2681

Closed. wence- closed this issue 5 months ago.

wence- commented 11 months ago

Describe the bug

I am using distributed and pyarrow; pyarrow links to this SDK for S3 file access.

Using version 1.11.156, I see an intermittent segfault (so a race condition of some kind) in the "EvntLoopCleanup" thread that the AWS SDK sets up.

See backtrace below.

Expected Behavior

No segfault

Current Behavior

Backtrace:

(gdb) info thread
  Id   Target Id                                           Frame 
  1    Thread 0x7ffff7fe8740 (LWP 54918) "python"          _dl_close_worker (map=map@entry=0x5555c05d20a0, 
    force=force@entry=false) at dl-close.c:794
  3    Thread 0x7fff809ff700 (LWP 54931) "jemalloc_bg_thd" 0x00007ffff7bc2065 in futex_abstimed_wait_cancelable (
    private=<optimized out>, abstime=0x7fff809fed70, expected=0, futex_word=0x7fff81016634)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  5    Thread 0x7fff15d9e700 (LWP 54934) "cuda-EvtHandlr"  0x00007ffff7132bb9 in __GI___poll (fds=0x55555b07c900, nfds=2, 
    timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
  10   Thread 0x7fff1559d700 (LWP 54941) "AwsEventLoop 1"  0x00007ffff713f947 in epoll_wait (epfd=37, events=0x7fff1559c950, 
    maxevents=100, timeout=100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  11   Thread 0x7ffed9cd2700 (LWP 54942) "cuda-EvtHandlr"  0x00007ffff7132bb9 in __GI___poll (fds=0x7ffed4000bd0, nfds=10, 
    timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
  12   Thread 0x7ffed9270700 (LWP 54943) "python"          0x00007ffff7bc2065 in futex_abstimed_wait_cancelable (
    private=<optimized out>, abstime=0x7ffed926fe80, expected=0, futex_word=0x55555b156768)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  55   Thread 0x7ffe91dda700 (LWP 54988) "jemalloc_bg_thd" 0x00007ffff7bc1ad3 in futex_wait_cancelable (
    private=<optimized out>, expected=0, futex_word=0x7fff81016704) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  57   Thread 0x7ffe8f728700 (LWP 54991) "jemalloc_bg_thd" 0x00007ffff7bc1ad3 in futex_wait_cancelable (
    private=<optimized out>, expected=0, futex_word=0x7fff810167d4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
* 84   Thread 0x7fff8eb84700 (LWP 55056) "EvntLoopCleanup" 0x00007fff82829b90 in Aws::Utils::Logging::s_aws_logger_redirect_get_log_level (logger=0x7fff8290b2f0 <Aws::Utils::Logging::s_sdkCrtLogger>, subject=1025)
    at /usr/local/src/conda/aws-sdk-cpp-1.11.156/src/aws-cpp-sdk-core/source/utils/logging/CRTLogging.cpp:73

(gdb) bt
#0  0x00007fff82829b90 in Aws::Utils::Logging::s_aws_logger_redirect_get_log_level (
    logger=0x7fff8290b2f0 <Aws::Utils::Logging::s_sdkCrtLogger>, subject=1025)
    at /usr/local/src/conda/aws-sdk-cpp-1.11.156/src/aws-cpp-sdk-core/source/utils/logging/CRTLogging.cpp:73
#1  0x00007fff81e2e7d3 in s_destroy ()
   from /opt/conda/envs/dask/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-io.so.1.0.0
#2  0x00007fff81e28786 in s_aws_event_loop_group_shutdown_sync ()
   from /opt/conda/envs/dask/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-io.so.1.0.0
#3  0x00007fff81e287d9 in s_event_loop_destroy_async_thread_fn ()
   from /opt/conda/envs/dask/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-io.so.1.0.0
#4  0x00007fff81fa9359 in thread_fn ()
   from /opt/conda/envs/dask/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1
#5  0x00007ffff7bbb6db in start_thread (arg=0x7fff8eb84700) at pthread_create.c:463
#6  0x00007ffff713f61f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) p *CRTLogSystem._M_ptr
$8 = {_vptr.CRTLogSystemInterface = 0x0}

So it looks like, by the time this thread attempts to log something, the (static?) CRTLogSystem has already been torn down.

(I can attempt to build libaws-c-io with symbols as well to get more information).
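For what it's worth, here is a minimal standalone sketch (not the SDK's actual code, just my reading of the backtrace) of the kind of static-destruction race this looks like: a detached background thread dereferences a logger held in a static object, and whether it crashes depends on whether static destruction at process exit wins the race.

```cpp
#include <chrono>
#include <cstdio>
#include <memory>
#include <thread>

struct Logger {
    virtual ~Logger() = default;
    virtual int GetLogLevel() const { return 0; }
};

// Stand-in for the static CRT log system pointer that gdb shows as {_vptr = 0x0}.
static std::shared_ptr<Logger> g_crtLogSystem = std::make_shared<Logger>();

int main() {
    // Simulates the "EvntLoopCleanup" thread, which may still be running when
    // main() returns and static destructors tear down g_crtLogSystem.
    std::thread cleanup([] {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        // If the static above has already been destroyed, this dereference is
        // a use-after-free, analogous to s_aws_logger_redirect_get_log_level
        // in frame #0 of the backtrace.
        std::printf("log level = %d\n", g_crtLogSystem->GetLogLevel());
    });
    cleanup.detach();  // nobody joins the thread before static destruction
    return 0;
}
```

Built with `g++ -std=c++17 -pthread`, this may print the log level or crash depending on which side finishes first, which would match the intermittent nature of the failure.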

Reproduction Steps

This is not that easy to reproduce (I have not yet managed to minimise it at all, sorry!). It needs an NVIDIA GPU and a CUDA toolkit in the host environment, plus the NVIDIA Container Toolkit.

$ git clone https://github.com/dask/distributed
$ docker run --rm -it --gpus device=1 --network=host --volume $PWD:/workspace --workdir /workspace --entrypoint /bin/bash docker.io/gpuci/distributed:23.10-cuda11.5-devel-ubuntu18.04-py3.9
# in container
$ export PATH=/opt/conda/bin:/usr/local/cuda/bin:$PATH
$ export PARALLEL_LEVEL=${PARALLEL_LEVEL:-4}
$ cd /workspace
$ export DASK_DISTRIBUTED__ADMIN__SYSTEM_MONITOR__GIL__ENABLED=False
$ . /opt/conda/etc/profile.d/conda.sh
$ conda activate dask
$ python -m pip install -e .
$ python -m pip install git+https://github.com/dask/dask
# this might need to be run a few times
$ pytest distributed -v -m gpu --runslow -k 'ping_pong or transpose'

Possible Solution

No response

Additional Information/Context

SDK version 1.10.x doesn't exhibit this issue.

AWS CPP SDK version used

1.11.156

Compiler and Version used

g++ (conda-forge gcc 12.3.0-2) 12.3.0

Operating System and version

Ubuntu 18.04

jmklix commented 11 months ago

Can you reproduce this on something other than an NVIDIA GPU? If so, can you provide detailed reproduction steps for that?

wence- commented 11 months ago

> Can you reproduce this on something other than an NVIDIA GPU? If so, can you provide detailed reproduction steps for that?

I will give it a go and get back to you (it may take a little while), thanks!

pentschev commented 10 months ago

@wence- has been busy with other tasks, but I finally succeeded in coming up with a reproducer that doesn't require GPUs. I'm not entirely sure the bug is the same, as the stack I see is somewhat different from the one in the original description:

(gdb) bt
#0  aws_mem_release (allocator=0x55c315d103b9, ptr=0x55c649b598d0) at /home/mambauser/aws-sdk-cpp/crt/aws-crt-cpp/crt/aws-c-common/source/allocator.c:210
#1  0x00007fbbf0accc55 in Aws::Crt::StlAllocator<char>::deallocate (this=0x55c649b598a0, this=0x55c649b598a0, p=<optimized out>)
    at /home/mambauser/aws-sdk-cpp/crt/aws-crt-cpp/include/aws/crt/StlAllocator.h:56
#2  std::allocator_traits<Aws::Crt::StlAllocator<char> >::deallocate (__n=<optimized out>, __p=<optimized out>, __a=...)
    at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/alloc_traits.h:345
#3  std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> >::_M_destroy (__size=<optimized out>, this=0x55c649b598a0)
    at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/basic_string.h:245
#4  std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> >::_M_dispose (this=0x55c649b598a0)
    at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/basic_string.h:240
#5  std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> >::_M_dispose (this=0x55c649b598a0)
    at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/basic_string.h:237
#6  std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> >::~basic_string (this=0x55c649b598a0, __in_chrg=<optimized out>)
    at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/basic_string.h:672
#7  std::default_delete<std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> > >::operator() (this=<optimized out>,
    __ptr=0x55c649b598a0) at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/unique_ptr.h:85
#8  std::unique_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> >, std::default_delete<std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> > > >::~unique_ptr (this=<optimized out>, __in_chrg=<optimized out>)
    at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/unique_ptr.h:361
#9  0x00007fbe99af3495 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#10 0x00007fbe99af3610 in exit () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x00007fbe99ad7d97 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x00007fbe99ad7e40 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#13 0x000055c646edb421 in _start ()
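My (hedged) reading of this trace: a string allocated through Aws::Crt::StlAllocator is being destroyed by an exit handler after the underlying aws_allocator state is no longer valid, so aws_mem_release is called with a stale allocator pointer. A minimal standalone illustration of that teardown-ordering pattern, with entirely hypothetical names and not the SDK's actual code path, would be:

```cpp
#include <cstdio>
#include <cstdlib>
#include <memory>
#include <string>

// Hypothetical stand-in for the C-level allocator (aws_allocator in the trace).
struct CAllocator { bool alive = true; };
static CAllocator* g_allocator = nullptr;

void c_mem_release(CAllocator* a, void* p) {
    // If 'a' is stale or already shut down when exit handlers run, this is
    // where the crash (aws_mem_release in frame #0) would happen.
    if (a == nullptr || !a->alive) {
        std::fprintf(stderr, "release after allocator teardown\n");
        std::abort();
    }
    std::free(p);
}

// Stand-in for Aws::Crt::StlAllocator<char>: forwards deallocation to the
// C-level allocator through a global pointer.
template <typename T>
struct FwdAllocator {
    using value_type = T;
    FwdAllocator() = default;
    template <typename U> FwdAllocator(const FwdAllocator<U>&) {}
    T* allocate(std::size_t n) { return static_cast<T*>(std::malloc(n * sizeof(T))); }
    void deallocate(T* p, std::size_t) { c_mem_release(g_allocator, p); }
    friend bool operator==(const FwdAllocator&, const FwdAllocator&) { return true; }
    friend bool operator!=(const FwdAllocator&, const FwdAllocator&) { return false; }
};

using CrtString = std::basic_string<char, std::char_traits<char>, FwdAllocator<char>>;

// Destroyed during exit(), mirroring the unique_ptr<basic_string<...>> in frame #8.
static std::unique_ptr<CrtString> g_cachedValue;

int main() {
    CAllocator allocator;
    g_allocator = &allocator;
    // Long enough to defeat the small-string optimisation, so the string really owns heap memory.
    g_cachedValue = std::make_unique<CrtString>("a cached value that outlives the allocator");
    g_allocator = nullptr;  // allocator subsystem "shut down" before process exit
    return 0;               // exit handlers destroy g_cachedValue -> release after teardown
}
```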

To run the reproducer, a system with Docker available is required. First, save the snippet below as aws-segfault.dockerfile:

aws-segfault.dockerfile

```dockerfile
FROM mambaorg/micromamba:jammy

# Install runtime, build and debug dependencies
RUN micromamba create -c conda-forge -n aws-segfault \
    python=3.10 \
    cuda-version=11.8 \
    pyarrow=12.0.1 \
    aws-sdk-cpp=1.11.156 \
    pytest \
    pytest-cov \
    pytest-timeout \
    cmake \
    c-compiler \
    cxx-compiler \
    gcc_linux-64=11.* \
    make \
    gdb \
    git

SHELL ["micromamba", "run", "-n", "aws-segfault", "/bin/bash", "-c"]

# Install distributed from source in developer mode
RUN cd ${HOME} \
    && git clone https://github.com/dask/distributed \
    && cd distributed \
    && git checkout 2023.9.2 \
    && pip install -e .

# Build and install aws-sdk-cpp from source with debug config
# Note: we must unset SCCACHE_BUCKET that is predefined by the base image
RUN unset SCCACHE_BUCKET \
    && cd ${HOME} \
    && git clone --recurse-submodules https://github.com/aws/aws-sdk-cpp \
    && cd aws-sdk-cpp \
    && git checkout 1.11.181 \
    && mkdir build \
    && cd build \
    && cmake .. \
        -DCMAKE_BUILD_TYPE=Debug \
        -DCMAKE_PREFIX_PATH=$CONDA_PREFIX \
        -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX \
        -DBUILD_ONLY="s3" \
        -DAUTORUN_UNIT_TESTS=OFF \
        -DENABLE_TESTING=OFF \
    && cmake --build . --config=Debug \
    && cmake --install . --config=Debug

COPY test_aws_segfault.py /home/mambauser/distributed/test_aws_segfault.py

WORKDIR /home/mambauser/distributed
```

Then save the following as test_aws_segfault.py in the same local directory:

test_aws_segfault.py

```python
import pytest

pytestmark = pytest.mark.gpu

from distributed import Client
from distributed.deploy.local import LocalCluster
from distributed.utils_test import gen_test


@gen_test()
async def test_aws_segfault():
    from pyarrow.fs import FSSpecHandler, PyFileSystem

    da = pytest.importorskip("dask.array")

    async with LocalCluster(
        protocol="tcp",
        n_workers=2,
        threads_per_worker=2,
        dashboard_address=":0",
        asynchronous=True,
    ) as cluster:
        async with Client(cluster, asynchronous=True):
            assert cluster.scheduler_address.startswith("tcp://")
            x = da.ones((10000, 10000), chunks=(1000, 1000)).persist()
            await x
            y = (x + x.T).sum()
            await y
```

With both files present in the local directory, build the docker image:

docker build -t aws-segfault:latest -f aws-segfault.dockerfile .

Then start the container:

docker run --rm -ti aws-segfault:latest /bin/bash

Inside the container run:

micromamba activate aws-segfault
python -m pytest -vs test_aws_segfault.py
# Or run with gdb
# gdb --args python -m pytest -vs test_aws_segfault.py
# run

With the instructions above I consistently got a segfault; I did not need to run the test multiple times to reproduce it.

jmklix commented 5 months ago

This should be fixed by these two PRs:

github-actions[bot] commented 5 months ago

This issue is now closed. Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one.

wence- commented 5 months ago

Thanks, we'll keep an eye out.