conda-forge / grpc-cpp-feedstock

A conda-smithy repository for grpc-cpp.
BSD 3-Clause "New" or "Revised" License
2 stars 23 forks source link

grpcio >=1.51 python stack is segfaults on osx (x86-64) #281

Closed s22chan closed 1 month ago

s22chan commented 1 year ago

Solution to issue cannot be found in the documentation.

Issue

On 1.51/2, I get the following making a unary call:

>>> import grpc
>>> import helloworld_pb2
>>> import helloworld_pb2_grpc
>>> with grpc.insecure_channel('localhost:50051') as channel:
        stub = helloworld_pb2_grpc.GreeterStub(channel)
        response = stub.SayHello(helloworld_pb2.HelloRequest(name='you'))

E0221 15:33:42.187235000 140704340240576 client_channel.cc:625]        chand=0x7f8c0182d360: Illegal keepalive throttling value
E0221 15:33:42.187294000 140704340240576 client_channel.cc:625]        chand=0x7f8c0182d360: Illegal keepalive throttling value
E0221 15:33:42.187544000 140704340240576 client_channel.cc:625]        chand=0x7f8c0182d360: Illegal keepalive throttling value
E0221 15:33:42.191049000 123145385762816 client_channel.cc:625]        chand=0x7f8c0182d360: Illegal keepalive throttling value
E0221 15:33:42.191257000 123145385762816 client_channel.cc:625]        chand=0x7f8c0182d360: Illegal keepalive throttling value
E0221 15:33:42.195639000 140704340240576 client_channel.cc:625]        chand=0x7f8c0182d360: Illegal keepalive throttling value

then it randomly can segfault.

Running under lldb shows it's there's some kind of error (linkage to libstdc++?) in abseil (20230125.0):

 thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x1)
    frame #0: 0x0000000104bc7c2e libabsl_cord.2301.0.0.dylib`absl::lts_20230125::CopyCordToString(absl::lts_20230125::Cord const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*) + 46
libabsl_cord.2301.0.0.dylib`absl::lts_20230125::CopyCordToString:
->  0x104bc7c2e <+46>: movq   (%rax), %r15
    0x104bc7c31 <+49>: movzbl (%rbx), %ecx
    0x104bc7c34 <+52>: movl   %ecx, %eax
    0x104bc7c36 <+54>: andb   $0x1, %al

Installed packages

# Name                    Version                   Build  Channel
bzip2                     1.0.8                h0d85af4_4    conda-forge
c-ares                    1.18.1               h0d85af4_0    conda-forge
ca-certificates           2022.12.7            h033912b_0    conda-forge
grpcio                    1.52.1          py311h814d153_0    conda-forge
libabseil                 20230125.0      cxx17_hf0c8a7f_1    conda-forge
libcxx                    14.0.6               hccf4f1f_0    conda-forge
libffi                    3.4.2                h0d85af4_5    conda-forge
libgrpc                   1.52.1               h493e69f_0    conda-forge
libprotobuf               3.21.12              hbc0c0cd_0    conda-forge
libsqlite                 3.40.0               ha978bb4_0    conda-forge
libzlib                   1.2.13               hfd90126_4    conda-forge
ncurses                   6.3                  h96cf925_1    conda-forge
openssl                   3.0.8                hfd90126_0    conda-forge
pip                       23.0.1             pyhd8ed1ab_0    conda-forge
protobuf                  4.21.12         py311h814d153_0    conda-forge
python                    3.11.0          he7542f4_1_cpython    conda-forge
python_abi                3.11                    3_cp311    conda-forge
re2                       2023.02.01           hf0c8a7f_0    conda-forge
readline                  8.1.2                h3899abd_0    conda-forge
setuptools                67.3.2             pyhd8ed1ab_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
tk                        8.6.12               h5dbffcc_0    conda-forge
tzdata                    2022g                h191b570_0    conda-forge
wheel                     0.38.4             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h775f41a_0    conda-forge
zlib                      1.2.13               hfd90126_4    conda-forge

Environment info

active environment : test
    active env location : /conda/envs/test
            shell level : 2
       user config file : /.condarc
 populated config files : /.condarc
          conda version : 22.11.1
    conda-build version : 3.23.3
         python version : 3.11.0.final.0
       virtual packages : __archspec=1=x86_64
                          __osx=13.1=0
                          __unix=0=0
       base environment : /conda  (writable)
      conda av data dir : /conda/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/osx-64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://conda.anaconda.org/bioconda/osx-64
                          https://conda.anaconda.org/bioconda/noarch
          package cache : /conda/pkgs
                          /.conda/pkgs
       envs directories : /conda/envs
                          /.conda/envs
               platform : osx-64
             user-agent : conda/22.11.1 requests/2.28.2 CPython/3.11.0 Darwin/22.2.0 OSX/13.1
                UID:GID : 501:20
             netrc file : None
           offline mode : False
zeroshade commented 1 year ago

I just wanted to chime in a finding here. We ran into this issue with the arrow nightly builds and was able to confirm that gRPC v1.54.0 appears to alleviate this crash.

See https://github.com/apache/arrow/pull/35090 for more info. For now we have to bump back down a version, but if v1.54 can get released on conda-forge arrow would switch to that.

Thanks!

h-vetinari commented 1 year ago

Thanks. Could you try 1.53 as well? 1.54 is pretty fresh and we'd have to either skip 1.53 or migrate twice in a row.

h-vetinari commented 1 year ago

Now that I see that 1.53 broke a version pattern scheme that's held for a while, I might go to 1.54 directly with the next migration...

s22chan commented 1 year ago

Still broken for me (and seems so for arrow as well: https://github.com/apache/arrow/issues/35089)

h-vetinari commented 1 year ago

For now we have to bump back down a version, but if v1.54 can get released on conda-forge arrow would switch to that.

grpc 1.54 is in conda-forge for a while now (there's also 1.55 already, but that will need more time due to breaking changes around protobuf). I'd still be very interested in knowing what broke and how!

s22chan commented 1 year ago

I think the crash is fixed, so this can be closed but requests still aren't being sent correctly.

I'd be interested in this portion of the code, but getting a debug build working of grpc and abseil might be tricky for someone outside these projects (I wasn't able to get abseil cmake to keep the debug symbols):

https://github.com/grpc/grpc/blob/33d0afd316d43ea89c4a5cda680405f2b4033194/src/core/ext/filters/client_channel/client_channel.cc#L650C7-L670

h-vetinari commented 1 year ago

From @lidavidm's work in https://github.com/apache/arrow/issues/36908:

For posterity, to get a debug build of gRPC:

  • git clean -fdx . in the grpc-cpp-feedstock
  • env PATH=... arch -arch x86_64 /bin/bash (clean your $PATH of any gunk)
  • Edit .scripts/run_osx_build.sh and add --keep-old-work to conda mambabuild
  • Edit the recipe to set CMAKE_BUILD_TYPE and also clean CMAKE_BUILD_TYPE out of CMAKE_ARGS (since something in the conda build setup also sets it there)
  • python3 build-locally.py
  • Symlink ln -s .../grpc-cpp-feedstock/miniforge3/conda-bld/grpc-split_1692042959526/work_moved_libgrpc-1.56.2-h162c7d8_0_osx-64/ back to the original path .../grpc-cpp-feedstock/miniforge3/conda-bld/grpc-split_1692042959526/work
  • Install the resulting package, lldb should be able to find debug symbols now

In the meantime, I've started looking at enabling the C++ test suite here as well: https://github.com/conda-forge/grpc-cpp-feedstock/pull/311

s22chan commented 1 year ago

this doesn't generate abseil as debug but I'll take a quick look

s22chan commented 1 year ago

I built both abseil and grpcio with debugging, and it's clear that there is some kind of linkage issue. Somehow the absl::optional<absl::Cord> definitions are different (code patches? different #define?) starting with how grpcio=1.51's build was changed.

abseil is returning a nullopt in various GetPayload places, but grpc is interpreting that as something that has a value.

lldb trace
``` libabsl_status.2301.0.0.dylib`absl::lts_20230125::Status::GetPayload: 0x10384927e <+94>: testb $0x1, %r13b -> 0x103849282 <+98>: jne 0x1038493ba ; <+410> [inlined] absl::lts_20230125::optional_internal::optional_data_dtor_base::optional_data_dtor_base() at optional.h:108:50 Process 32508 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into frame #0: 0x0000000103849282 libabsl_status.2301.0.0.dylib`absl::lts_20230125::Status::GetPayload(this=0x0000600000201c38, type_url=Summary Unavailable) const at status.cc:128:7 [opt] Target 0: (python) stopped. (lldb) libabsl_status.2301.0.0.dylib`absl::lts_20230125::Status::GetPayload: -> 0x1038493ba <+410>: movb $0x0, (%r15) 0x1038493be <+414>: movq %r15, %rax 0x1038493c1 <+417>: addq $0x28, %rsp 0x1038493c5 <+421>: popq %rbx 0x1038493c6 <+422>: popq %r12 0x1038493c8 <+424>: popq %r13 0x1038493ca <+426>: popq %r14 0x1038493cc <+428>: popq %r15 0x1038493ce <+430>: popq %rbp 0x1038493cf <+431>: retq 0x1038493d0 <+432>: leaq 0x3825(%rip), %rdi ; "operator()" Process 32508 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into frame #0: 0x00000001038493ba libabsl_status.2301.0.0.dylib`absl::lts_20230125::Status::GetPayload(this=, type_url=Summary Unavailable) const at status.cc:131:10 [opt] Target 0: (python) stopped. (lldb) libabsl_status.2301.0.0.dylib`absl::lts_20230125::Status::GetPayload: -> 0x1038493be <+414>: movq %r15, %rax 0x1038493c1 <+417>: addq $0x28, %rsp 0x1038493c5 <+421>: popq %rbx 0x1038493c6 <+422>: popq %r12 0x1038493c8 <+424>: popq %r13 0x1038493ca <+426>: popq %r14 0x1038493cc <+428>: popq %r15 0x1038493ce <+430>: popq %rbp 0x1038493cf <+431>: retq thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into frame #0: 0x00000001038493cf libabsl_status.2301.0.0.dylib`absl::lts_20230125::Status::GetPayload(this=, type_url=Summary Unavailable) const at status.cc:132:1 [opt] Target 0: (python) stopped. (lldb) libgrpc.34.0.0.dylib`grpc_core::ClientChannel::SubchannelWrapper::WatcherWrapper::ApplyUpdateInControlPlaneWorkSerializer: -> 0x1045911d7 <+87>: cmpb $0x0, -0x38(%rbp) 0x1045911db <+91>: je 0x10459130c ; <+396> at new 0x1045911e1 <+97>: leaq -0x68(%rbp), %rdi 0x1045911e5 <+101>: leaq -0x48(%rbp), %rsi 0x1045911e9 <+105>: callq 0x1048addd2 ; symbol stub for: absl::lts_20230125::Cord::operator std::__1::basic_string, std::__1::allocator >() const 0x1045911ee <+110>: movzbl -0x68(%rbp), %esi 0x1045911f2 <+114>: testb $0x1, %sil Process 32508 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into frame #0: 0x00000001045911d7 libgrpc.34.0.0.dylib`grpc_core::ClientChannel::SubchannelWrapper::WatcherWrapper::ApplyUpdateInControlPlaneWorkSerializer(this=0x000060000020c400, state=GRPC_CHANNEL_IDLE, status=0x0000600000201c38) at client_channel.cc:654:32 [opt] (lldb) p keepalive_throttling (std::optional) $4 = Has Value=true { Value = { contents_ = { data_ = { rep_ = { = { data ={...} as_tree ={...} } } } } } } ```
h-vetinari commented 1 year ago

Thanks for digging into this!

starting with how grpcio=1.51's build was changed.

So grpc-cpp=1.51.1=_0 only bumped the version, but shortly after we rebuilt for a new re2 (build _1) and more importantly: the newest abseil at the time (build *_2).

It's possible that grpc was using abseil in a way that wasn't compatible with that newer version, however, they definitely followed suit as of grpc 1.53 (which doesn't look like it needed source changes).

Which grpc & abseil version did you use for your debug builds?

s22chan commented 1 year ago

in this case grcp-cpp-feedstock 2e04d51 (1.57.0) and abseil-cpp-feedstock 6117048 (20230125.3).

But this issue has persisted since 1.51 as per the title. I'd dig in more but grcp takes forever to build.

s22chan commented 1 year ago

step-by-step repro for posterity (note: -DCMAKE_BUILD_TYPE=Debug will not show the bug):

grpc:

diff --git a/recipe/build-cpp.sh b/recipe/build-cpp.sh
index 51b2f3f..7a137d9 100755
--- a/recipe/build-cpp.sh
+++ b/recipe/build-cpp.sh
@@ -60,6 +60,7 @@ cmake ${CMAKE_ARGS} ..  \
       -GNinja \
       -DBUILD_SHARED_LIBS=ON \
       -DCMAKE_BUILD_TYPE=Release \
+      -DCMAKE_CXX_FLAGS_RELEASE="${CMAKE_CXX_FLAGS_RELEASE:-} -O1 -g -DNDEBUG" \
       -DCMAKE_CXX_FLAGS="$CXXFLAGS" \
       -DCMAKE_PREFIX_PATH=$PREFIX \
       -DCMAKE_INSTALL_PREFIX=$PREFIX \

abseil:

iff --git a/recipe/build-abseil.sh b/recipe/build-abseil.sh
index a1533e0..3fdc43f 100644
--- a/recipe/build-abseil.sh
+++ b/recipe/build-abseil.sh
@@ -24,6 +24,7 @@ fi
 cmake -G Ninja \
     ${CMAKE_ARGS} \
     -DCMAKE_BUILD_TYPE=Release \
+    -DCMAKE_CXX_FLAGS_RELEASE="${CMAKE_CXX_FLAGS_RELEASE:-} -O1 -g -DNDEBUG" \
     -DCMAKE_CXX_STANDARD=17 \
     -DCMAKE_INSTALL_LIBDIR=lib \
     -DCMAKE_PREFIX_PATH=${PREFIX} \
  1. build both with EXTRA_CB_OPTIONS="--dirty --no-test" OSX_SDK_DIR=... ./build-locally.py. This doesn't use the abseil local build but -c doesn't seem to work for mambabuild.
  2. mamba create -n test grpcio protobuf --override-channels -c /.../grpc-cpp-feedstock/miniforge3/conda-bld -c /.../abseil-cpp-feedstock/miniforge3/conda-bld -c conda-forge
  3. mamba activate test
  4. lldb python test.py note: lldb will complain that conda-forge touched the libabseil .o files and won't load symbols. I was able to get around, that but I forget now what I did.
h-vetinari commented 1 year ago

I tried to add a test for OP example in #312, but either I cannot get the server process to work correctly, our CI setup doesn't allow accessing accessing the default port of 127.0.0.1. On all platforms, it runs into a variant of

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNKNOWN
    details = ""
    debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50051:  {grpc_status:14, created_time:"2023-08-28T09:34:02.207699+00:00"}"
>

Interestingly, despite the failing startup, the segfault seems to happen on osx.

s22chan commented 1 year ago

you don't need a server, since the segfault happens due to the connection state being updated in callbacks for the client (eg IDLE -> CONNECTING)

h-vetinari commented 1 year ago

I had tried it without the server, and it gets the same error (on linux). If we are to integrate this into CI (best way to fix it and keep it fixed), we're going to need a form of the test that passes.

s22chan commented 1 year ago

I tried to add a test for OP example in #312, but either I cannot get the server process to work correctly, our CI setup doesn't allow accessing accessing the default port of 127.0.0.1. On all platforms, it runs into a variant of

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
  status = StatusCode.UNKNOWN
  details = ""
  debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50051:  {grpc_status:14, created_time:"2023-08-28T09:34:02.207699+00:00"}"
>

Interestingly, despite the failing startup, the segfault seems to happen on osx.

I don't think the server is started? the https://github.com/conda-forge/grpc-cpp-feedstock/pull/312/files#diff-ddad1f9c7894b9921ef910104dd72b393dcb3d052923fdc85a5f0eaad8bdb450 differs significantly from https://github.com/grpc/grpc/blob/master/examples/python/helloworld/greeter_server.py

also while the crash is non-deterministic, the keepalive error is 100% deterministic and you can just run it once and verify no warnings/errors were emitted.

h-vetinari commented 1 year ago

Good point, thanks. I had just stupidly copied stuff together from their docs. Updated the PR, let's see how it goes.

s22chan commented 1 year ago

@h-vetinari I saw you tried to downgrade abseil to test if it failed, have you tried new abseil + old (1.50) grpc?

h-vetinari commented 1 year ago

have you tried new abseil + old (1.50) grpc?

313 for now does 1.50 as published. We can then try to bump abseil and see what happens.

h-vetinari commented 1 year ago

Small update from #313: Test works with 1.50 & old abseil (20220623.3), fails on 1.50 with newer abseil (20230125; which our builds since 1.51 are built against), and passes again with newest abseil (20230802; not yet migrated)

h-vetinari commented 1 year ago

Alright, the good news is that both on 1.50 as well as on 1.56, the newest abseil (20230802) fixes the error.

The bad news is that the migration for this is stuck in purgatory until we figure out how to move to a Macos>=10.13 world

s22chan commented 1 year ago

thanks for the investigation @h-vetinari. I'd be suspicious if 20230125 actually ever worked with OSX<10.13 at this point... although I didn't see any C++ABI code in abseil's optional code

h-vetinari commented 1 year ago

Do you think it has something to do with having compiled abseil 20230125 against 10.9? That possibility did not occur to me, I would have guessed it's an abseil bug, but perhaps you're right. In that case, we could fix all the existing grpc builds by just recompiling abseil against 10.13, which would be very nice...

s22chan commented 1 year ago

That was my initial thought before debugging (since that was the big change other than the version bump), but the optional code didn't have/fallback on libc++, so not sure if it's worth your time.

h-vetinari commented 1 month ago

Closing this as fixed by #315