Closed s22chan closed 1 month ago
I just wanted to chime in with a finding here. We ran into this issue with the arrow nightly builds and were able to confirm that gRPC v1.54.0 appears to alleviate this crash.
See https://github.com/apache/arrow/pull/35090 for more info. For now we have to bump back down a version, but if v1.54 can get released on conda-forge arrow would switch to that.
Thanks!
Thanks. Could you try 1.53 as well? 1.54 is pretty fresh and we'd have to either skip 1.53 or migrate twice in a row.
Now that I see that 1.53 broke a version pattern scheme that's held for a while, I might go to 1.54 directly with the next migration...
Still broken for me (and seems so for arrow as well: https://github.com/apache/arrow/issues/35089)
> For now we have to bump back down a version, but if v1.54 can get released on conda-forge arrow would switch to that.
grpc 1.54 has been in conda-forge for a while now (there's also 1.55 already, but that will need more time due to breaking changes around protobuf). I'd still be very interested in knowing what broke and how!
I think the crash is fixed, so this can be closed but requests still aren't being sent correctly.
I'd be interested in this portion of the code, but getting a debug build working of grpc and abseil might be tricky for someone outside these projects (I wasn't able to get abseil cmake to keep the debug symbols):
From @lidavidm's work in https://github.com/apache/arrow/issues/36908:
For posterity, to get a debug build of gRPC:
- git clean -fdx . in the grpc-cpp-feedstock
- env PATH=... arch -arch x86_64 /bin/bash (clean your $PATH of any gunk)
- Edit .scripts/run_osx_build.sh and add --keep-old-work to conda mambabuild
- Edit the recipe to set CMAKE_BUILD_TYPE and also clean CMAKE_BUILD_TYPE out of CMAKE_ARGS (since something in the conda build setup also sets it there)
- python3 build-locally.py
- Symlink (ln -s) .../grpc-cpp-feedstock/miniforge3/conda-bld/grpc-split_1692042959526/work_moved_libgrpc-1.56.2-h162c7d8_0_osx-64/ back to the original path .../grpc-cpp-feedstock/miniforge3/conda-bld/grpc-split_1692042959526/work
- Install the resulting package, lldb should be able to find debug symbols now
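The symlink step in the list above can be scripted. This is only a sketch: relink_work_dir is a hypothetical helper (not part of conda-build or the feedstock), and the demo runs in a throwaway directory whose names merely mimic the conda-bld layout, since the real paths in the steps are elided.

```python
import tempfile
from pathlib import Path


def relink_work_dir(moved: Path, original: Path) -> Path:
    """Make `original` a symlink to `moved` so recorded source paths resolve again."""
    if original.is_symlink():
        original.unlink()  # replace a stale link left by an earlier run
    elif original.exists():
        raise FileExistsError(f"{original} exists and is not a symlink")
    original.symlink_to(moved, target_is_directory=True)
    return original


# Demo with made-up directory names (assumption: a work_moved_* directory
# sits next to the place where the original work directory used to be).
with tempfile.TemporaryDirectory() as tmp:
    bld = Path(tmp) / "conda-bld" / "grpc-split_0"
    moved = bld / "work_moved_example"
    (moved / "src").mkdir(parents=True)
    (moved / "src" / "file.cc").write_text("// source\n")

    link = relink_work_dir(moved, bld / "work")
    demo_is_link = link.is_symlink()
    demo_resolves = (link / "src" / "file.cc").read_text() == "// source\n"
```

The helper refuses to overwrite a real directory, so it should be safe to rerun after each build that moves the work directory again.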
In the meantime, I've started looking at enabling the C++ test suite here as well: https://github.com/conda-forge/grpc-cpp-feedstock/pull/311
this doesn't generate abseil as debug but I'll take a quick look
I built both abseil and grpcio with debugging, and it's clear that there is some kind of linkage issue. Somehow the absl::optional<absl::Cord> definitions are different (code patches? different #define?) starting with how grpcio=1.51's build was changed.
abseil is returning a nullopt in various GetPayload places, but grpc is interpreting that as something that has a value.
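To make the suspected failure mode concrete, here is a toy model in Python's ctypes. It is an assumption-laden illustration, not abseil's real layout: the two structs are hypothetical stand-ins for "the optional as abseil was compiled" and "the optional as grpc was compiled", differing only in field order. The point is just that one side can write an empty value and the other side can read the same bytes as engaged.

```python
import ctypes


class OptionalAsBuilt(ctypes.Structure):
    # Hypothetical layout seen by the library that produces the value.
    _fields_ = [("engaged", ctypes.c_bool), ("value", ctypes.c_uint64)]


class OptionalAsRead(ctypes.Structure):
    # Hypothetical layout seen by the caller: same fields, different order.
    _fields_ = [("value", ctypes.c_uint64), ("engaged", ctypes.c_bool)]


# The producer returns an empty optional. The payload holds leftover non-zero
# bytes, which is harmless as long as `engaged` is read from the right offset.
produced = OptionalAsBuilt(engaged=False, value=0x0101010101010101)

# The consumer reinterprets the very same bytes with its own definition.
raw = bytes(produced)
consumed = OptionalAsRead.from_buffer_copy(raw[: ctypes.sizeof(OptionalAsRead)])

print("producer says engaged:", produced.engaged)
print("consumer sees engaged:", consumed.engaged)
```

Whether the real mismatch is a layout difference, a different #define, or something else entirely is exactly what is still unknown here; the sketch only shows why a nullopt on one side of a library boundary can look like a value on the other.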
Thanks for digging into this!
> starting with how grpcio=1.51's build was changed.
So grpc-cpp=1.51.1=_0 only bumped the version, but shortly after we rebuilt for a new re2 (build _1) and more importantly: the newest abseil at the time (build *_2).
It's possible that grpc was using abseil in a way that wasn't compatible with that newer version, however, they definitely followed suit as of grpc 1.53 (which doesn't look like it needed source changes).
Which grpc & abseil version did you use for your debug builds?
In this case grpc-cpp-feedstock 2e04d51 (1.57.0) and abseil-cpp-feedstock 6117048 (20230125.3).
But this issue has persisted since 1.51 as per the title. I'd dig in more but grpc takes forever to build.
Step-by-step repro for posterity (note: -DCMAKE_BUILD_TYPE=Debug will not show the bug):
grpc:
diff --git a/recipe/build-cpp.sh b/recipe/build-cpp.sh
index 51b2f3f..7a137d9 100755
--- a/recipe/build-cpp.sh
+++ b/recipe/build-cpp.sh
@@ -60,6 +60,7 @@ cmake ${CMAKE_ARGS} .. \
-GNinja \
-DBUILD_SHARED_LIBS=ON \
-DCMAKE_BUILD_TYPE=Release \
+ -DCMAKE_CXX_FLAGS_RELEASE="${CMAKE_CXX_FLAGS_RELEASE:-} -O1 -g -DNDEBUG" \
-DCMAKE_CXX_FLAGS="$CXXFLAGS" \
-DCMAKE_PREFIX_PATH=$PREFIX \
-DCMAKE_INSTALL_PREFIX=$PREFIX \
abseil:
diff --git a/recipe/build-abseil.sh b/recipe/build-abseil.sh
index a1533e0..3fdc43f 100644
--- a/recipe/build-abseil.sh
+++ b/recipe/build-abseil.sh
@@ -24,6 +24,7 @@ fi
cmake -G Ninja \
${CMAKE_ARGS} \
-DCMAKE_BUILD_TYPE=Release \
+ -DCMAKE_CXX_FLAGS_RELEASE="${CMAKE_CXX_FLAGS_RELEASE:-} -O1 -g -DNDEBUG" \
-DCMAKE_CXX_STANDARD=17 \
-DCMAKE_INSTALL_LIBDIR=lib \
-DCMAKE_PREFIX_PATH=${PREFIX} \
- EXTRA_CB_OPTIONS="--dirty --no-test" OSX_SDK_DIR=... ./build-locally.py
  This doesn't use the abseil local build, but -c doesn't seem to work for mambabuild.
- mamba create -n test grpcio protobuf --override-channels -c /.../grpc-cpp-feedstock/miniforge3/conda-bld -c /.../abseil-cpp-feedstock/miniforge3/conda-bld -c conda-forge
- mamba activate test
- lldb python test.py
note: lldb will complain that conda-forge touched the libabseil .o files and won't load symbols. I was able to get around that, but I forget now what I did.
I tried to add a test for OP example in #312, but either I cannot get the server process to work correctly, or our CI setup doesn't allow accessing the default port on 127.0.0.1. On all platforms, it runs into a variant of
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNKNOWN
details = ""
debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50051: {grpc_status:14, created_time:"2023-08-28T09:34:02.207699+00:00"}"
>
Interestingly, despite the failing startup, the segfault seems to happen on osx.
you don't need a server, since the segfault happens due to the connection state being updated in callbacks for the client (eg IDLE -> CONNECTING)
I had tried it without the server, and it gets the same error (on linux). If we are to integrate this into CI (best way to fix it and keep it fixed), we're going to need a form of the test that passes.
> I tried to add a test for OP example in #312, but either I cannot get the server process to work correctly, or our CI setup doesn't allow accessing the default port on 127.0.0.1. On all platforms, it runs into a variant of
> grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with: status = StatusCode.UNKNOWN details = "" debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50051: {grpc_status:14, created_time:"2023-08-28T09:34:02.207699+00:00"}" >
> Interestingly, despite the failing startup, the segfault seems to happen on osx.
I don't think the server is started? The file at https://github.com/conda-forge/grpc-cpp-feedstock/pull/312/files#diff-ddad1f9c7894b9921ef910104dd72b393dcb3d052923fdc85a5f0eaad8bdb450 differs significantly from https://github.com/grpc/grpc/blob/master/examples/python/helloworld/greeter_server.py
also while the crash is non-deterministic, the keepalive error is 100% deterministic and you can just run it once and verify no warnings/errors were emitted.
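A "run it once and verify nothing was emitted" check could look roughly like the sketch below. It uses only the standard library; run_clean is a hypothetical helper name, and the two commands are stand-ins for the real client script (an assumption, since the actual test file lives in the PR). It also assumes the warning in question ends up on stderr.

```python
import subprocess
import sys


def run_clean(cmd, timeout=60):
    """Run `cmd` once; return (ok, stderr), where ok means exit code 0 and empty stderr."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    ok = proc.returncode == 0 and proc.stderr.strip() == ""
    return ok, proc.stderr


# Stand-ins for the real client: one silent run, one that writes to stderr.
quiet = [sys.executable, "-c", "print('ok')"]
noisy = [sys.executable, "-c", "import sys; sys.stderr.write('keepalive error\\n')"]

quiet_ok, _ = run_clean(quiet)
noisy_ok, noisy_err = run_clean(noisy)
print(quiet_ok, noisy_ok)
```

Because a crash in the child process also produces a non-zero return code, the same check would flag the non-deterministic segfault on the runs where it does occur.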
Good point, thanks. I had just stupidly copied stuff together from their docs. Updated the PR, let's see how it goes.
@h-vetinari I saw you tried to downgrade abseil to test if it failed, have you tried new abseil + old (1.50) grpc?
> have you tried new abseil + old (1.50) grpc?
Small update from #313: Test works with 1.50 & old abseil (20220623.3), fails on 1.50 with newer abseil (20230125; which our builds since 1.51 are built against), and passes again with newest abseil (20230802; not yet migrated)
Alright, the good news is that both on 1.50 as well as on 1.56, the newest abseil (20230802) fixes the error.
The bad news is that the migration for this is stuck in purgatory until we figure out how to move to a macOS>=10.13 world.
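For reference, the combinations reported in the last two comments can be written down as a small lookup. This only restates what was said above; nothing was re-measured, and the helper names are made up for the illustration.

```python
# (grpc version, abseil version) -> did the test pass?
# Values are the ones reported in this thread.
RESULTS = {
    ("1.50", "20220623.3"): True,
    ("1.50", "20230125"): False,
    ("1.50", "20230802"): True,
    ("1.56", "20230802"): True,
}


def failing_abseil_versions(results):
    """Abseil versions that appear in at least one failing combination."""
    return sorted({absl for (_, absl), passed in results.items() if not passed})


def grpc_versions_passing_with(results, absl):
    """grpc versions reported as passing against the given abseil version."""
    return sorted(g for (g, a), passed in results.items() if a == absl and passed)


print(failing_abseil_versions(RESULTS))
print(grpc_versions_passing_with(RESULTS, "20230802"))
```

Read this way, the failure tracks the abseil version rather than the grpc version, at least for the combinations tested so far.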
Thanks for the investigation @h-vetinari. I'd be suspicious if 20230125 actually ever worked with OSX<10.13 at this point... although I didn't see any C++ ABI code in abseil's optional code.
Do you think it has something to do with having compiled abseil 20230125 against 10.9? That possibility did not occur to me, I would have guessed it's an abseil bug, but perhaps you're right. In that case, we could fix all the existing grpc builds by just recompiling abseil against 10.13, which would be very nice...
That was my initial thought before debugging (since that was the big change other than the version bump), but the optional code didn't have/fallback on libc++, so not sure if it's worth your time.
Closing this as fixed by #315
Solution to issue cannot be found in the documentation.
Issue
On 1.51/2, I get the following when making a unary call:
Then it can randomly segfault.
Running under lldb shows there's some kind of error (linkage to libstdc++?) in abseil (20230125.0):
Installed packages
Environment info