njzjz opened 5 months ago
Do you have more information about the calling code? We even have a specific flags test here that runs on every CI run. It's also one of the most-used pieces of abseil, so it must be some very weird corner case for this to not show up in many more places.
It just uses dlopen to load a library, i.e.
dlopen(fname.c_str(), RTLD_NOW | RTLD_GLOBAL);
The library here is linked against abseil, tensorflow, and pytorch. No other specific code seems to be called.
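For completeness, a minimal self-contained sketch of that loading path (the function name and the error handling are illustrative only, not the actual calling code; fname is assumed to be the path of the library that links abseil, tensorflow, and pytorch):
#include <dlfcn.h>

#include <stdexcept>
#include <string>

void* LoadPlugin(const std::string& fname) {
  // Load the plugin eagerly and make its symbols globally visible.
  void* handle = dlopen(fname.c_str(), RTLD_NOW | RTLD_GLOBAL);
  if (handle == nullptr) {
    // dlerror() returns a human-readable description of the failure.
    throw std::runtime_error(dlerror());
  }
  return handle;
}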
I found it can be reproduced by a simple example:
test_cc.cc:
#include "tensorflow/core/public/session.h"
#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/shape_inference.h"
#include "tensorflow/cc/client/client_session.h"
#include "tensorflow/cc/ops/standard_ops.h"
#include "tensorflow/core/framework/tensor.h"
#include "torch/script.h"
int main() {
  using namespace tensorflow;
  using namespace tensorflow::ops;
  Scope root = Scope::NewRootScope();
  // Matrix A = [3 2; -1 0]
  auto A = Const(root, { {3.f, 2.f}, {-1.f, 0.f} });
  // Vector b = [3 5]
  auto b = Const(root, { {3.f, 5.f} });
  // v = Ab^T
  auto v = MatMul(root.WithOpName("v"), A, b, MatMul::TransposeB(true));
  std::vector<Tensor> outputs;
  ClientSession session(root);
  // Run and fetch v
  TF_CHECK_OK(session.Run({v}, &outputs));
  // Expect outputs[0] == [19; -3]
  LOG(INFO) << outputs[0].matrix<float>();
  torch::Tensor tensor = torch::eye(3);
  return 0;
}
compile.sh:
#!/bin/bash
set -exuo pipefail
# This environment installs the latest TensorFlow and PyTorch
PREFIX=~/anaconda3/envs/tfpt
g++ -o test_cc test_cc.cc -ltensorflow_cc -ltensorflow_framework -ltorch -ltorch_cpu -lc10 -labsl_status -I${PREFIX}/include -L${PREFIX}/lib -Wl,-rpath ${PREFIX}/lib
./test_cc
Got:
+ ./test_cc
./compile.sh: line 7: 1135726 Segmentation fault (core dumped) ./test_cc
GDB:
(gdb) r
Starting program: /home/jz748/tmp/testtfpt/test_cc
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
0x0000155507bc4e09 in absl::lts_20240116::flags_internal::FlagRegistry::RegisterFlag(absl::lts_20240116::CommandLineFlag&, char const*) ()
from /home/jz748/anaconda3/envs/tfpt/lib/libabsl_flags_reflection.so.2401.0.0
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.37-18.fc38.x86_64 xorg-x11-drv-nvidia-cuda-libs-545.29.06-2.fc38.x86_64
(gdb) where
#0 0x0000155507bc4e09 in absl::lts_20240116::flags_internal::FlagRegistry::RegisterFlag(absl::lts_20240116::CommandLineFlag&, char const*) ()
from /home/jz748/anaconda3/envs/tfpt/lib/libabsl_flags_reflection.so.2401.0.0
#1 0x0000155507bc65c1 in absl::lts_20240116::flags_internal::RegisterCommandLineFlag(absl::lts_20240116::CommandLineFlag&, char const*) ()
from /home/jz748/anaconda3/envs/tfpt/lib/libabsl_flags_reflection.so.2401.0.0
#2 0x0000155507be4079 in _GLOBAL__sub_I_flags.cc () from /home/jz748/anaconda3/envs/tfpt/lib/libabsl_log_flags.so.2401.0.0
#3 0x0000155555525237 in call_init (env=0x7fffffffcd28, argv=0x7fffffffcd18, argc=1, l=<optimized out>) at dl-init.c:74
#4 call_init (l=<optimized out>, argc=1, argv=0x7fffffffcd18, env=0x7fffffffcd28) at dl-init.c:26
#5 0x000015555552532d in _dl_init (main_map=0x1555555552c0, argc=1, argv=0x7fffffffcd18, env=0x7fffffffcd28) at dl-init.c:121
#6 0x000015555553b840 in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#7 0x0000000000000001 in ?? ()
#8 0x00007fffffffd220 in ?? ()
#9 0x0000000000000000 in ?? ()
Does the error reproduce if you comment out torch::Tensor tensor = torch::eye(3);? IOW, does it need both tensorflow and pytorch to fail?
One can reproduce the error even if the C++ file is empty and just links both tensorflow and pytorch:
int main() {
return 0;
}
The result is still a segmentation fault. However, removing either -ltorch -ltorch_cpu -lc10 or -ltensorflow_cc -ltensorflow_framework makes the error go away.
IOW, does it need both tensorflow and pytorch to fail?
Yes, if one links both of them.
CC @xhochy @hmaarrfk @isuruf if you have any ideas what could cause this. My suspicion would be that we're not fully unvendoring abseil in either pytorch or tensorflow, and that something clashes when both are linked (though it's also very weird that this seems to happen even without the linker having to find a related symbol)
Pytorch might be linking things publicly. I had to patch that at some point.
though it's also very weird that this seems to happen even without the linker having to find a related symbol
They sometimes do initialization routines in the startup sections. I thought we had this happen in a few other situations. While this is a "python" startup (and I know the C++ example might not hit this particular code path), I can imagine something like this: https://github.com/conda-forge/pytorch-cpu-feedstock/pull/244
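As a rough illustration of that mechanism (a hypothetical sketch, not the actual abseil code): abseil flags such as the ones registered by _GLOBAL__sub_I_flags.cc in the backtrace are registered from global constructors, which the dynamic loader runs in _dl_init before main() ever executes, so merely linking the libraries is enough to reach that code path.
// Hypothetical sketch of load-time registration, analogous to what
// ABSL_FLAG / _GLOBAL__sub_I_flags.cc do: a global object's constructor is
// placed in the library's .init_array and executed by the dynamic loader
// during _dl_init, before main().
#include <cstdio>

struct FlagRegistrar {
  explicit FlagRegistrar(const char* name) {
    // In abseil this would call flags_internal::RegisterCommandLineFlag;
    // here we only print to show that it runs before main().
    std::printf("registering flag '%s' at load time\n", name);
  }
};

// Global object: its constructor runs when the shared library is loaded.
static FlagRegistrar stderrthreshold_flag("stderrthreshold");

int main() {
  std::printf("main() starts\n");
  return 0;
}
If two copies of the flag machinery end up in the same process (for example one vendored inside tensorflow or pytorch and one from libabsl_flags_reflection.so), such a registration could hit a mismatched registry before anything in main() runs, which would be consistent with the _dl_init frames in the backtrace above.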
Tensorflow seems to leak the symbols quite heavily. Pytorch leaks fewer, but one does make it through:
nm -gD libtorch_cpu.so | grep absl
U _ZN6google8protobuf8internal17AssignDescriptorsEPFPKNS1_15DescriptorTableEvEPN4absl12lts_202401169once_flagERKNS0_8MetadataE
Note to self: https://conda-metadata-app.streamlit.app/ is useful.
nm -gD libtorch_cpu.so | grep absl
U _ZN6google8protobuf8internal17AssignDescriptorsEPFPKNS1_15DescriptorTableEvEPN4absl12lts_202401169once_flagERKNS0_8MetadataE
I don't see this symbol in the PyTorch PyPI wheel. Is there something different in the build process on conda-forge?
Finally, I found a workaround: put the link flag -labsl_log_flags before the other libraries. The link order matters.
# Segfault
g++ -o test_cc test_cc.cc -ltensorflow_cc -ltorch -ltorch_cpu -lc10 -I${PREFIX}/include -L${PREFIX}/lib -Wl,-rpath ${PREFIX}/lib
# No error
g++ -o test_cc test_cc.cc -labsl_log_flags -ltensorflow_cc -ltorch -ltorch_cpu -lc10 -I${PREFIX}/include -L${PREFIX}/lib -Wl,-rpath ${PREFIX}/lib
# Segfault
g++ -o test_cc test_cc.cc -ltensorflow_cc -ltorch -ltorch_cpu -lc10 -labsl_log_flags -I${PREFIX}/include -L${PREFIX}/lib -Wl,-rpath ${PREFIX}/lib
When using CMake, prepend -labsl_log_flags to LDFLAGS:
export LDFLAGS="-labsl_log_flags ${LDFLAGS}"
No segfault after applying this workaround!
Thanks for digging into this @njzjz! 🙏
My suspicion remains that we've not fully unvendored all traces of abseil in tensorflow (perhaps through a bug, i.e. something not being wired up correctly for TF_SYSTEM_LIBS=abseil).
Comment:
In https://github.com/conda-forge/deepmd-kit-feedstock/pull/78, I am building a program linked against both TensorFlow and PyTorch. When using the latest versions of them, i.e. TensorFlow 2.16 and PyTorch 2.3, there is a segmentation fault in absl::lts_20240116::flags_internal::FlagRegistry::RegisterFlag in either the conda-forge images or the local environment, as shown below. However, these two things do work: (1) pin tensorflow to 2.15 and pytorch to 2.1; (2) use the same abseil version, but build against TensorFlow only, without PyTorch (https://github.com/conda-forge/deepmd-kit-feedstock/pull/79).
I am not sure what the problem is. As a workaround, I pin tensorflow to 2.15 and pytorch to 2.1.