conda-forge / abseil-cpp-feedstock

A conda-smithy repository for abseil-cpp.
BSD 3-Clause "New" or "Revised" License

segfault in abseil 20240126.2 with TensorFlow 2.16 and PyTorch 2.3 #79

Open njzjz opened 5 months ago

njzjz commented 5 months ago


In https://github.com/conda-forge/deepmd-kit-feedstock/pull/78, I am building a program linked against both TensorFlow and PyTorch. With the latest versions of the two, i.e. TensorFlow 2.16 and PyTorch 2.3, there is a segmentation fault in absl::lts_20240116::flags_internal::FlagRegistry::RegisterFlag, both in the conda-forge images and in a local environment, as shown below:

(gdb) where
#0  0x0000155541a60e09 in absl::lts_20240116::flags_internal::FlagRegistry::RegisterFlag(absl::lts_20240116::CommandLineFlag&, char const*) ()
   from /home/jz748/anaconda3/envs/test-deepmd-build/bin/../lib/./python3.11/site-packages/tensorflow/../../../libabsl_flags_reflection.so.2401.0.0
#1  0x0000155541a625c1 in absl::lts_20240116::flags_internal::RegisterCommandLineFlag(absl::lts_20240116::CommandLineFlag&, char const*) ()
   from /home/jz748/anaconda3/envs/test-deepmd-build/bin/../lib/./python3.11/site-packages/tensorflow/../../../libabsl_flags_reflection.so.2401.0.0
#2  0x0000155541a80079 in _GLOBAL__sub_I_flags.cc ()
   from /home/jz748/anaconda3/envs/test-deepmd-build/bin/../lib/./python3.11/site-packages/tensorflow/../../../libabsl_log_flags.so.2401.0.0
#3  0x0000155555525237 in call_init (env=0x55555567dbf0, argv=0x7fffffffad58, argc=3, l=<optimized out>) at dl-init.c:74
#4  call_init (l=<optimized out>, argc=3, argv=0x7fffffffad58, env=0x55555567dbf0) at dl-init.c:26
#5  0x000015555552532d in _dl_init (main_map=0x555555780eb0, argc=3, argv=0x7fffffffad58, env=0x55555567dbf0) at dl-init.c:121
#6  0x00001555555215c2 in __GI__dl_catch_exception (exception=exception@entry=0x0, operate=operate@entry=0x15555552bf50 <call_dl_init>,
    args=args@entry=0x7fffffffa290) at dl-catch.c:211
#7  0x000015555552beec in dl_open_worker (a=a@entry=0x7fffffffa440) at dl-open.c:827
#8  0x0000155555521523 in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffa420,
    operate=operate@entry=0x15555552be50 <dl_open_worker>, args=args@entry=0x7fffffffa440) at dl-catch.c:237
#9  0x000015555552c2e4 in _dl_open (file=0x555555780cc0 "/home/jz748/anaconda3/envs/test-deepmd-build/lib/deepmd_lmp/dpplugin.so",
    mode=<optimized out>, caller_dlopen=0x155550f40916 <LAMMPS_NS::plugin_load(char const*, LAMMPS_NS::LAMMPS*)+166>, nsid=<optimized out>,
    argc=3, argv=0x7fffffffad58, env=0x55555567dbf0) at dl-open.c:903
#10 0x000015554fcc7714 in dlopen_doit () from /lib64/libc.so.6
#11 0x0000155555521523 in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffa630, operate=0x15554fcc76b0 <dlopen_doit>,
    args=0x7fffffffa6f0) at dl-catch.c:237
#12 0x0000155555521679 in _dl_catch_error (objname=0x7fffffffa698, errstring=0x7fffffffa6a0, mallocedp=0x7fffffffa697, operate=<optimized out>,
    args=<optimized out>) at dl-catch.c:256
#13 0x000015554fcc71f3 in _dlerror_run () from /lib64/libc.so.6
#14 0x000015554fcc77cf in dlopen@GLIBC_2.2.5 () from /lib64/libc.so.6
#15 0x0000155550f40916 in LAMMPS_NS::plugin_load(char const*, LAMMPS_NS::LAMMPS*) ()
   from /home/jz748/anaconda3/envs/test-deepmd-build/bin/../lib/liblammps.so.0
#16 0x0000155550f40f66 in LAMMPS_NS::plugin_auto_load(LAMMPS_NS::LAMMPS*) ()
   from /home/jz748/anaconda3/envs/test-deepmd-build/bin/../lib/liblammps.so.0
#17 0x00001555507fe6ed in LAMMPS_NS::LAMMPS::LAMMPS(int, char**, ompi_communicator_t*) ()
   from /home/jz748/anaconda3/envs/test-deepmd-build/bin/../lib/liblammps.so.0
#18 0x0000555555556217 in main ()

However, either of these avoids the crash: (1) pinning tensorflow to 2.15 and pytorch to 2.1; (2) using the same abseil version but building against TensorFlow only, without PyTorch (https://github.com/conda-forge/deepmd-kit-feedstock/pull/79).

I am not sure what the problem is. As a workaround, I pin tensorflow to 2.15 and pytorch to 2.1.

h-vetinari commented 5 months ago

Do you have more information about the calling code? We even have a specific flags test here that runs on every CI run. Flags are also one of the most-used pieces of abseil, so it must be some very weird corner case for this not to show up in many more places.

njzjz commented 5 months ago

It just uses dlopen to load a library, i.e.

dlopen(fname.c_str(), RTLD_NOW | RTLD_GLOBAL);

The library here is linked against abseil, TensorFlow, and PyTorch. No other specific code appears to be called.

njzjz commented 5 months ago

I found that it can be reproduced with a simple example:

test_cc.cc:

#include "tensorflow/core/public/session.h"
#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/shape_inference.h"

#include "tensorflow/cc/client/client_session.h"
#include "tensorflow/cc/ops/standard_ops.h"
#include "tensorflow/core/framework/tensor.h"
#include "torch/script.h"

int main() {
  using namespace tensorflow;
  using namespace tensorflow::ops;
  Scope root = Scope::NewRootScope();
  // Matrix A = [3 2; -1 0]
  auto A = Const(root, { {3.f, 2.f}, {-1.f, 0.f} });
  // Vector b = [3 5]
  auto b = Const(root, { {3.f, 5.f} });
  // v = Ab^T
  auto v = MatMul(root.WithOpName("v"), A, b, MatMul::TransposeB(true));
  std::vector<Tensor> outputs;
  ClientSession session(root);
  // Run and fetch v
  TF_CHECK_OK(session.Run({v}, &outputs));
  // Expect outputs[0] == [19; -3]
  LOG(INFO) << outputs[0].matrix<float>();

  torch::Tensor tensor = torch::eye(3);

  return 0;
}

compile.sh:

#!/bin/bash

set -exuo pipefail

# This environment installs the latest TensorFlow and PyTorch
PREFIX=~/anaconda3/envs/tfpt
g++ -o test_cc test_cc.cc -ltensorflow_cc -ltensorflow_framework -ltorch -ltorch_cpu -lc10 -labsl_status -I${PREFIX}/include -L${PREFIX}/lib -Wl,-rpath ${PREFIX}/lib
./test_cc

Got:

+ ./test_cc
./compile.sh: line 7: 1135726 Segmentation fault      (core dumped) ./test_cc

GDB:

(gdb) r
Starting program: /home/jz748/tmp/testtfpt/test_cc
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x0000155507bc4e09 in absl::lts_20240116::flags_internal::FlagRegistry::RegisterFlag(absl::lts_20240116::CommandLineFlag&, char const*) ()
   from /home/jz748/anaconda3/envs/tfpt/lib/libabsl_flags_reflection.so.2401.0.0
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.37-18.fc38.x86_64 xorg-x11-drv-nvidia-cuda-libs-545.29.06-2.fc38.x86_64
(gdb) where
#0  0x0000155507bc4e09 in absl::lts_20240116::flags_internal::FlagRegistry::RegisterFlag(absl::lts_20240116::CommandLineFlag&, char const*) ()
   from /home/jz748/anaconda3/envs/tfpt/lib/libabsl_flags_reflection.so.2401.0.0
#1  0x0000155507bc65c1 in absl::lts_20240116::flags_internal::RegisterCommandLineFlag(absl::lts_20240116::CommandLineFlag&, char const*) ()
   from /home/jz748/anaconda3/envs/tfpt/lib/libabsl_flags_reflection.so.2401.0.0
#2  0x0000155507be4079 in _GLOBAL__sub_I_flags.cc () from /home/jz748/anaconda3/envs/tfpt/lib/libabsl_log_flags.so.2401.0.0
#3  0x0000155555525237 in call_init (env=0x7fffffffcd28, argv=0x7fffffffcd18, argc=1, l=<optimized out>) at dl-init.c:74
#4  call_init (l=<optimized out>, argc=1, argv=0x7fffffffcd18, env=0x7fffffffcd28) at dl-init.c:26
#5  0x000015555552532d in _dl_init (main_map=0x1555555552c0, argc=1, argv=0x7fffffffcd18, env=0x7fffffffcd28) at dl-init.c:121
#6  0x000015555553b840 in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#7  0x0000000000000001 in ?? ()
#8  0x00007fffffffd220 in ?? ()
#9  0x0000000000000000 in ?? ()
h-vetinari commented 5 months ago

Does the error reproduce if you comment out torch::Tensor tensor = torch::eye(3);? IOW, does it need both tensorflow and pytorch to fail?

njzjz commented 5 months ago

One can reproduce the error even with an empty main() that just links both tensorflow and pytorch:

int main() {
  return 0;
}

Got Segmentation fault.

But if I remove -ltorch -ltorch_cpu -lc10, or remove -ltensorflow_cc -ltensorflow_framework, there is no error.

IOW, does it need both tensorflow and pytorch to fail?

Yes, if one links both of them.

h-vetinari commented 5 months ago

CC @xhochy @hmaarrfk @isuruf if you have any ideas what could cause this. My suspicion would be that we're not fully unvendoring abseil in either pytorch or tensorflow, and that something clashes when both are linked (though it's also very weird that this seems to happen even without the linker having to find a related symbol)

hmaarrfk commented 5 months ago

Pytorch might be linking things publicly. I had to patch that at some point.

hmaarrfk commented 5 months ago

though it's also very weird that this seems to happen even without the linker having to find a related symbol

They sometimes do initialization routines in the startup sections. I thought we had this happen in a few other situations.

While this is a "python" startup (and I know the C++ example might not hit this particular code path), I can imagine something like this: https://github.com/conda-forge/pytorch-cpu-feedstock/pull/244

hmaarrfk commented 4 months ago

Tensorflow seems to leak the symbols quite heavily [screenshot]

Pytorch leaks fewer, but one does make it through:

nm -gD libtorch_cpu.so | grep absl
                 U _ZN6google8protobuf8internal17AssignDescriptorsEPFPKNS1_15DescriptorTableEvEPN4absl12lts_202401169once_flagERKNS0_8MetadataE

Note to self: https://conda-metadata-app.streamlit.app/ is useful.

njzjz commented 2 months ago

nm -gD libtorch_cpu.so | grep absl
                 U _ZN6google8protobuf8internal17AssignDescriptorsEPFPKNS1_15DescriptorTableEvEPN4absl12lts_202401169once_flagERKNS0_8MetadataE

I don't see this symbol in the PyTorch PyPI wheel. Is there something different in the build process on conda-forge?

njzjz commented 1 month ago

Finally, I found a workaround: putting the link flag -labsl_log_flags before the other libraries. The link order matters.

# Segfault
g++ -o test_cc test_cc.cc -ltensorflow_cc -ltorch -ltorch_cpu -lc10 -I${PREFIX}/include -L${PREFIX}/lib -Wl,-rpath ${PREFIX}/lib

# No error
g++ -o test_cc test_cc.cc -labsl_log_flags -ltensorflow_cc -ltorch -ltorch_cpu -lc10 -I${PREFIX}/include -L${PREFIX}/lib -Wl,-rpath ${PREFIX}/lib

# Segfault
g++ -o test_cc test_cc.cc -ltensorflow_cc -ltorch -ltorch_cpu -lc10 -labsl_log_flags -I${PREFIX}/include -L${PREFIX}/lib -Wl,-rpath ${PREFIX}/lib

When using CMake, prepend -labsl_log_flags to LDFLAGS:

export LDFLAGS="-labsl_log_flags ${LDFLAGS}"

No segfault after applying this workaround!

h-vetinari commented 4 weeks ago

Thanks for digging into this @njzjz! 🙏

My suspicion remains that we've not fully unvendored all traces of abseil in tensorflow (perhaps through a bug, i.e. something not being wired up correctly for TF_SYSTEM_LIBS=abseil).