ARM (snapdragon on debian) linker errors for macro-referenced torchInternalAssertFail

jkoudys commented 2 years ago

I'm on ARM: a Snapdragon CPU on a Chromebook, in a chroot (crouton running Debian). I'm seeing linker errors when building the simple example app from the README. I installed torch via pip3, and it runs fine from Python with a simple test script:

import torch
x = torch.rand(5, 3)
print(x)

Yet cargo build fails with this error everywhere their assertion macro is used:

  = note: /usr/bin/ld: /home/jkoudys/clausehound/clausehound-ml/target/debug/deps/libtorch_sys-589abf4835aabeb1.rlib(torch_api.o): in function `c10::Device::validate()':
          /home/jkoudys/.local/lib/python3.10/site-packages/torch/include/c10/core/Device.h:140: undefined reference to `c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
          /usr/bin/ld: /home/jkoudys/.local/lib/python3.10/site-packages/torch/include/c10/core/Device.h:144: undefined reference to `c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
          /usr/bin/ld: /home/jkoudys/clausehound/clausehound-ml/target/debug/deps/libtorch_sys-589abf4835aabeb1.rlib(torch_api.o): in function `caffe2::TypeMeta::fromScalarType(c10::ScalarType)':
          /home/jkoudys/.local/lib/python3.10/site-packages/torch/include/c10/util/typeid.h:473: undefined reference to `c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'

I can see the linker being invoked with the right lib flags ("-ltorch_cpu" "-ltorch"), and torchInternalAssertFail appears to exist in the lib:

$ nm -D /home/jkoudys/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so |grep Assert
                 U _ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_RKSs
                 U _ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_S2_
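
One detail worth checking in that output is the type column. In nm, `U` marks a symbol that the library references but does not define, so these two lines show libtorch_cpu.so using torchInternalAssertFail rather than providing it. A minimal Python sketch for sorting nm -D lines into defined and undefined symbols follows; the helper names parse_nm_line and is_defined are made up for illustration.

```python
def parse_nm_line(line):
    """Split one `nm -D` output line into (type_letter, symbol)."""
    parts = line.split()
    # Undefined symbols have no address column, so take the last two fields.
    return parts[-2], parts[-1]


def is_defined(type_letter):
    # Simplification: every type other than "U" is treated as defined.
    return type_letter != "U"


# The two lines reported above for libtorch_cpu.so.
lines = [
    "                 U _ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_RKSs",
    "                 U _ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_S2_",
]
for line in lines:
    kind, symbol = parse_nm_line(line)
    status = "defined here" if is_defined(kind) else "undefined here"
    print(symbol, status)
```

Running the same check against the other .so files in the lib directory should show which one carries the definition (a `T` entry).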

and the lib env vars are set:

$ echo $LIBTORCH
/home/jkoudys/.local/lib/python3.10/site-packages/torch
$ echo $LD_LIBRARY_PATH
/home/jkoudys/.local/lib/python3.10/site-packages/torch/lib
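
A quick way to sanity-check those two variables together is sketched below in Python. It assumes the layout seen in this thread, with lib and include directories directly under LIBTORCH; check_libtorch_env is a hypothetical helper, not something tch-rs provides.

```python
import os


def check_libtorch_env(env):
    """Return a list of problems found in LIBTORCH / LD_LIBRARY_PATH."""
    root = env.get("LIBTORCH", "")
    if not root:
        return ["LIBTORCH is not set"]
    problems = []
    # Assumed layout: $LIBTORCH/lib and $LIBTORCH/include both exist.
    for sub in ("lib", "include"):
        path = os.path.join(root, sub)
        if not os.path.isdir(path):
            problems.append("missing directory: " + path)
    # The runtime loader needs $LIBTORCH/lib on LD_LIBRARY_PATH.
    lib_dir = os.path.normpath(os.path.join(root, "lib"))
    entries = env.get("LD_LIBRARY_PATH", "").split(":")
    search = [os.path.normpath(p) for p in entries if p]
    if lib_dir not in search:
        problems.append("LD_LIBRARY_PATH does not contain " + lib_dir)
    return problems


print(check_libtorch_env(dict(os.environ)))
```

An empty list only means the paths line up. It says nothing about whether the libraries were built with compatible flags.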

I tried downgrading to 1.12.0 (I was 1.12.1 originally) and same problem.

The torchInternalAssertFail is only ever included when built by a macro, eg:

    TORCH_INTERNAL_ASSERT_DEBUG_ONLY(
        index_ == -1 || index_ >= 0,
        "Device index must be -1 or non-negative, got ",
        (int)index_);

I'm wondering if perhaps the flag that defines these TORCH_ assert macros as calling torchInternalAssertFail is on in one place but off in the other, so those functions don't get defined. I'm guessing it's an ARM build target thing, because nobody else seems to see this.

jkoudys commented 2 years ago

Here's the cargo build -vv output:

Caused by:
  process didn't exit successfully: `CARGO=/home/jkoudys/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin/cargo CARGO_BIN_NAME=clausehound-ml CARGO_CRATE_NAME=clausehound_ml CARGO_MANIFEST_DIR=/home/jkoudys/clausehound/clausehound-ml CARGO_PKG_AUTHORS='' CARGO_PKG_DESCRIPTION='' CARGO_PKG_HOMEPAGE='' CARGO_PKG_LICENSE='' CARGO_PKG_LICENSE_FILE='' CARGO_PKG_NAME=clausehound-ml CARGO_PKG_REPOSITORY='' CARGO_PKG_VERSION=0.1.0 CARGO_PKG_VERSION_MAJOR=0 CARGO_PKG_VERSION_MINOR=1 CARGO_PKG_VERSION_PATCH=0 CARGO_PKG_VERSION_PRE='' CARGO_PRIMARY_PACKAGE=1 LD_LIBRARY_PATH='/home/jkoudys/clausehound/clausehound-ml/target/debug/deps:/home/jkoudys/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/lib:/home/jkoudys/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/lib:/home/jkoudys/.local/lib/python3.10/site-packages/torch/lib' rustc --crate-name clausehound_ml --edition=2021 src/main.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type bin --emit=dep-info,link -C embed-bitcode=no -C debuginfo=2 -C metadata=831b607e724f3ee9 -C extra-filename=-831b607e724f3ee9 --out-dir /home/jkoudys/clausehound/clausehound-ml/target/debug/deps -C incremental=/home/jkoudys/clausehound/clausehound-ml/target/debug/incremental -L dependency=/home/jkoudys/clausehound/clausehound-ml/target/debug/deps --extern tch=/home/jkoudys/clausehound/clausehound-ml/target/debug/deps/libtch-b403d46711a7e857.rlib -L native=/home/jkoudys/.local/lib/python3.10/site-packages/torch/lib -L native=/home/jkoudys/clausehound/clausehound-ml/target/debug/build/torch-sys-6b8dc86f3bb2ab85/out -L native=/home/jkoudys/clausehound/clausehound-ml/target/debug/build/bzip2-sys-da4dd79fd35cb19f/out/lib -L native=/home/jkoudys/clausehound/clausehound-ml/target/debug/build/zstd-sys-fc39f9717c9fb857/out` (exit status: 1)

showing it's including the torch lib directory, where I can see those libs:

$ ls /home/jkoudys/.local/lib/python3.10/site-packages/torch/lib
libc10.so  libgomp-d22c30c5.so.1  libshm.so  libtorch_cpu.so  libtorch_global_deps.so  libtorch_python.so  libtorch.so

jkoudys commented 2 years ago

Got it to build by manually deleting all those ASSERT macros throughout the headers. Sorta runs, but not great:

use tch::Tensor;

fn main() {
    let t = Tensor::of_slice(&[3, 1, 4, 1, 5]);
    let t = t * 2;
    t.print();
}

gives:

  6
  2
  8
  2
 10
[ CPUIntType{5} ]
free(): invalid pointer
Aborted (core dumped)

That's a problem for another day, but for this issue I just want to figure out if there's a config setting that needs to be used to get it to build properly on the 1.12.0 ARM build installed by pip.

jkoudys commented 2 years ago

Okay, it looks like this is a build flag mismatch between the PyTorch that pip installs (1.12.0 or 1.12.1, following the instructions on the PyTorch site) and the build flags tch-rs expects. The libs are different on ARM, and clearly not tested as much as the x86 version.

In case it helps anyone, I was able to get it working by installing torch from source directly from git, and linking that:

$ git clone --branch release/1.12 https://github.com/pytorch/pytorch.git pytorch-1.12
$ cd pytorch-1.12
$ python setup.py install

Then the usual step of pointing the build at the installed torch, which in my case meant adding this to my ~/.profile:

export LIBTORCH=/home/jkoudys/anaconda3/envs/tf/lib/python3.10/site-packages/torch/
export LD_LIBRARY_PATH=${LIBTORCH}/lib:$LD_LIBRARY_PATH

I'd set up Python 3.10 using conda (in an env named tf), as setup.py also needed some conda installs for its deps.

Went back, ran cargo run, and the test prints the tensor (without any crash at the end).

Still think there should be something in the install scripts, as the pip installed version doesn't work with this. Maybe the assert flag stuff needs to be turned off from tch-rs's headers?
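
For a torch that was installed as a Python package, one way to see which std::string ABI it was built with is torch.compiled_with_cxx11_abi(). The sketch below turns that into a value for the LIBTORCH_CXX11_ABI variable that comes up later in this thread. It assumes the variable accepts 0 as well as 1, which the thread itself does not confirm; abi_env_value is a hypothetical helper.

```python
def abi_env_value(compiled_with_cxx11_abi):
    """Map a boolean ABI flag to the string form used in the environment."""
    return "1" if compiled_with_cxx11_abi else "0"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not importable from this Python")
    else:
        # Assumption: LIBTORCH_CXX11_ABI=0 selects the pre-C++11 ABI.
        flag = abi_env_value(torch.compiled_with_cxx11_abi())
        print("export LIBTORCH_CXX11_ABI=" + flag)
```

This only applies when LIBTORCH points at a site-packages torch directory. A standalone libtorch built with cmake has no Python module to ask.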

LaurentMazare commented 2 years ago

Glad that you managed to get it to work. Re deactivating the assert flag in the header file, do you think this would get around the second problem you encountered (free(): invalid pointer)?

jkoudys commented 2 years ago

Probably. The free was likely because I was deleting hundreds of macro calls indiscriminately and with hasty abandon, so one probably slipped past without a matching malloc. If the correct macros are defined, it should work.

Now I'm not entirely sure if this is on tch-rs, pytorch, the pip bundlers, etc. to fix. Seems odd they'd release a package with different debug flags on different targets.


helinwang commented 1 year ago

Seeing the same problem when building libtorch using cmake on release 1.13.1 with aarch64 NixOS, following https://github.com/pytorch/pytorch/blob/master/docs/libtorch.rst#building-libtorch-using-cmake. Now trying python setup.py install as mentioned in this thread.

helinwang commented 1 year ago

Have the same problem when using python setup.py install.

It looks like the error message

pytorch-install/include/c10/core/Device.h:166: undefined reference to `c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::string const&)'

is, for me, due to an ABI problem, since

nm -D pytorch-install/lib/libc10.so |grep torchInternalAssertFail
0000000000049d08 T _ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
0000000000049c98 T _ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_S2_

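shows that torchInternalAssertFail is defined. The library exports the std::__cxx11::basic_string overload, while the error asks for the std::string const& one, so the linker most likely could not match torchInternalAssertFail in libc10.so because of an ABI mismatch.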
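
The mangled names make the mismatch visible without demangling. Below is a rough Python sketch, assuming the Itanium C++ ABI mangling that GCC uses on Linux: `Ss` is the abbreviation for the pre-C++11 std::string, while the newer ABI spells out std::__cxx11::basic_string. The function name string_abi is hypothetical, and the substring test is a heuristic, not a real demangler.

```python
# Mangled prefix of std::__cxx11::basic_string.
CXX11_STRING = "NSt7__cxx1112basic_string"
# Itanium ABI abbreviation for the pre-C++11 std::string.
OLD_STRING = "Ss"


def string_abi(mangled):
    """Guess which std::string ABI a mangled symbol name refers to."""
    if CXX11_STRING in mangled:
        return "cxx11"
    if OLD_STRING in mangled:
        return "pre-cxx11"
    return "no std::string"


# First symbol: the pip wheel's libtorch_cpu.so earlier in the thread.
# Second symbol: libc10.so from the source build in this comment.
symbols = [
    "_ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_RKSs",
    "_ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_RK"
    "NSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE",
]
for symbol in symbols:
    print(string_abi(symbol), symbol)
```

If the ABI reported for the library's symbols differs from the one named in the linker error, the compile flags of the two sides disagree, which is the situation LIBTORCH_CXX11_ABI is meant to address.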

What fixed it for me was setting export LIBTORCH_CXX11_ABI=1 before running cargo build.

Here is the env that works for me:

gcc 8.5 (gcc 9.2, 10, 11, and 12 did not build libtorch for various reasons; gcc 9.5 should work fine)
NixOS, nixpkgs 22.05
aarch64 in a VM on an M1 Mac

git clone -b v1.13.1 --recurse-submodule https://github.com/pytorch/pytorch.git
mkdir pytorch-build
cd pytorch-build
cmake -DBUILD_SHARED_LIBS:BOOL=ON -DCMAKE_BUILD_TYPE:STRING=Release -DPYTHON_EXECUTABLE:PATH=`which python3` -DCMAKE_INSTALL_PREFIX:PATH=../pytorch-install ../pytorch
cmake --build . --target install

And then:

export LIBTORCH_CXX11_ABI=1
cargo build

helinwang commented 1 year ago

Update: GCC 12 can build libtorch with the above commands, but I still get a linking error, so I had to revert to GCC 8.5.

Here is an example of building libtorch with Guix:

guix shell zsh cmake make python python-pyyaml python-typing-extensions gcc-toolchain@8.5.0 -- zsh -c "cd ~/repos; rm -rf pytorch pytorch-install pytorch-build; git clone --depth 1 -b v1.13.1 --recurse-submodule https://github.com/pytorch/pytorch.git; mkdir pytorch-build; cd pytorch-build; cmake -DBUILD_SHARED_LIBS:BOOL=ON -DCMAKE_BUILD_TYPE:STRING=Release -DPYTHON_EXECUTABLE:PATH=`which python3` -DCMAKE_INSTALL_PREFIX:PATH=../pytorch-install ../pytorch; cmake --build . --target install -j `nproc`"

Then

export LIBTORCH_CXX11_ABI=1
cargo build

should work.

Edit: cargo build also needs GCC 8.5 as cc. With GCC 12 there is an error.

crrodger commented 1 year ago

@helinwang you are a genius, thank you.

For anyone else who searches for something like this: I was running into the same problem of certain symbols in the libtorch libraries not being found. I am building pytorch/libtorch on a Raspberry Pi 4B in a Docker container (VSCode on macOS, using remote development on the Pi, with the build environment being an Ubuntu 20.04 Docker container).

I have been fighting with this problem for a few days now and the gcc@8.5.0 was gold. I am using gcc 8.4.0 and it is also working. I have also tried gcc-9 and gcc-10 and both of them failed with the macro assert error.

LaurentMazare / tch-rs

ARM (snapdragon on debian) linker errors for macro-referenced torchInternalAssertFail #529