Open jkoudys opened 2 years ago
Here's the cargo build -vv output:
Caused by:
process didn't exit successfully: `CARGO=/home/jkoudys/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin/cargo CARGO_BIN_NAME=clausehound-ml CARGO_CRATE_NAME=clausehound_ml CARGO_MANIFEST_DIR=/home/jkoudys/clausehound/clausehound-ml CARGO_PKG_AUTHORS='' CARGO_PKG_DESCRIPTION='' CARGO_PKG_HOMEPAGE='' CARGO_PKG_LICENSE='' CARGO_PKG_LICENSE_FILE='' CARGO_PKG_NAME=clausehound-ml CARGO_PKG_REPOSITORY='' CARGO_PKG_VERSION=0.1.0 CARGO_PKG_VERSION_MAJOR=0 CARGO_PKG_VERSION_MINOR=1 CARGO_PKG_VERSION_PATCH=0 CARGO_PKG_VERSION_PRE='' CARGO_PRIMARY_PACKAGE=1 LD_LIBRARY_PATH='/home/jkoudys/clausehound/clausehound-ml/target/debug/deps:/home/jkoudys/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/lib:/home/jkoudys/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/lib:/home/jkoudys/.local/lib/python3.10/site-packages/torch/lib' rustc --crate-name clausehound_ml --edition=2021 src/main.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type bin --emit=dep-info,link -C embed-bitcode=no -C debuginfo=2 -C metadata=831b607e724f3ee9 -C extra-filename=-831b607e724f3ee9 --out-dir /home/jkoudys/clausehound/clausehound-ml/target/debug/deps -C incremental=/home/jkoudys/clausehound/clausehound-ml/target/debug/incremental -L dependency=/home/jkoudys/clausehound/clausehound-ml/target/debug/deps --extern tch=/home/jkoudys/clausehound/clausehound-ml/target/debug/deps/libtch-b403d46711a7e857.rlib -L native=/home/jkoudys/.local/lib/python3.10/site-packages/torch/lib -L native=/home/jkoudys/clausehound/clausehound-ml/target/debug/build/torch-sys-6b8dc86f3bb2ab85/out -L native=/home/jkoudys/clausehound/clausehound-ml/target/debug/build/bzip2-sys-da4dd79fd35cb19f/out/lib -L native=/home/jkoudys/clausehound/clausehound-ml/target/debug/build/zstd-sys-fc39f9717c9fb857/out` (exit status: 1)
showing it's including the torch lib directory, where I can see those libs:
$ ls /home/jkoudys/.local/lib/python3.10/site-packages/torch/lib
libc10.so libgomp-d22c30c5.so.1 libshm.so libtorch_cpu.so libtorch_global_deps.so libtorch_python.so libtorch.so
Got it to build by manually deleting all those ASSERT macros throughout the headers. Sorta runs, but not great:
use tch::Tensor;

fn main() {
    let t = Tensor::of_slice(&[3, 1, 4, 1, 5]);
    let t = t * 2;
    t.print();
}
gives:
6
2
8
2
10
[ CPUIntType{5} ]
free(): invalid pointer
Aborted (core dumped)
That's a problem for another day, but for this issue I just want to figure out if there's a config setting that needs to be used to get it to build properly on the 1.12.0 ARM build installed by pip.
Okay, looks like this is a build flag mismatch between the PyTorch installed by pip (1.12.0 or 1.12.1, as per the PyTorch site) and the build flags tch-rs expects to be built in. The libs are different on ARM, and clearly not tested as much as the x86 version.
In case it helps anyone, I was able to get it working by installing torch from source directly from git, and linking that:
$ git clone --branch release/1.12 https://github.com/pytorch/pytorch.git pytorch-1.12
$ cd pytorch-1.12
$ python setup.py install
then the usual to point the build to the installed torch, in my case added to my ~/.profile:
export LIBTORCH=/home/jkoudys/anaconda3/envs/tf/lib/python3.10/site-packages/torch/
export LD_LIBRARY_PATH=${LIBTORCH}/lib:$LD_LIBRARY_PATH
I'd set up python3.10 using conda (in the tf env), as the setup.py also had some conda installs it needed to do for deps.
Went back, cargo run, and the test runs showing the tensor (and without any crash at the end).
Still think there should be something in the install scripts, as the pip installed version doesn't work with this. Maybe the assert flag stuff needs to be turned off from tch-rs's headers?
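As a quick sanity check before blaming the build flags, it can help to confirm that LIBTORCH really points at a directory containing the shared objects from the ls output above. The following is a minimal Rust sketch, not anything tch-rs itself does: the helper name missing_libs and the three-file list are assumptions made for illustration, and it only checks that the files exist, not that they were built with compatible flags.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Shared objects seen in the `ls` output earlier in the thread.
/// (Assumption: these three are the ones worth checking; adjust as needed.)
const EXPECTED_LIBS: [&str; 3] = ["libc10.so", "libtorch_cpu.so", "libtorch.so"];

/// Returns the expected libraries that are absent from `<libtorch_root>/lib`.
fn missing_libs(libtorch_root: &Path) -> Vec<PathBuf> {
    let lib_dir = libtorch_root.join("lib");
    EXPECTED_LIBS
        .iter()
        .map(|name| lib_dir.join(name))
        .filter(|path| !path.exists())
        .collect()
}

fn main() {
    match env::var_os("LIBTORCH") {
        None => eprintln!("LIBTORCH is not set"),
        Some(root) => {
            let missing = missing_libs(Path::new(&root));
            if missing.is_empty() {
                println!("all expected libraries found under {:?}", root);
            } else {
                for path in missing {
                    eprintln!("missing: {}", path.display());
                }
            }
        }
    }
}
```

If this reports nothing missing and the link still fails, the problem is more likely in how the libraries were built than in where they are.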
Glad that you managed to get it to work. Re deactivating the assert flag in the header file, do you think this would get around the second problem you encountered (free(): invalid pointer)?
Probably. The free was likely because I was deleting hundreds of macro calls indiscriminately and with hasty abandon, so one probably slipped past without a matching malloc. If the correct macros are defined, it should work.
Now I'm not entirely sure if this is on tch-rs, pytorch, the pip bundlers, etc. to fix. Seems odd they'd release a package with different debug flags on different targets.
Seeing the same problem when building libtorch using cmake on release 1.13.1 with aarch64 NixOS: https://github.com/pytorch/pytorch/blob/master/docs/libtorch.rst#building-libtorch-using-cmake
Trying python setup.py install as mentioned in this thread.
Have the same problem when using python setup.py install.
Looks like the error message
pytorch-install/include/c10/core/Device.h:166: undefined reference to `c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::string const&)'
for me is due to an ABI problem, since
nm -D pytorch-install/lib/libc10.so |grep torchInternalAssertFail
0000000000049d08 T _ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
0000000000049c98 T _ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_S2_
shows that torchInternalAssertFail is exported. So it is likely an ABI problem that keeps the linker from recognizing torchInternalAssertFail in libc10.so.
What fixed it for me was setting export LIBTORCH_CXX11_ABI=1 before running cargo build.
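The mismatch is visible in the symbol names themselves. With GCC's name mangling, a signature that uses the new-ABI std::string carries the __cxx11 namespace tag, which is what the first nm line shows, while the undefined reference in the error is printed as plain std::string, the old-ABI form. Here is a small Rust sketch of that check, assuming a substring test is good enough (the helper name uses_cxx11_abi is made up for illustration and is not part of tch-rs or PyTorch):

```rust
/// Heuristic: a mangled symbol that mentions the `__cxx11` inline namespace
/// was compiled against the new libstdc++ ABI (_GLIBCXX_USE_CXX11_ABI=1).
/// Symbols whose signature has no std::string (or similar) carry no tag,
/// so `false` means "no evidence here", not "old ABI".
fn uses_cxx11_abi(mangled: &str) -> bool {
    mangled.contains("__cxx11")
}

fn main() {
    // The two exported symbols from the nm output above.
    let symbols = [
        "_ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE",
        "_ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_S2_",
    ];
    for s in symbols {
        println!("{} cxx11_abi_tag={}", s, uses_cxx11_abi(s));
    }
}
```

The first symbol takes a std::string argument and is tagged; the second takes only char const* arguments, so it looks the same under either ABI.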
Here is the env that works for me:
git clone -b v1.13.1 --recurse-submodule https://github.com/pytorch/pytorch.git
mkdir pytorch-build
cd pytorch-build
cmake -DBUILD_SHARED_LIBS:BOOL=ON -DCMAKE_BUILD_TYPE:STRING=Release -DPYTHON_EXECUTABLE:PATH=`which python3` -DCMAKE_INSTALL_PREFIX:PATH=../pytorch-install ../pytorch
cmake --build . --target install
And then:
export LIBTORCH_CXX11_ABI=1
cargo build
Update: GCC 12 with the above command is able to build libtorch, but I still get a linking error. I have to revert to GCC 8.5.
Here is an example of building libtorch with GUIX:
guix shell zsh cmake make python python-pyyaml python-typing-extensions gcc-toolchain@8.5.0 -- zsh -c "cd ~/repos; rm -rf pytorch pytorch-install pytorch-build; git clone --depth 1 -b v1.13.1 --recurse-submodule https://github.com/pytorch/pytorch.git; mkdir pytorch-build; cd pytorch-build; cmake -DBUILD_SHARED_LIBS:BOOL=ON -DCMAKE_BUILD_TYPE:STRING=Release -DPYTHON_EXECUTABLE:PATH=`which python3` -DCMAKE_INSTALL_PREFIX:PATH=../pytorch-install ../pytorch; cmake --build . --target install -j `nproc`"
Then
export LIBTORCH_CXX11_ABI=1
cargo build
should work.
Edit: cargo build also needs GCC 8.5 for cc. With GCC 12 there is an error.
@helinwang you are a genius, thank you.
For anyone else who searches for something like this I was running into the same problem of certain symbols in the libtorch libraries not being found. I am compiling/building pytorch/libtorch on a Raspberry Pi 4B in a Docker container (VSCode on MacOS, using remote development on a Raspberry Pi 4 with the build environment on the Pi being an Ubuntu 20.04 Docker container).
I have been fighting with this problem for a few days now and the gcc@8.5.0 was gold. I am using gcc 8.4.0 and it is also working. I have also tried gcc-9 and gcc-10 and both of them failed with the macro assert error.
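To make the compiler reports in this thread easier to act on, here is a rough Rust sketch that reads the local GCC major version and compares it with what people above reported (8.4.0 and 8.5 worked; 9, 10 and 12 failed). The names Reported, parse_major and reported_status are hypothetical, and the table is only anecdotal evidence from these comments for PyTorch 1.13.1 on aarch64, not a compatibility matrix.

```rust
use std::process::Command;

/// Outcome reported in this thread for a given GCC major version.
#[derive(Debug, PartialEq)]
enum Reported {
    Worked,  // 8.4.0 and 8.5
    Failed,  // 9, 10 and 12
    Unknown, // not mentioned in the thread
}

/// Extracts the major version from `gcc -dumpversion` output such as "8.5.0" or "12".
fn parse_major(version: &str) -> Option<u32> {
    version.trim().split('.').next()?.parse().ok()
}

/// Maps a major version to what commenters in the thread observed.
fn reported_status(major: u32) -> Reported {
    match major {
        8 => Reported::Worked,
        9 | 10 | 12 => Reported::Failed,
        _ => Reported::Unknown,
    }
}

fn main() {
    let output = match Command::new("gcc").arg("-dumpversion").output() {
        Ok(out) => out,
        Err(err) => {
            eprintln!("could not run gcc: {err}");
            return;
        }
    };
    let text = String::from_utf8_lossy(&output.stdout);
    match parse_major(&text) {
        Some(major) => println!("gcc major {major}: {:?}", reported_status(major)),
        None => eprintln!("could not parse gcc version from {text:?}"),
    }
}
```

Note that cargo build picks up the compiler through cc, so the gcc found on PATH here may not be the one the build actually uses.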
I'm on ARM. snapdragon cpu on a chromebook chroot (crouton running debian). Seeing linker errors building the simple example app from the readme. Installed torch via pip3, and runs fine from python using some simple test scripts:
Yet I get this error everywhere their macro for assertions is used when I do cargo build:
I see it running with the right lib flags "-ltorch_cpu" "-ltorch", and the torchInternalAssertFail appears to exist in the lib:
and the lib env vars are set:
I tried downgrading to 1.12.0 (I was 1.12.1 originally) and same problem.
The torchInternalAssertFail is only ever included when built by a macro, eg:
I'm wondering if perhaps the flag that defines these TORCH_ assert macros as calling torchInternalAssertFail is on in one place, but off in the other, so those functions don't get defined. I'm guessing it's an ARM build target thing, because nobody else seems to see this.