NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
Apache License 2.0

Undefined symbols when installed with public PyTorch (C++11 ABI issue) #906

Closed borisfom closed 3 months ago

borisfom commented 3 months ago

Is TE supposed to automatically detect the C++ ABI used by PyTorch, or does it only work with PyTorch from the NVIDIA container? I tried to pip install TE in an environment with public PyTorch and ended up with an undefined symbol in the extension .so, which is clearly looking for a non-C++11 ABI symbol:

    E ImportError: /home/bfomitchev/.local/lib/python3.10/site-packages/transformer_engine_extensions.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN18transformer_engine6getenvIiEETRKSsRKS1

However, the symbols in the installed libtransformer_engine.so look like C++11 ABI symbols:

    bfomitchev@aiapps-040822:/tmp/nemo_build/TransformerEngine/tests/pytorch$ nm /home/bfomitchev/.local/lib/python3.10/site-packages/libtransformer_engine.so | grep getenv
    …
    00000000003756a0 T _ZN18transformer_engine6getenvIaEET_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

    bfomitchev@aiapps-040822:/tmp/nemo_build/TransformerEngine/tests/pytorch$ pip show torch
    Name: torch
    Version: 2.3.1
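The mismatch between the two symbols above can be read directly from the mangled names: libstdc++'s C++11 ABI places `std::string` in the inline namespace `__cxx11`, while the older ABI uses the Itanium abbreviation `Ss`. The sketch below is only a rough heuristic under that assumption, not a real demangler; the helper name `string_abi` is made up for illustration, and the substring check for `Ss` can give false positives on unrelated identifiers.

```python
def string_abi(mangled: str) -> str:
    """Heuristically guess which libstdc++ std::string ABI a mangled symbol uses.

    Returns "cxx11", "pre-cxx11", or "unknown". This is a substring check,
    not a parser for the Itanium C++ ABI mangling scheme.
    """
    # The C++11 ABI mangles std::string via the inline namespace std::__cxx11.
    if "__cxx11" in mangled:
        return "cxx11"
    # The old ABI abbreviates std::basic_string<char, ...> as "Ss".
    if "Ss" in mangled:
        return "pre-cxx11"
    # No std::string in the signature, or a pattern this heuristic misses.
    return "unknown"


if __name__ == "__main__":
    # Symbols copied verbatim from the report above.
    wanted = "_ZN18transformer_engine6getenvIiEETRKSsRKS1"
    provided = (
        "_ZN18transformer_engine6getenvIaEET_RKNSt7__cxx1112basic_string"
        "IcSt11char_traitsIcESaIcEEE"
    )
    print("extension wants:", string_abi(wanted))
    print("library provides:", string_abi(provided))
```

Applied to the two symbols in this report, the check classifies the missing symbol as old-ABI and the exported one as C++11-ABI, which matches the diagnosis of an ABI mismatch between the extension and libtransformer_engine.so.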

timmoon10 commented 3 months ago

This particular error should be fixed with https://github.com/NVIDIA/TransformerEngine/pull/896.

A somewhat more robust solution would be to build TE with the same ABI as PyTorch (see https://github.com/NVIDIA/TransformerEngine/issues/756#issuecomment-2046572610 and https://github.com/NVIDIA/TransformerEngine/pull/858). However, this would not help us if we are installing TE from a pip wheel instead of building from source (see https://github.com/NVIDIA/TransformerEngine/pull/877).
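As a rough illustration of what "the same ABI as PyTorch" means for a source build: libstdc++ selects the string ABI through the `_GLIBCXX_USE_CXX11_ABI` macro, and PyTorch reports its own setting through `torch.compiled_with_cxx11_abi()`. The sketch below derives the matching compiler define; it is not TE's actual build logic, and the helper names `abi_define` and `torch_abi_define` are hypothetical.

```python
from typing import Optional


def abi_define(use_cxx11_abi: bool) -> str:
    """Return the libstdc++ compiler define for the given ABI setting."""
    return f"-D_GLIBCXX_USE_CXX11_ABI={int(use_cxx11_abi)}"


def torch_abi_define() -> Optional[str]:
    """Return the define matching the installed PyTorch, or None if torch is absent."""
    try:
        import torch
    except ImportError:
        return None
    # compiled_with_cxx11_abi() reports the flag PyTorch itself was built with.
    return abi_define(torch.compiled_with_cxx11_abi())


if __name__ == "__main__":
    print(torch_abi_define())
```

Any flag obtained this way only helps when compiling from source; a prebuilt wheel has its ABI fixed at the time the wheel was built.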

borisfom commented 3 months ago

Confirmed that https://github.com/NVIDIA/TransformerEngine/pull/896 fixes this issue. If there are no other known cases where TE fails because of an ABI mismatch with PyTorch, please feel free to close this issue.