Closed: borisfom closed this issue 3 months ago
This particular error should be fixed with https://github.com/NVIDIA/TransformerEngine/pull/896.
A somewhat more robust solution would be to build TE with the same ABI as PyTorch (see https://github.com/NVIDIA/TransformerEngine/issues/756#issuecomment-2046572610 and https://github.com/NVIDIA/TransformerEngine/pull/858). However, this would not help us if we are installing TE from a pip wheel instead of building from source (see https://github.com/NVIDIA/TransformerEngine/pull/877).
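For anyone building from source, here is a minimal sketch of how to check which ABI the installed PyTorch was compiled with. `torch.compiled_with_cxx11_abi()` is PyTorch's own query and `_GLIBCXX_USE_CXX11_ABI` is the libstdc++ macro that selects the string ABI. The helper names `torch_cxx11_abi` and `abi_define` are made up for illustration, and how TE's build consumes such a flag is not covered in this thread.

```python
def torch_cxx11_abi():
    """Return True/False for PyTorch's C++11 ABI setting, or None if torch is missing."""
    try:
        import torch
    except ImportError:
        return None
    return bool(torch.compiled_with_cxx11_abi())


def abi_define(use_cxx11_abi):
    """Format the libstdc++ macro definition matching the given ABI setting.

    Hypothetical helper: it only builds the compiler flag string.
    """
    return f"-D_GLIBCXX_USE_CXX11_ABI={int(use_cxx11_abi)}"


if __name__ == "__main__":
    flag = torch_cxx11_abi()
    if flag is None:
        print("torch is not installed in this environment")
    else:
        print("torch built with C++11 ABI:", flag)
        print("matching compiler define:", abi_define(flag))
```

If the printed value differs from the ABI the TE shared objects were compiled with, an undefined-symbol error at import time is the expected symptom.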
Confirmed that https://github.com/NVIDIA/TransformerEngine/pull/896 fixes this issue. If there are no other cases of TE failing because of an ABI mismatch with PyTorch, please feel free to close this issue.
Is TE supposed to automatically detect the C++ ABI used by Torch? Or does it only work with PyTorch from the NV container? I tried to pip install TE in an environment with public Torch and ended up with undefined symbols in the .so, which is clearly looking for a non-C++11 ABI symbol:

    E ImportError: /home/bfomitchev/.local/lib/python3.10/site-packages/transformer_engine_extensions.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN18transformer_engine6getenvIiEETRKSsRKS1

Now the symbols in the installed .so look like C++11:

    bfomitchev@aiapps-040822:/tmp/nemo_build/TransformerEngine/tests/pytorch$ nm /home/bfomitchev/.local/lib/python3.10/site-packages/libtransformer_engine.so | grep getenv
    …
    00000000003756a0 T _ZN18transformer_engine6getenvIaEET_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

    bfomitchev@aiapps-040822:/tmp/nemo_build/TransformerEngine/tests/pytorch$ pip show torch
    Name: torch
    Version: 2.3.1
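The two mangled names quoted above can be told apart mechanically. The sketch below is a rough heuristic, not a demangler: in the Itanium C++ mangling, `Ss` is the abbreviation for the old `std::string`, while the new ABI spells the type out inside the `std::__cxx11` namespace. The function name `guess_string_abi` is made up for illustration, and a plain substring search can misfire if an identifier happens to contain `Ss`.

```python
def guess_string_abi(mangled):
    """Guess which libstdc++ string ABI a mangled symbol refers to.

    Heuristic only: looks for the '__cxx11' namespace (new ABI) or the
    'Ss' abbreviation for the old std::string (pre-C++11 ABI).
    """
    if "__cxx11" in mangled:
        return "cxx11"
    if "Ss" in mangled:
        return "pre-cxx11"
    return "unknown"


# Symbols copied verbatim from the thread.
MISSING = "_ZN18transformer_engine6getenvIiEETRKSsRKS1"
EXPORTED = (
    "_ZN18transformer_engine6getenvIaEET_RKNSt7__cxx11"
    "12basic_stringIcSt11char_traitsIcESaIcEEE"
)

if __name__ == "__main__":
    print("extension wants:", guess_string_abi(MISSING))
    print("library exports:", guess_string_abi(EXPORTED))
```

On these two inputs the heuristic reports `pre-cxx11` for the symbol the extension is looking for and `cxx11` for the one exported by libtransformer_engine.so, which matches the mismatch described in the report.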