NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit floating point (FP8) precision on Hopper and Ada GPUs, providing better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

Undefined symbols when installed with public PyTorch (C++11 ABI issue) #906

Closed: borisfom closed this issue 3 months ago

borisfom commented 3 months ago

Is TE supposed to automatically detect the C++ ABI used by Torch, or does it only work with the PyTorch from the NVIDIA container? I tried to `pip install` TE in an environment with the public PyTorch and ended up with undefined symbols in the .so, which is clearly looking for a non-C++11-ABI symbol:

```
E ImportError: /home/bfomitchev/.local/lib/python3.10/site-packages/transformer_engine_extensions.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN18transformer_engine6getenvIiEET_RKSsRKS1_
```

Yet the symbols in the installed .so look like the C++11 ABI:

```
bfomitchev@aiapps-040822:/tmp/nemo_build/TransformerEngine/tests/pytorch$ nm /home/bfomitchev/.local/lib/python3.10/site-packages/libtransformer_engine.so | grep getenv
…
00000000003756a0 T _ZN18transformer_engine6getenvIaEET_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
```

```
bfomitchev@aiapps-040822:/tmp/nemo_build/TransformerEngine/tests/pytorch$ pip show torch
Name: torch
Version: 2.3.1
```
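For reference, PyTorch exposes its build-time ABI setting, so one side of the mismatch can be checked directly. A minimal diagnostic sketch (not from the original report) using PyTorch's public `torch.compiled_with_cxx11_abi()`:

```python
# Which libstdc++ ABI was the installed PyTorch built with?
# True  -> built with -D_GLIBCXX_USE_CXX11_ABI=1 (mangled names contain "__cxx11")
# False -> pre-C++11 ABI (std::string mangles as the old "Ss" abbreviation,
#          as in the undefined symbol above)
import torch

print(torch.compiled_with_cxx11_abi())
```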

timmoon10 commented 3 months ago

This particular error should be fixed with https://github.com/NVIDIA/TransformerEngine/pull/896.

A somewhat more robust solution would be to build TE with the same ABI as PyTorch (see https://github.com/NVIDIA/TransformerEngine/issues/756#issuecomment-2046572610 and https://github.com/NVIDIA/TransformerEngine/pull/858). However, that would not help when TE is installed from a pip wheel instead of built from source (see https://github.com/NVIDIA/TransformerEngine/pull/877).
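For the build-from-source path, a hedged sketch of what "build TE with the same ABI as PyTorch" could look like: derive the flag from the installed PyTorch and pass it through `CXXFLAGS`. Whether TE's build system picks up `CXXFLAGS` this way is an assumption; the `_GLIBCXX_USE_CXX11_ABI` macro itself is the standard libstdc++ dual-ABI switch.

```python
# Sketch: build TE from a source checkout with the same libstdc++ ABI as the
# installed PyTorch. Assumes the build honors CXXFLAGS (an assumption, not a
# documented TE interface).
import os
import subprocess
import sys

import torch

abi = 1 if torch.compiled_with_cxx11_abi() else 0
env = dict(os.environ)
env["CXXFLAGS"] = (env.get("CXXFLAGS", "") + f" -D_GLIBCXX_USE_CXX11_ABI={abi}").strip()

# Run from the root of a TransformerEngine source checkout.
subprocess.run(
    [sys.executable, "-m", "pip", "install", "--no-build-isolation", "."],
    env=env,
    check=True,
)
```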

borisfom commented 3 months ago

Confirmed that https://github.com/NVIDIA/TransformerEngine/pull/896 fixes this issue. If there are no other cases of TE failing because of an ABI mismatch with PyTorch, please feel free to close this issue.
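As a minimal sanity check after upgrading (an illustrative sketch, not from the thread), simply importing the PyTorch extension is enough to reproduce or rule out the failure, since CPython loads extension modules with eager symbol resolution:

```python
# Importing the extension binds its dynamic symbols; an ABI mismatch would
# raise ImportError with an "undefined symbol" message, as in the report above.
import transformer_engine.pytorch  # noqa: F401

print("transformer_engine loaded with no unresolved symbols")
```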