denera commented 4 months ago

Description

This PR improves how rpaths are handled for framework extensions, and makes the core TE library inherit the C++ ABI version setting from PyTorch when TE is being built for PyTorch integration.

Fixes # (issue)

Type of change

[ ] Documentation change (change only to the documentation, either a fix or a new content)
[x] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)

Changes

All TE framework extensions are now compiled with extra_link_args = ['-Wl,-rpath,$ORIGIN'] so that they dynamically load libtransformer_engine.so from the same directory that pip installs the framework extension libraries to. As a result, transformer_engine/common/__init__.py no longer needs to load the core TE library via ctypes.CDLL().
If TE is built for PyTorch integration and the PyTorch installation on the system was built with -D_GLIBCXX_USE_CXX11_ABI=1, then libtransformer_engine.so is compiled with the same C++11 ABI setting, so that the TE/PyTorch extension can link to and dynamically load the TE common library without symbol errors. In this case, the other TE framework extensions (JAX and PaddlePaddle) also inherit the same C++11 ABI option if they are being built alongside TE/PyTorch.
The TE/JAX extension is converted to a pybind11.setup_helpers.Pybind11Extension and now builds as a setuptools.Extension like every other TE framework extension.
Miscellaneous compile warnings in the TE/JAX extension are cleaned up.
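
For readers unfamiliar with the two mechanisms above, here is a minimal sketch of how the rpath and ABI flags might be assembled in a setup.py. It is not the code from this PR: the helper names (cxx11_abi_flag, detect_torch_cxx11_abi, framework_extension_kwargs) and the example extension name are hypothetical, and the fallback used when PyTorch is not installed is an assumption. Only torch.compiled_with_cxx11_abi() and the flag strings themselves are real.

```python
def cxx11_abi_flag(use_cxx11_abi: bool) -> str:
    """Return the compiler define that selects the libstdc++ ABI (hypothetical helper)."""
    return f"-D_GLIBCXX_USE_CXX11_ABI={int(use_cxx11_abi)}"


def detect_torch_cxx11_abi(default: bool = False) -> bool:
    """Report which ABI an installed PyTorch was built with.

    Falls back to `default` when PyTorch is not importable; that fallback
    is an assumption of this sketch, not behavior described in the PR.
    """
    try:
        import torch
    except ImportError:
        return default
    return bool(torch.compiled_with_cxx11_abi())


def framework_extension_kwargs(name: str, sources: list, use_cxx11_abi: bool) -> dict:
    """Build keyword arguments suitable for setuptools.Extension(**kwargs)."""
    return {
        "name": name,
        "sources": list(sources),
        # Match the C++ ABI of the framework the extension links against.
        "extra_compile_args": [cxx11_abi_flag(use_cxx11_abi)],
        # $ORIGIN is resolved by the dynamic loader to the directory that
        # contains the extension itself, so a shared library installed next
        # to it is found without ctypes.CDLL(). setuptools passes the string
        # to the linker without shell expansion, so no escaping is needed.
        "extra_link_args": ["-Wl,-rpath,$ORIGIN"],
    }


if __name__ == "__main__":
    kwargs = framework_extension_kwargs(
        "example_framework_ext",      # hypothetical extension name
        ["example_framework_ext.cpp"],  # hypothetical source file
        detect_torch_cxx11_abi(),
    )
    print(kwargs)
```

The resulting dictionary would be passed to setuptools.Extension (or Pybind11Extension, which accepts the same arguments) inside setup.py.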

Checklist:

[x] I have read and followed the contributing guidelines
[x] The functionality is complete
[x] I have commented my code, particularly in hard-to-understand areas
[x] I have made corresponding changes to the documentation
[x] My changes generate no new warnings
[x] I have added tests that prove my fix is effective or that my feature works
[x] New and existing unit tests pass locally with my changes

denera commented 4 months ago

/te-ci Pytorch

denera commented 4 months ago

/te-ci jax

denera commented 4 months ago

/te-ci pytorch

denera commented 4 months ago

/te-ci jax

denera commented 4 months ago

/te-ci paddle

denera commented 4 months ago

/te-ci paddle

denera commented 4 months ago

@ptrendx @ksivaman @timmoon10 CI is now clean for all three frameworks. I have held off on merging because this is a significant build system change. Please let me know if it is clear to merge, or go ahead and merge whenever it is appropriate.

denera commented 4 months ago

@ksivaman No objections on my end. I think it makes sense to combine.

ksivaman commented 3 months ago

@denera Closing in favor of https://github.com/NVIDIA/TransformerEngine/pull/877

timmoon10 commented 3 months ago

:(

NVIDIA / TransformerEngine

[C/PyTorch/JAX] Build system improvements for rpath and C++11 ABI #858
