NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

TransformerEngine doesn't work with uv #1323

Open jennifgcrl opened 3 weeks ago

jennifgcrl commented 3 weeks ago

setup.py calls uninstall_te_wheel_packages, which fails with "No module named pip" because pip is not installed in the build environment. This is expected because I use uv instead of pip. I think TransformerEngine should not assume pip is available.

Full error:

error: Failed to run uv compile
  × Failed to download and build `transformer-engine @
  │ git+https://github.com/NVIDIA/TransformerEngine.git`
  ╰─▶ Build backend failed to determine requirements with `build_wheel()`
      (exit status: 1)

      [stderr]
      /home/jennifer/.cache/uv/builds-v0/.tmp22OFxW/bin/python: No module
      named pip
      Traceback (most recent call last):
        File "<string>", line 14, in <module>
        File
      "/home/jennifer/.cache/uv/builds-v0/.tmp22OFxW/lib/python3.12/site-packages/setuptools/build_meta.py",
      line 333, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=[])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File
      "/home/jennifer/.cache/uv/builds-v0/.tmp22OFxW/lib/python3.12/site-packages/setuptools/build_meta.py",
      line 303, in _get_build_requires
          self.run_setup()
        File
      "/home/jennifer/.cache/uv/builds-v0/.tmp22OFxW/lib/python3.12/site-packages/setuptools/build_meta.py",
      line 521, in run_setup
          super().run_setup(setup_script=setup_script)
        File
      "/home/jennifer/.cache/uv/builds-v0/.tmp22OFxW/lib/python3.12/site-packages/setuptools/build_meta.py",
      line 319, in run_setup
          exec(code, locals())
        File "<string>", line 148, in <module>
        File
      "/home/jennifer/.cache/uv/git-v0/checkouts/f679917ede501d2d/e5ffaa7/build_tools/utils.py",
      line 305, in uninstall_te_wheel_packages
          subprocess.check_call(
        File
      "/home/jennifer/.rye/py/cpython@3.12.7/lib/python3.12/subprocess.py",
      line 413, in check_call
          raise CalledProcessError(retcode, cmd)
      subprocess.CalledProcessError: Command
      '['/home/jennifer/.cache/uv/builds-v0/.tmp22OFxW/bin/python',
      '-m', 'pip', 'uninstall', '-y', 'transformer_engine_cu12',
      'transformer_engine_torch', 'transformer_engine_paddle',
      'transformer_engine_jax']' returned non-zero exit status 1.

. uv exited with status: exit status: 1

jennifgcrl commented 3 weeks ago

I think setup.py's mechanism of deciding which frameworks to build based on what is present in the current Python environment is incompatible with how uv works: uv builds packages in a clean Python environment, so frameworks installed in the user's environment are not visible at build time.

timmoon10 commented 2 weeks ago

TE has these hard-coded pip calls to work around two problems:

  1. TE doesn't have a graceful way of handling setup-time dependencies. We use pip to install pybind11.
  2. TE has different package structures when installing from PyPI (core + framework packages) and building from source (monolithic package). We need to clean up the framework packages when installing from source to avoid importing the wrong thing.
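
As a stopgap for the failure in the traceback above, the cleanup step could check whether pip is importable before shelling out to it. The following is only a sketch under that assumption, not TE's actual code: the helper names (pip_is_importable, build_uninstall_command, uninstall_if_pip_available) are hypothetical, and the has_pip and run parameters exist only so the logic can be exercised without a real pip call. The command itself mirrors the one shown in the error output.

```python
import importlib.util
import subprocess
import sys
from typing import Callable, List, Sequence


def pip_is_importable() -> bool:
    """Return True if the running interpreter can import pip."""
    return importlib.util.find_spec("pip") is not None


def build_uninstall_command(packages: Sequence[str]) -> List[str]:
    """Build the same `python -m pip uninstall -y ...` command seen in the traceback."""
    return [sys.executable, "-m", "pip", "uninstall", "-y", *packages]


def uninstall_if_pip_available(
    packages: Sequence[str],
    has_pip: Callable[[], bool] = pip_is_importable,
    run: Callable[[List[str]], object] = subprocess.check_call,
) -> bool:
    """Uninstall packages with pip, or skip quietly when pip is absent.

    Hypothetical helper: returns True if the uninstall command was run,
    False if it was skipped because pip is not importable (for example in
    an isolated build environment created by uv).
    """
    if not has_pip():
        return False
    run(build_uninstall_command(packages))
    return True
```

Skipping is arguably reasonable in an isolated build environment, since the wheel packages cannot be installed there in the first place. It does not address stale framework packages in the final target environment.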

The "right" solution to 1 is adding a pyproject.toml (see https://github.com/NVIDIA/TransformerEngine/pull/981 and https://github.com/NVIDIA/TransformerEngine/pull/1061). One problem is that most users will want to disable build isolation (especially when using NGC containers with optimized PyTorch builds), and it's quite annoying we can't make that the default. We may just need to bite the bullet.

Fixing 2 is trickier and would require a revamp of our build infrastructure. It seems to me that the right approach is to install separate framework packages when building from source, so that the structure matches the PyPI case. Pinging @ksivaman.

If you can't rely on automatically detecting which DL frameworks are installed, then the best approach is to set NVTE_FRAMEWORK in the environment (see docs). You can also manually set the DL framework when installing from PyPI with something like pip install transformer_engine[jax] (so I assume uv pip install transformer_engine[jax]?).
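
To illustrate the "explicit setting first, auto-detection as fallback" idea, here is a minimal sketch of how a build script might choose frameworks. It is not TE's implementation. The function name select_frameworks, the comma-separated format, and the accepted values (pytorch, jax, paddle) are assumptions made for illustration; only the NVTE_FRAMEWORK variable name comes from the comment above.

```python
import importlib.util
import os
from typing import Callable, List, Mapping, Optional

# Assumed mapping from framework name to the module that signals its presence.
FRAMEWORK_MODULES = {"pytorch": "torch", "jax": "jax", "paddle": "paddle"}


def _module_present(name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(name) is not None


def select_frameworks(
    environ: Optional[Mapping[str, str]] = None,
    module_present: Callable[[str], bool] = _module_present,
) -> List[str]:
    """Pick frameworks from NVTE_FRAMEWORK, else from what is importable."""
    environ = os.environ if environ is None else environ
    requested = environ.get("NVTE_FRAMEWORK", "").strip()
    if requested:
        # An explicit request wins, so the result does not depend on what
        # happens to be installed in an isolated build environment.
        names = [n.strip().lower() for n in requested.split(",") if n.strip()]
        unknown = [n for n in names if n not in FRAMEWORK_MODULES]
        if unknown:
            raise ValueError(f"Unsupported framework(s): {unknown}")
        return names
    # Fallback: detection by import, which finds nothing in a clean environment.
    return [fw for fw, mod in FRAMEWORK_MODULES.items() if module_present(mod)]
```

With an explicit setting, the outcome is the same whether the build runs in the user's environment or in a clean one, which is the property the uv workflow needs.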