NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
BSD 3-Clause "New" or "Revised" License

Issue Installing Apex in WSL Environment #1676

Open l8g opened 1 year ago

l8g commented 1 year ago

🐛 Bug

I'm having a problem installing Apex in a WSL environment. The Apex installation script tries to find the CUDA installation directory and run the nvcc -V command. In a WSL environment, CUDA is supported through the NVIDIA WSL driver, so there may be no full CUDA installation directory and nvcc may not be on the PATH.

To Reproduce

I attempted to install Apex in WSL using the following commands:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir ./

I then received the following error:

File "/home/ldd/nlp/apex/setup.py", line 130, in <module>
  _, bare_metal_version = get_cuda_bare_metal_version(CUDA_HOME)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ldd/nlp/apex/setup.py", line 17, in get_cuda_bare_metal_version
  raw_output = subprocess.check_output([cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True)
                                        ~~~~~~~~~^~~~~~~~~~~~~
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
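The traceback can be reproduced in isolation. When torch cannot locate a CUDA toolkit, CUDA_HOME is None, and concatenating it with "/bin/nvcc" raises exactly this TypeError before nvcc is ever invoked. A minimal sketch (the helper mirrors apex's get_cuda_bare_metal_version; the None value stands in for the WSL situation):

```python
import subprocess

def get_cuda_bare_metal_version(cuda_dir):
    # Mirrors apex's helper: shells out to nvcc to read the toolkit version.
    raw_output = subprocess.check_output([cuda_dir + "/bin/nvcc", "-V"],
                                         universal_newlines=True)
    return raw_output

# On a system where no local toolkit is found, cuda_dir arrives as None,
# so the string concatenation fails before nvcc is even looked up.
CUDA_HOME = None
try:
    get_cuda_bare_metal_version(CUDA_HOME)
except TypeError as e:
    print(e)
```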

Expected behavior

I would expect Apex to be installable in a WSL environment without needing a full CUDA installation directory or nvcc.

Environment

- OS: Ubuntu 20.04 on WSL 2
- Python version: 3.11
- PyTorch version: 2.0.1
- CUDA version: NVIDIA CUDA 11.3 driver for Windows
- GPU models: [e.g. NVIDIA RTX 2080]
- Apex version: master branch as of 2023-06-06
- GCC version: [e.g. 7.5]
- Any other relevant information:

Additional context

I'm trying to run a deep learning project that depends on Apex. I'm unable to run this project as Apex cannot be installed in my WSL environment.

crcrpar commented 1 year ago

Did you set the CUDA_HOME environment variable? If not, could you try setting it?
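For context on why this suggestion usually works: apex's setup.py gets CUDA_HOME from torch.utils.cpp_extension, which honors the environment variable before falling back to a default location. A rough sketch of that resolution order, approximated here without torch (the /usr/local/cuda fallback path is torch's conventional default, not something apex defines):

```python
import os

def resolve_cuda_home():
    # torch.utils.cpp_extension checks the CUDA_HOME env var first, then
    # falls back to guessing a conventional install location; when neither
    # exists (as in a driver-only WSL setup), the result is None.
    cuda_home = os.environ.get("CUDA_HOME")
    if cuda_home is None and os.path.exists("/usr/local/cuda"):
        cuda_home = "/usr/local/cuda"
    return cuda_home
```

On a stock WSL2 install with only the Windows-side driver, both branches come up empty, which is how None reaches the version check.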

l8g commented 1 year ago

I appreciate your prompt response. I would like to clarify a few things about my setup:

  1. I am working within a WSL2 (Windows Subsystem for Linux) environment, not a traditional Linux one.
  2. In my Windows system, the CUDA installation directory does not contain all the files and directories that WSL2 would expect from a full CUDA install. In particular, nvcc is missing because WSL2, by design, does not have access to a full CUDA installation that is present in the Windows system.
  3. I have tried setting the CUDA_HOME environment variable in WSL2, but it did not solve the issue, as the directory that this variable points to does not contain nvcc.

Given these constraints, I am currently unable to install Apex in my WSL2 environment. I was wondering if you have any recommendations for installing Apex under WSL2, or if there are plans to support WSL2 in the future?

Thank you for your time and consideration.

Best regards,

crcrpar commented 1 year ago

Thank you for the clarification; I misunderstood some bits. Could you try commenting out https://github.com/NVIDIA/apex/blob/05091d498d21058a0fe736b828c43431d4f0dda2/setup.py#L130 ? Since your install command doesn't build any custom extensions, I don't think the CUDA version check is needed.
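A gentler alternative to deleting the line outright would be to skip the check only when no toolkit is present. This is a hypothetical sketch, not apex's actual code; check_cuda_version and the get_version callback are names invented here for illustration:

```python
def check_cuda_version(cuda_home, get_version):
    # If no local CUDA toolkit was found, a plain `pip install ./` builds no
    # CUDA extensions, so the bare-metal version check can be skipped safely.
    if cuda_home is None:
        print("nvcc not found; skipping CUDA version check (Python-only build)")
        return None
    # Otherwise defer to the real version query (e.g. running nvcc -V).
    return get_version(cuda_home)
```

Guarding rather than deleting keeps the check intact for users who do build the extensions against a local toolkit.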

l8g commented 1 year ago

Thank you very much for your guidance. I have successfully installed Apex and it seems to be functioning correctly. I ran a script that tests Apex's Automatic Mixed Precision (AMP) feature and everything worked as expected.

However, during the test, I received a warning that the multi_tensor_applier fused unscale kernel is unavailable, and Apex was using a Python fallback. The message suggested this might be because Apex was installed without --cuda_ext --cpp_ext.

This isn't causing me any problems at the moment, but I am wondering if it may impact performance, and if so, what I should do about it. My understanding is that compiling the CUDA extensions in WSL2 might be nontrivial due to the unique setup.

Any further advice you could provide on this topic would be greatly appreciated.
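The warning reflects a try-import pattern: the fused kernels live in a compiled module (amp_C) that only exists when apex is built with the --cpp_ext --cuda_ext options, and apex falls back to pure Python when the import fails. A small sketch of that availability probe (fused_unscale_available is a name invented here, not an apex API):

```python
import importlib.util

def fused_unscale_available():
    # amp_C is the compiled extension module that ships the fused
    # multi-tensor kernels; it is absent in a Python-only install,
    # which is what triggers apex's fallback warning.
    return importlib.util.find_spec("amp_C") is not None
```

The fallback is functionally equivalent but slower, since unscaling happens tensor-by-tensor in Python instead of in one fused CUDA kernel, so it can matter for training throughput.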

ingura commented 10 months ago

For others who run into this problem: besides commenting out the version check, I also had to add "packaging" and "torch" as requirements in pyproject.toml to make it work.
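For reference, that change amounts to listing the two packages under the build-system requirements; this is a hedged sketch of such a fragment, not the file as it appears in the apex repository:

```toml
[build-system]
# "packaging" and "torch" are imported by setup.py at build time,
# so they must be available in the build environment.
requires = ["setuptools", "wheel", "packaging", "torch"]
build-backend = "setuptools.build_meta"
```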