NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. #550

Closed: antgr closed this issue 4 years ago

antgr commented 4 years ago

NVIDIA-SMI 435.21 Driver Version: 435.21 CUDA Version: 10.1

pip3 install --user -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
/usr/lib/python3.7/site-packages/pip/_internal/commands/install.py:217: UserWarning: Disabling all use of wheels due to the use of --build-options / --global-options / --install-options.
cmdoptions.check_install_build_global(options)
Created temporary directory: /tmp/pip-ephem-wheel-cache-lamo85tb
Created temporary directory: /tmp/pip-req-tracker-du8h9cpx
Created requirements tracker '/tmp/pip-req-tracker-du8h9cpx'
Created temporary directory: /tmp/pip-install-wkd21x48
Processing /home/polykratis/mt-dnn/apex
Created temporary directory: /tmp/pip-req-build-j8rxyv0b
Added file:///home/polykratis/mt-dnn/apex to build tracker '/tmp/pip-req-tracker-du8h9cpx'
Running setup.py (path:/tmp/pip-req-build-j8rxyv0b/setup.py) egg_info for package from file:///home/polykratis/mt-dnn/apex
Running command python setup.py egg_info
torch.__version__ = 1.1.0
running egg_info
creating pip-egg-info/apex.egg-info
writing pip-egg-info/apex.egg-info/PKG-INFO
writing dependency_links to pip-egg-info/apex.egg-info/dependency_links.txt
writing top-level names to pip-egg-info/apex.egg-info/top_level.txt
writing manifest file 'pip-egg-info/apex.egg-info/SOURCES.txt'
reading manifest file 'pip-egg-info/apex.egg-info/SOURCES.txt'
writing manifest file 'pip-egg-info/apex.egg-info/SOURCES.txt'
/tmp/pip-req-build-j8rxyv0b/setup.py:43: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")
Source in /tmp/pip-req-build-j8rxyv0b has version 0.1, which satisfies requirement apex==0.1 from file:///home/polykratis/mt-dnn/apex
Removed apex==0.1 from file:///home/polykratis/mt-dnn/apex from build tracker '/tmp/pip-req-tracker-du8h9cpx'
Installing collected packages: apex
Created temporary directory: /tmp/pip-record-ehkfvhhb
Running setup.py install for apex ...
Running command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-req-build-j8rxyv0b/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" --cpp_ext --cuda_ext install --record /tmp/pip-record-ehkfvhhb/install-record.txt --single-version-externally-managed --compile --user --prefix=
torch.__version__ = 1.1.0
/tmp/pip-req-build-j8rxyv0b/setup.py:43: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")

Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
from /usr/bin

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-req-build-j8rxyv0b/setup.py", line 100, in <module>
    check_cuda_torch_binary_vs_bare_metal(torch.utils.cpp_extension.CUDA_HOME)
  File "/tmp/pip-req-build-j8rxyv0b/setup.py", line 77, in check_cuda_torch_binary_vs_bare_metal
    "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  "
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.  Pytorch binaries were compiled with Cuda 9.0.176.
In some cases, a minor-version mismatch will not cause later errors:  https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  You can try commenting out this check (at your own risk).

error
Cleaning up...
Removing source in /tmp/pip-req-build-j8rxyv0b
Removed build tracker '/tmp/pip-req-tracker-du8h9cpx'
Command "/usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-req-build-j8rxyv0b/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" --cpp_ext --cuda_ext install --record /tmp/pip-record-ehkfvhhb/install-record.txt --single-version-externally-managed --compile --user --prefix=" failed with error code 1 in /tmp/pip-req-build-j8rxyv0b/
Exception information:
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/pip/_internal/cli/base_command.py", line 179, in main
    status = self.run(options, args)
  File "/usr/lib/python3.7/site-packages/pip/_internal/commands/install.py", line 421, in run
    strip_file_prefix=options.strip_file_prefix,
  File "/usr/lib/python3.7/site-packages/pip/_internal/req/__init__.py", line 57, in install_given_reqs
    **kwargs
  File "/usr/lib/python3.7/site-packages/pip/_internal/req/req_install.py", line 949, in install
    spinner=spinner,
  File "/usr/lib/python3.7/site-packages/pip/_internal/utils/misc.py", line 771, in call_subprocess
    % (command_desc, proc.returncode, cwd))
pip._internal.exceptions.InstallationError: Command "/usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-req-build-j8rxyv0b/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" --cpp_ext --cuda_ext install --record /tmp/pip-record-ehkfvhhb/install-record.txt --single-version-externally-managed --compile --user --prefix=" failed with error code 1 in /tmp/pip-req-build-j8rxyv0b/
1 location(s) to search for versions of pip:

dhpollack commented 4 years ago

You need to download CUDA 9.0 and then point your CUDA_HOME and PATH environment variables at that folder. If you want to keep multiple versions of CUDA side by side, you can download the .tar version of CUDA and extract it manually. Otherwise, you can install PyTorch 1.3, which is built against CUDA 10.1.
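
The mismatch described above can be checked by hand before starting a build. The following is a minimal sketch, not apex's actual check: the helper names `nvcc_release` and `matches_torch` are hypothetical, and the inputs are the strings from the log in this thread rather than live command output.

```python
import re


def nvcc_release(nvcc_output):
    """Return (major, minor) parsed from the text printed by `nvcc -V`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def matches_torch(nvcc_output, torch_cuda_version):
    """Compare the nvcc release with the CUDA version PyTorch was built with.

    Only major and minor are compared; the patch component is ignored.
    """
    major, minor = torch_cuda_version.split(".")[:2]
    return nvcc_release(nvcc_output) == (int(major), int(minor))


# Values copied from the original post: nvcc reports 10.0,
# while the PyTorch binaries were compiled with CUDA 9.0.176.
NVCC_LOG = "Cuda compilation tools, release 10.0, V10.0.130"
print(matches_torch(NVCC_LOG, "9.0.176"))  # False
```

In a real environment the second argument would typically come from `torch.version.cuda`, and the first from running the `nvcc -V` found under CUDA_HOME.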

antgr commented 4 years ago

Thank you

pedrocolon93 commented 4 years ago

In case anyone has this problem, what worked for me was running pip uninstall torch a few times and then reinstalling with conda. It seems I had older versions of PyTorch installed that apex was picking up. After the third uninstall, pip reported that torch was no longer installed, and I then installed it through conda. It worked after this.
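
When several installations may be shadowing each other, it can help to see which copy Python actually resolves before uninstalling anything. This is a small sketch using only the standard library; the helper name `locate` is hypothetical.

```python
import importlib.util


def locate(module_name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


# Prints the path of the torch package that `import torch` would load,
# or None if no installation is visible to this interpreter.
print(locate("torch"))
```

If the printed path points into an unexpected site-packages directory (for example a --user install rather than the conda environment), that is the copy the apex build will see.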

ashesh-0 commented 4 years ago

Has anyone figured out how to install apex with

CUDA Version 10.0*
torch==1.5.0

It would be immensely helpful if there were a way to install apex with the most recent versions of CUDA and torch. I cannot downgrade CUDA to a lower version. Thanks!

hansen7 commented 4 years ago

I think PyTorch 1.5.0 is compiled with CUDA 10.2

ashesh-0 commented 4 years ago

Yeah. That is correct. Since I could not upgrade CUDA, I downgraded PyTorch. Running conda install gxx_linux-64 followed by conda install pytorch torchvision cudatoolkit=10.0 -c pytorch did the trick for me.

gauenk commented 4 years ago

I fixed this issue by running export CUDA_HOME=/usr/local/cuda-10.2/

pyaf commented 3 years ago

I had cuda-10.0 installed on my system (only /usr/local/cuda-10.0), but my PyTorch had been installed with cuda-11.0, and that is why the compilation was throwing this error. I installed the cuda-11.0 toolkit only (I didn't touch the drivers), which left two CUDA versions on my system. That is completely fine; you just need to point to the one you want to use at compile time. After this, I ran export CUDA_HOME=/usr/local/cuda-11.0/ and tried compiling again. It worked!
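
With more than one toolkit installed, picking the directory to export as CUDA_HOME can be sketched as below. This assumes the toolkits live in directories named cuda-X.Y, as in the comment above; the helper name `pick_cuda_home` and the candidate list are illustrative only.

```python
import os


def pick_cuda_home(torch_cuda_version, candidates):
    """Return the candidate directory named cuda-<major>.<minor>, or None."""
    wanted = ".".join(torch_cuda_version.split(".")[:2])
    for path in candidates:
        # Strip a trailing slash so basename() returns the directory name.
        name = os.path.basename(path.rstrip("/"))
        if name == "cuda-" + wanted:
            return path
    return None


# Hypothetical layout matching the comment above: two toolkits side by side.
installed = ["/usr/local/cuda-10.0", "/usr/local/cuda-11.0/"]
print(pick_cuda_home("11.0", installed))
```

The returned path is what would be exported as CUDA_HOME before rerunning the apex install. On a real machine the candidate list could be gathered with `glob.glob("/usr/local/cuda-*")`.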

Lavenderjiang commented 3 years ago

After a long time of Googling, I found that each version of CUDA is compatible with a different range of gcc versions. I was using CUDA 10.2, and downgrading gcc to 6.1 solved this problem for me.
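
A host-compiler check along these lines can be sketched as follows. The maximum supported gcc major version is passed in as a parameter because it depends on the CUDA release and should be looked up in NVIDIA's installation guide; the value 8 used below is only a placeholder assumption, and both helper names are hypothetical.

```python
def gcc_major(dumpversion_output):
    """Parse the major version from the text printed by `gcc -dumpversion`."""
    return int(dumpversion_output.strip().split(".")[0])


def host_compiler_ok(dumpversion_output, max_supported_major):
    """True if the gcc major version does not exceed the given maximum."""
    return gcc_major(dumpversion_output) <= max_supported_major


# Placeholder limit of 8: verify the real limit for your CUDA release.
print(host_compiler_ok("6.1.0\n", 8))  # True
print(host_compiler_ok("9.3.0\n", 8))  # False
```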

serg06 commented 2 years ago

Thanks guys! I was able to install Apex for my conda PyTorch installation with your help. Here is the full step by step:

daafonsecato commented 2 years ago

In addition to https://github.com/NVIDIA/apex/issues/550#issuecomment-1059985098, remember to change the version in the install command to the one you need, for example sudo apt-get -y install cuda-11-3

Rainbowman0 commented 1 year ago

You can use the nvidia-smi and nvcc -V commands to check whether the CUDA version reported by the NVIDIA driver is consistent with the CUDA compiler version. If they are not consistent, this error will be reported. For example, my previous setup (two screenshots, not reproduced here) led to the same error.
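
The comparison can be sketched by putting the version numbers side by side. Note that the traceback in the original post compares the nvcc version against the CUDA version PyTorch was built with, so that third number is included here as well. The parsing helpers are hypothetical, and the inputs are the strings quoted earlier in this thread.

```python
import re


def smi_cuda_version(smi_header):
    """Extract the 'CUDA Version: X.Y' field from the nvidia-smi header."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", smi_header)
    return match.group(1) if match else None


def nvcc_cuda_version(nvcc_output):
    """Extract 'X.Y' from the 'release X.Y' text printed by nvcc -V."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


# Strings taken from the original post in this thread.
SMI = "NVIDIA-SMI 435.21 Driver Version: 435.21 CUDA Version: 10.1"
NVCC = "Cuda compilation tools, release 10.0, V10.0.130"

report = {
    "driver (nvidia-smi)": smi_cuda_version(SMI),
    "toolkit (nvcc -V)": nvcc_cuda_version(NVCC),
    "pytorch build": "9.0.176",  # from the RuntimeError message
}
for source, version in report.items():
    print(source, "->", version)
```

For the original poster all three numbers differ, which is why changing either the toolkit pointed to by CUDA_HOME or the installed PyTorch build resolved the error for other commenters.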