NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

Can't install CUDA and C++ version #462

Open yangkevin2 opened 4 years ago

yangkevin2 commented 4 years ago

Hi,

I followed the quick-start instructions in the README, and I am able to install the python-only version just fine. However, when I try to install the CUDA and C++ version (after uninstalling the python-only version) using

pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

I eventually get the following error, and nothing is installed:

Traceback (most recent call last):
  File "/data/rsg/chemistry/yangk/conda/envs/cuda10/lib/python3.6/site-packages/pip/_internal/cli/base_command.py", line 178, in main
    status = self.run(options, args)
  File "/data/rsg/chemistry/yangk/conda/envs/cuda10/lib/python3.6/site-packages/pip/_internal/commands/install.py", line 414, in run
    use_user_site=options.use_user_site,
  File "/data/rsg/chemistry/yangk/conda/envs/cuda10/lib/python3.6/site-packages/pip/_internal/req/__init__.py", line 58, in install_given_reqs
    **kwargs
  File "/data/rsg/chemistry/yangk/conda/envs/cuda10/lib/python3.6/site-packages/pip/_internal/req/req_install.py", line 953, in install
    spinner=spinner,
  File "/data/rsg/chemistry/yangk/conda/envs/cuda10/lib/python3.6/site-packages/pip/_internal/utils/misc.py", line 776, in call_subprocess
    % (command_desc, proc.returncode, cwd))
pip._internal.exceptions.InstallationError: Command "/data/rsg/chemistry/yangk/conda/envs/cuda10/bin/python -u -c 'import setuptools, tokenize;__file__='"'"'/tmp/pip-req-build-fpm1mkcy/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-bio8bemk/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-req-build-fpm1mkcy/

I'm on Linux using a conda environment running python 3.6.8, pytorch 1.1.0, and CUDA 10.0.130. Is there some other issue with my environment?

Thanks!

ptrblck commented 4 years ago

Hi @yangkevin2,

could you post the complete install log, as it seems the error message is missing?

yangkevin2 commented 4 years ago

Sorry, here's the full log:

/data/rsg/chemistry/yangk/conda/envs/cuda10/lib/python3.6/site-packages/pip/_internal/commands/install.py:244: UserWarning: Disabling all use of wheels due to the use of --build-options / --global-options / --install-options.
  cmdoptions.check_install_build_global(options)
Created temporary directory: /tmp/pip-ephem-wheel-cache-2fyqbfto
Created temporary directory: /tmp/pip-req-tracker-gdqvsxif
Created requirements tracker '/tmp/pip-req-tracker-gdqvsxif'
Created temporary directory: /tmp/pip-install-rthpl4ei
Processing /data/rsg/chemistry/yangk/apex
  Created temporary directory: /tmp/pip-req-build-chq189jg
  Added file:///data/rsg/chemistry/yangk/apex to build tracker '/tmp/pip-req-tracker-gdqvsxif'
    Running setup.py (path:/tmp/pip-req-build-chq189jg/setup.py) egg_info for package from file:///data/rsg/chemistry/yangk/apex
    Running command python setup.py egg_info
    torch.__version__  =  1.1.0
    running egg_info
    creating pip-egg-info/apex.egg-info
    writing pip-egg-info/apex.egg-info/PKG-INFO
    writing dependency_links to pip-egg-info/apex.egg-info/dependency_links.txt
    writing top-level names to pip-egg-info/apex.egg-info/top_level.txt
    writing manifest file 'pip-egg-info/apex.egg-info/SOURCES.txt'
    reading manifest file 'pip-egg-info/apex.egg-info/SOURCES.txt'
    writing manifest file 'pip-egg-info/apex.egg-info/SOURCES.txt'
    /tmp/pip-req-build-chq189jg/setup.py:33: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
      warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")
  Source in /tmp/pip-req-build-chq189jg has version 0.1, which satisfies requirement apex==0.1 from file:///data/rsg/chemistry/yangk/apex
  Removed apex==0.1 from file:///data/rsg/chemistry/yangk/apex from build tracker '/tmp/pip-req-tracker-gdqvsxif'
Skipping bdist_wheel for apex, due to binaries being disabled for it.
Installing collected packages: apex
  Created temporary directory: /tmp/pip-record-6bdqijvs
    Running command /data/rsg/chemistry/yangk/conda/envs/cuda10/bin/python -u -c 'import setuptools, tokenize;__file__='"'"'/tmp/pip-req-build-chq189jg/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-6bdqijvs/install-record.txt --single-version-externally-managed --compile
    torch.__version__  =  1.1.0
    /tmp/pip-req-build-chq189jg/setup.py:33: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
      warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")

    Compiling cuda extensions with
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2018 NVIDIA Corporation
    Built on Sat_Aug_25_21:08:01_CDT_2018
    Cuda compilation tools, release 10.0, V10.0.130
    from /usr/local/cuda/bin

    running install
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.6
    creating build/lib.linux-x86_64-3.6/apex
    copying apex/__init__.py -> build/lib.linux-x86_64-3.6/apex
    creating build/lib.linux-x86_64-3.6/apex/contrib
    copying apex/contrib/__init__.py -> build/lib.linux-x86_64-3.6/apex/contrib
    creating build/lib.linux-x86_64-3.6/apex/reparameterization
    copying apex/reparameterization/weight_norm.py -> build/lib.linux-x86_64-3.6/apex/reparameterization
    copying apex/reparameterization/reparameterization.py -> build/lib.linux-x86_64-3.6/apex/reparameterization
    copying apex/reparameterization/__init__.py -> build/lib.linux-x86_64-3.6/apex/reparameterization
    creating build/lib.linux-x86_64-3.6/apex/normalization
    copying apex/normalization/fused_layer_norm.py -> build/lib.linux-x86_64-3.6/apex/normalization
    copying apex/normalization/__init__.py -> build/lib.linux-x86_64-3.6/apex/normalization
    creating build/lib.linux-x86_64-3.6/apex/optimizers
    copying apex/optimizers/fused_lamb.py -> build/lib.linux-x86_64-3.6/apex/optimizers
    copying apex/optimizers/fused_adam.py -> build/lib.linux-x86_64-3.6/apex/optimizers
    copying apex/optimizers/__init__.py -> build/lib.linux-x86_64-3.6/apex/optimizers
    copying apex/optimizers/fp16_optimizer.py -> build/lib.linux-x86_64-3.6/apex/optimizers
    copying apex/optimizers/fused_sgd.py -> build/lib.linux-x86_64-3.6/apex/optimizers
    copying apex/optimizers/fused_novograd.py -> build/lib.linux-x86_64-3.6/apex/optimizers
    creating build/lib.linux-x86_64-3.6/apex/parallel
    copying apex/parallel/sync_batchnorm.py -> build/lib.linux-x86_64-3.6/apex/parallel
    copying apex/parallel/LARC.py -> build/lib.linux-x86_64-3.6/apex/parallel
    copying apex/parallel/optimized_sync_batchnorm.py -> build/lib.linux-x86_64-3.6/apex/parallel
    copying apex/parallel/optimized_sync_batchnorm_kernel.py -> build/lib.linux-x86_64-3.6/apex/parallel
    copying apex/parallel/sync_batchnorm_kernel.py -> build/lib.linux-x86_64-3.6/apex/parallel
    copying apex/parallel/distributed.py -> build/lib.linux-x86_64-3.6/apex/parallel
    copying apex/parallel/multiproc.py -> build/lib.linux-x86_64-3.6/apex/parallel
    copying apex/parallel/__init__.py -> build/lib.linux-x86_64-3.6/apex/parallel
    creating build/lib.linux-x86_64-3.6/apex/multi_tensor_apply
    copying apex/multi_tensor_apply/__init__.py -> build/lib.linux-x86_64-3.6/apex/multi_tensor_apply
    copying apex/multi_tensor_apply/multi_tensor_apply.py -> build/lib.linux-x86_64-3.6/apex/multi_tensor_apply
    creating build/lib.linux-x86_64-3.6/apex/pyprof
    copying apex/pyprof/__init__.py -> build/lib.linux-x86_64-3.6/apex/pyprof
    creating build/lib.linux-x86_64-3.6/apex/fp16_utils
    copying apex/fp16_utils/fp16util.py -> build/lib.linux-x86_64-3.6/apex/fp16_utils
    copying apex/fp16_utils/__init__.py -> build/lib.linux-x86_64-3.6/apex/fp16_utils
    copying apex/fp16_utils/loss_scaler.py -> build/lib.linux-x86_64-3.6/apex/fp16_utils
    copying apex/fp16_utils/fp16_optimizer.py -> build/lib.linux-x86_64-3.6/apex/fp16_utils
    creating build/lib.linux-x86_64-3.6/apex/RNN
    copying apex/RNN/__init__.py -> build/lib.linux-x86_64-3.6/apex/RNN
    copying apex/RNN/models.py -> build/lib.linux-x86_64-3.6/apex/RNN
    copying apex/RNN/RNNBackend.py -> build/lib.linux-x86_64-3.6/apex/RNN
    copying apex/RNN/cells.py -> build/lib.linux-x86_64-3.6/apex/RNN
    creating build/lib.linux-x86_64-3.6/apex/amp
    copying apex/amp/amp.py -> build/lib.linux-x86_64-3.6/apex/amp
    copying apex/amp/handle.py -> build/lib.linux-x86_64-3.6/apex/amp
    copying apex/amp/_process_optimizer.py -> build/lib.linux-x86_64-3.6/apex/amp
    copying apex/amp/_amp_state.py -> build/lib.linux-x86_64-3.6/apex/amp
    copying apex/amp/scaler.py -> build/lib.linux-x86_64-3.6/apex/amp
    copying apex/amp/rnn_compat.py -> build/lib.linux-x86_64-3.6/apex/amp
    copying apex/amp/__init__.py -> build/lib.linux-x86_64-3.6/apex/amp
    copying apex/amp/_initialize.py -> build/lib.linux-x86_64-3.6/apex/amp
    copying apex/amp/__version__.py -> build/lib.linux-x86_64-3.6/apex/amp
    copying apex/amp/opt.py -> build/lib.linux-x86_64-3.6/apex/amp
    copying apex/amp/frontend.py -> build/lib.linux-x86_64-3.6/apex/amp
    copying apex/amp/utils.py -> build/lib.linux-x86_64-3.6/apex/amp
    copying apex/amp/compat.py -> build/lib.linux-x86_64-3.6/apex/amp
    copying apex/amp/wrap.py -> build/lib.linux-x86_64-3.6/apex/amp
    creating build/lib.linux-x86_64-3.6/apex/contrib/groupbn
    copying apex/contrib/groupbn/batch_norm.py -> build/lib.linux-x86_64-3.6/apex/contrib/groupbn
    copying apex/contrib/groupbn/__init__.py -> build/lib.linux-x86_64-3.6/apex/contrib/groupbn
    creating build/lib.linux-x86_64-3.6/apex/contrib/xentropy
    copying apex/contrib/xentropy/__init__.py -> build/lib.linux-x86_64-3.6/apex/contrib/xentropy
    copying apex/contrib/xentropy/softmax_xentropy.py -> build/lib.linux-x86_64-3.6/apex/contrib/xentropy
    creating build/lib.linux-x86_64-3.6/apex/pyprof/nvtx
    copying apex/pyprof/nvtx/nvmarker.py -> build/lib.linux-x86_64-3.6/apex/pyprof/nvtx
    copying apex/pyprof/nvtx/__init__.py -> build/lib.linux-x86_64-3.6/apex/pyprof/nvtx
    creating build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/normalization.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/convert.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/prof.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/activation.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/pooling.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/conv.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/data.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/randomSample.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/__init__.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/linear.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/base.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/recurrentCell.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/misc.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/blas.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/optim.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/dropout.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/utility.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/index_slice_join_mutate.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/pointwise.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/softmax.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/usage.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/embedding.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/__main__.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/output.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/loss.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    copying apex/pyprof/prof/reduction.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
    creating build/lib.linux-x86_64-3.6/apex/pyprof/parse
    copying apex/pyprof/parse/kernel.py -> build/lib.linux-x86_64-3.6/apex/pyprof/parse
    copying apex/pyprof/parse/__init__.py -> build/lib.linux-x86_64-3.6/apex/pyprof/parse
    copying apex/pyprof/parse/parse.py -> build/lib.linux-x86_64-3.6/apex/pyprof/parse
    copying apex/pyprof/parse/db.py -> build/lib.linux-x86_64-3.6/apex/pyprof/parse
    copying apex/pyprof/parse/nvvp.py -> build/lib.linux-x86_64-3.6/apex/pyprof/parse
    copying apex/pyprof/parse/__main__.py -> build/lib.linux-x86_64-3.6/apex/pyprof/parse
    creating build/lib.linux-x86_64-3.6/apex/amp/lists
    copying apex/amp/lists/tensor_overrides.py -> build/lib.linux-x86_64-3.6/apex/amp/lists
    copying apex/amp/lists/functional_overrides.py -> build/lib.linux-x86_64-3.6/apex/amp/lists
    copying apex/amp/lists/torch_overrides.py -> build/lib.linux-x86_64-3.6/apex/amp/lists
    copying apex/amp/lists/__init__.py -> build/lib.linux-x86_64-3.6/apex/amp/lists
    running build_ext
    building 'apex_C' extension
    creating build/temp.linux-x86_64-3.6
    creating build/temp.linux-x86_64-3.6/csrc
    gcc -pthread -B /data/rsg/chemistry/yangk/conda/envs/cuda10/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/data/rsg/chemistry/yangk/conda/envs/cuda10/lib/python3.6/site-packages/torch/include -I/data/rsg/chemistry/yangk/conda/envs/cuda10/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/data/rsg/chemistry/yangk/conda/envs/cuda10/lib/python3.6/site-packages/torch/include/TH -I/data/rsg/chemistry/yangk/conda/envs/cuda10/lib/python3.6/site-packages/torch/include/THC -I/data/rsg/chemistry/yangk/conda/envs/cuda10/include/python3.6m -c csrc/flatten_unflatten.cpp -o build/temp.linux-x86_64-3.6/csrc/flatten_unflatten.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=apex_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
    g++ -pthread -shared -B /data/rsg/chemistry/yangk/conda/envs/cuda10/compiler_compat -L/data/rsg/chemistry/yangk/conda/envs/cuda10/lib -Wl,-rpath=/data/rsg/chemistry/yangk/conda/envs/cuda10/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.6/csrc/flatten_unflatten.o -o build/lib.linux-x86_64-3.6/apex_C.cpython-36m-x86_64-linux-gnu.so
    building 'amp_C' extension
    gcc -pthread -B /data/rsg/chemistry/yangk/conda/envs/cuda10/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/data/rsg/chemistry/yangk/conda/envs/cuda10/lib/python3.6/site-packages/torch/include -I/data/rsg/chemistry/yangk/conda/envs/cuda10/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/data/rsg/chemistry/yangk/conda/envs/cuda10/lib/python3.6/site-packages/torch/include/TH -I/data/rsg/chemistry/yangk/conda/envs/cuda10/lib/python3.6/site-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/local/cuda/lib64/libcudnn.so/include -I/data/rsg/chemistry/yangk/conda/envs/cuda10/include/python3.6m -c csrc/amp_C_frontend.cpp -o build/temp.linux-x86_64-3.6/csrc/amp_C_frontend.o -O3 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=amp_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
    cc1plus: error: /usr/local/cuda/lib64/libcudnn.so/include: Not a directory
    error: command 'gcc' failed with exit status 1
  Running setup.py install for apex ... error
Cleaning up...
  Removing source in /tmp/pip-req-build-chq189jg
Removed build tracker '/tmp/pip-req-tracker-gdqvsxif'
ERROR: Command "/data/rsg/chemistry/yangk/conda/envs/cuda10/bin/python -u -c 'import setuptools, tokenize;__file__='"'"'/tmp/pip-req-build-chq189jg/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-6bdqijvs/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-req-build-chq189jg/
Exception information:
Traceback (most recent call last):
  File "/data/rsg/chemistry/yangk/conda/envs/cuda10/lib/python3.6/site-packages/pip/_internal/cli/base_command.py", line 178, in main
    status = self.run(options, args)
  File "/data/rsg/chemistry/yangk/conda/envs/cuda10/lib/python3.6/site-packages/pip/_internal/commands/install.py", line 414, in run
    use_user_site=options.use_user_site,
  File "/data/rsg/chemistry/yangk/conda/envs/cuda10/lib/python3.6/site-packages/pip/_internal/req/__init__.py", line 58, in install_given_reqs
    **kwargs
  File "/data/rsg/chemistry/yangk/conda/envs/cuda10/lib/python3.6/site-packages/pip/_internal/req/req_install.py", line 953, in install
    spinner=spinner,
  File "/data/rsg/chemistry/yangk/conda/envs/cuda10/lib/python3.6/site-packages/pip/_internal/utils/misc.py", line 776, in call_subprocess
    % (command_desc, proc.returncode, cwd))
pip._internal.exceptions.InstallationError: Command "/data/rsg/chemistry/yangk/conda/envs/cuda10/bin/python -u -c 'import setuptools, tokenize;__file__='"'"'/tmp/pip-req-build-chq189jg/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-6bdqijvs/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-req-build-chq189jg/
1 location(s) to search for versions of pip:
* https://pypi.org/simple/pip/
Getting page https://pypi.org/simple/pip/
Starting new HTTPS connection (1): pypi.org:443
https://pypi.org:443 "GET /simple/pip/ HTTP/1.1" 200 11972
Analyzing links from page https://pypi.org/simple/pip/

This is followed by many messages of the form:

Found link https://files.pythonhosted.org/packages/3d/9d/1e313763bdfb6a48977b65829c6ce2a43eaae29ea2f907c8bbef024a7219/pip-0.2.tar.gz#sha256=88bb8d029e1bf4acd0e04d300104b7440086f94cc1ce1c5c3c31e3293aee1f81 (from https://pypi.org/simple/pip/), version: 0.2

Forrest-ht commented 4 years ago

@ptrblck How can I solve the issue above? I ran into the same problem. Thank you.

mcarilli commented 4 years ago

Looks like PyTorch can't find the cuDNN runtime library (libcudnn.so). You can either install cuDNN or use a Docker container to provide a pre-built environment, as explained here.
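One reading of the log above: the failing flag `-I/usr/local/cuda/lib64/libcudnn.so/include` looks like `include` appended to a path that is the `libcudnn.so` file itself. A possible cause, which is an assumption and not confirmed in this thread, is an environment variable such as `CUDNN_HOME` or `CUDNN_PATH` pointing at the library file instead of the cuDNN root directory. A minimal sketch for checking this (the helper name `check_cudnn_env` is hypothetical):

```python
import os


def check_cudnn_env(env=None, names=("CUDNN_HOME", "CUDNN_PATH")):
    """Return (name, value, problem) tuples for suspicious cuDNN variables.

    Assumption: the build appends "include" to the value of these variables,
    so each one should name a directory that contains an include/ folder.
    """
    env = os.environ if env is None else env
    problems = []
    for name in names:
        value = env.get(name)
        if not value:
            continue  # an unset variable cannot produce a bad include path
        if not os.path.isdir(value):
            problems.append((name, value, "not a directory"))
        elif not os.path.isdir(os.path.join(value, "include")):
            problems.append((name, value, "no include/ subdirectory"))
    return problems


if __name__ == "__main__":
    # Prints nothing when no suspicious variable is found.
    for name, value, problem in check_cudnn_env():
        print(f"{name}={value}: {problem}")
```

If a variable is reported, unsetting it or pointing it at the directory that holds `include/` and `lib64/` before re-running the install command may be worth trying.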

tmbdev commented 4 years ago

I'm having the same problem. The cudnn runtime libraries are installed as part of Anaconda, and PyTorch itself is working fine. There may be a problem with the APEX installer looking in the wrong place for those libraries.

More importantly, though, the setup.py file starts out with:

from pip._internal import main as pipmain

Reaching into the internals of pip may not be a good idea since that's version dependent, and this may be related to the error messages mentioning "pip".
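For illustration, pip's own user guide recommends running pip as a subprocess of the current interpreter instead of importing `pip._internal`. A minimal sketch of that approach follows; the helper names `pip_command` and `run_pip` are hypothetical, and this is not what Apex's setup.py does:

```python
import subprocess
import sys


def pip_command(*args):
    """Build a pip invocation bound to the interpreter running this script."""
    return [sys.executable, "-m", "pip", *args]


def run_pip(*args):
    """Run pip as a child process; raises CalledProcessError if pip fails."""
    subprocess.check_call(pip_command(*args))


# Example (not executed here): run_pip("install", "some-package")
# "some-package" is a placeholder name, not a real requirement of Apex.
```

Going through `python -m pip` uses only pip's command-line interface, so it does not depend on the layout of pip's internal modules.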

Either way, APEX is currently not particularly useful with Anaconda because the C++ extensions aren't installable.

lxtGH commented 4 years ago

Make sure your CUDA runtime version matches the CUDA version your PyTorch build was compiled with.
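A minimal sketch of that check, assuming the toolkit release is read from the output of `nvcc --version` and the PyTorch build version from `torch.version.cuda`. The helper names are hypothetical, and the sample text is the nvcc banner from the log earlier in this thread:

```python
import re


def nvcc_release(nvcc_output):
    """Extract 'major.minor' from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def cuda_versions_match(nvcc_output, torch_cuda):
    """Compare major.minor of the toolkit with PyTorch's build-time CUDA version."""
    release = nvcc_release(nvcc_output)
    if release is None or not torch_cuda:
        return False
    return release == ".".join(torch_cuda.split(".")[:2])


SAMPLE = "Cuda compilation tools, release 10.0, V10.0.130"

# With PyTorch installed, the second argument would be torch.version.cuda.
print(cuda_versions_match(SAMPLE, "10.0.130"))
```

Only major and minor versions are compared here; whether patch-level differences matter is not addressed in this thread.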

croros commented 4 years ago

Take a look at #368