NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

bugs after apex installation #187

Open yinwenpeng opened 5 years ago

yinwenpeng commented 5 years ago

When I tried the "Quick Start" command:

pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .

my program shows this error:

ImportError: ......./miniconda3/lib/python3.6/site-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: __cudaPopCallConfiguration

whereas if I try "pip install -v --no-cache-dir .", the error becomes:

ModuleNotFoundError: No module named 'fused_layer_norm_cuda'

I am using PyTorch 1.0 with CUDA 9.2. I have no idea what's wrong here. Thanks.

mcarilli commented 5 years ago

Are you installing within a container, or on bare metal? Either way, this could be due to a lingering previous install on your system.

It might be worth trying a clean uninstall

pip uninstall apex; 
pip uninstall apex; # (repeat until it says Skipping apex as it is not installed, 
                    # because if you also installed using the old `python setup.py install`, 
                    # you may also have the old files installed at a different location)

then

cd apex_repo;
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
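
To check whether a stale copy is still visible to the interpreter after uninstalling, one option is to list every installed distribution named apex. This is only a sketch and not part of Apex: `find_installs` is a hypothetical helper, and it only sees locations on the current interpreter's `sys.path`.

```python
import importlib.metadata


def find_installs(name):
    """Return the install location of every distribution called `name`."""
    wanted = name.lower().replace("_", "-")
    locations = []
    for dist in importlib.metadata.distributions():
        try:
            dist_name = dist.metadata.get("Name")
        except Exception:
            # Skip distributions whose metadata cannot be read.
            continue
        if dist_name and dist_name.lower().replace("_", "-") == wanted:
            locations.append(str(dist.locate_file("")))
    return locations


if __name__ == "__main__":
    hits = find_installs("apex")
    if hits:
        for location in hits:
            print("apex still installed under:", location)
    else:
        print("no apex distribution found on sys.path")
```

More than one location in the output would suggest that an older install is still present alongside the new one.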

yinwenpeng commented 5 years ago

Thanks. It was installed without a container, I think. I tried the uninstall multiple times, but it did not help. The problem still exists.

mcarilli commented 5 years ago

@thorjohnsen Have you seen this error before?

mcarilli commented 5 years ago

One random possibility that occurs to me is that you are somehow compiling with a version of nvcc that is different from the cuda runtime library that the application is attempting to load when you execute it. Can you give me the results of these three commands:

nvcc --version
which nvcc
echo $LD_LIBRARY_PATH

Also, add print(torch.utils.cpp_extension.CUDA_HOME) here https://github.com/NVIDIA/apex/blob/master/setup.py#L38 then run the install and see what it prints, so we know where the install script itself is looking to find nvcc.
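
As a complement to the commands above, the following sketch lists the CUDA runtime libraries that sit in the directories named by LD_LIBRARY_PATH, so they can be compared against the nvcc release. `find_cudart` is a hypothetical helper. It assumes the runtime library is named `libcudart.so*`, and it does not account for other places the dynamic loader may search (rpath entries, the system loader cache, or libraries bundled with a PyTorch binary).

```python
import glob
import os


def find_cudart(ld_library_path):
    """List libcudart shared objects in each directory of an LD_LIBRARY_PATH string."""
    hits = []
    for directory in ld_library_path.split(":"):
        if not directory:
            # Empty entries come from leading or trailing colons.
            continue
        hits.extend(sorted(glob.glob(os.path.join(directory, "libcudart.so*"))))
    return hits


if __name__ == "__main__":
    for path in find_cudart(os.environ.get("LD_LIBRARY_PATH", "")):
        # realpath resolves symlinks, which usually reveals the full version.
        print(path, "->", os.path.realpath(path))
```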

yinwenpeng commented 5 years ago

Thanks. It shows as follows:

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Tue_Jun_12_23:07:04_CDT_2018
Cuda compilation tools, release 9.2, V9.2.148

which nvcc

/usr/local/cuda/bin/nvcc

echo $LD_LIBRARY_PATH

:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cudnn/lib64

add print(torch.utils.cpp_extension.CUDA_HOME) here https://github.com/NVIDIA/apex/blob/master/setup.py#L38

/usr/local/cuda

So, what's wrong there? thanks

mcarilli commented 5 years ago

That looks reasonable... My suspicion is that the version of PyTorch installed on your system does not match the version of CUDA installed on your system. Can you also print torch.version.cuda right next to print(torch.utils.cpp_extension.CUDA_HOME) at line 38 of setup.py?

Other people have had similar issues with extensions:
https://github.com/jwyang/faster-rcnn.pytorch/issues/190
https://github.com/open-mmlab/mmdetection/issues/66#issuecomment-434165962

This one looks like the most helpful:
https://github.com/rusty1s/pytorch_scatter/issues/19
https://github.com/rusty1s/pytorch_scatter/issues/19#issuecomment-449735614

Maybe you can fix the issue by uninstalling pytorch (run pip uninstall torch repeatedly until it says torch is not installed), uninstalling apex (run pip uninstall apex repeatedly until it says it's not installed), then either rebuilding Pytorch from source, or conda installing again and making sure it matches the version of Cuda you have on bare metal. Afterwards, reinstall Apex and see if it works. Sorry for the annoyance but like I said, this seems to be an issue other people have had, and it does not seem like an issue with Apex in particular.
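
The version comparison described above could be scripted roughly as follows. This is a sketch under assumptions: `parse_nvcc_release`, `versions_match`, and `report` are hypothetical helpers, and comparing only major.minor is a simplification of "matching". It reads `torch.version.cuda`, which is `None` on CPU-only builds.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract the 'major.minor' release from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def versions_match(nvcc_release, torch_cuda):
    """True only when both versions are known and share the same major.minor."""
    if not nvcc_release or not torch_cuda:
        return False
    return nvcc_release.split(".")[:2] == torch_cuda.split(".")[:2]


def report():
    """Print the nvcc release, torch.version.cuda, and whether they agree."""
    nvcc_path = shutil.which("nvcc")
    nvcc_release = None
    if nvcc_path:
        result = subprocess.run(
            [nvcc_path, "--version"], capture_output=True, text=True
        )
        nvcc_release = parse_nvcc_release(result.stdout)
    try:
        import torch
        torch_cuda = torch.version.cuda
    except ImportError:
        torch_cuda = None
    print("nvcc on PATH:      ", nvcc_path, nvcc_release)
    print("torch.version.cuda:", torch_cuda)
    print("match:             ", versions_match(nvcc_release, torch_cuda))


if __name__ == "__main__":
    report()
```

Running it before reinstalling Apex gives a quick indication of whether the compiler and the PyTorch build target the same CUDA release.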

sullyinc commented 3 months ago

Encountered this too, even after reinstalling cuda with matching nvcc and torch.version.cuda versions. Given that pip --version on my machine was version 23.0.1, I was using the pip command listed in the README for pip < 23.1:

# if pip >= 23.1 [...]
[...]
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./

However, this would strangely result in a python-only build, without compiling the C sources. The install would still succeed and display

Successfully built apex
Installing collected packages: apex

However, attempting to import and use it in running application code would give the error:

ModuleNotFoundError: No module named 'fused_layer_norm_cuda'
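
Because the install reports success either way, a quick post-install check can show whether the compiled extension is present at all. The sketch below uses the module name from the error message; `missing_extensions` is a hypothetical helper. It only checks that the module can be located, so it would not catch a load-time failure such as an undefined symbol.

```python
import importlib.util


def missing_extensions(names):
    """Return the top-level module names that cannot be located on sys.path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_extensions(["fused_layer_norm_cuda"])
    if missing:
        print("Python-only build? Missing compiled modules:", ", ".join(missing))
    else:
        print("Compiled extension modules were found.")
```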

Inspecting the pip install more closely, this warning appeared near the top:

WARNING: Implying --no-binary=:all: due to the presence of --build-option / --global-option / --install-option. Consider using --config-settings for more flexibility.

After trial and error, I tried the other install command from the README meant for pip >= 23.1, and that worked. Both python and C sources were compiled and importable/usable from running application code.

# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key... 
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

I haven't dug into why there are separate install instructions for pip < 23.1 and pip >= 23.1, but that might need another look. Let me know if I can provide other info to help.
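
For reference, the README's split between the two commands quoted above can be sketched as a small selector. `apex_install_command` is a hypothetical helper, and the 23.1 threshold is taken from the README comment only. As reported above, the older form did not build the extensions on pip 23.0.1, so the threshold may not be reliable in practice.

```python
import importlib.metadata
import re
import shlex

COMMON = [
    "pip", "install", "-v", "--disable-pip-version-check",
    "--no-cache-dir", "--no-build-isolation",
]


def apex_install_command(pip_version):
    """Return the README install command (as an argument list) for a pip version."""
    match = re.match(r"(\d+)\.(\d+)", pip_version)
    if match is None:
        raise ValueError(f"unrecognized pip version: {pip_version!r}")
    if (int(match.group(1)), int(match.group(2))) >= (23, 1):
        options = [
            "--config-settings", "--build-option=--cpp_ext",
            "--config-settings", "--build-option=--cuda_ext",
        ]
    else:
        options = ["--global-option=--cpp_ext", "--global-option=--cuda_ext"]
    return COMMON + options + ["./"]


if __name__ == "__main__":
    try:
        version = importlib.metadata.version("pip")
    except importlib.metadata.PackageNotFoundError:
        version = None
    if version is None:
        print("pip is not installed for this interpreter")
    else:
        print("pip", version)
        print(shlex.join(apex_install_command(version)))
```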