NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License
8.42k stars 1.4k forks

Cannot install apex on the machine of CUDA 12.2 #1761

Open momo1986 opened 11 months ago

momo1986 commented 11 months ago

Describe the Bug

Minimal Steps/Code to Reproduce the Bug

Running script: "python setup.py install --cpp_ext --cuda_ext"

The reporting log:

torch.__version__ = 2.1.2+cu121

Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
from /usr/bin

Traceback (most recent call last):
  File "/home/hwq/ray/adversarial_examples/apex/setup.py", line 178, in <module>
    check_cuda_torch_binary_vs_bare_metal(CUDA_HOME)
  File "/home/hwq/ray/adversarial_examples/apex/setup.py", line 40, in check_cuda_torch_binary_vs_bare_metal
    raise RuntimeError(
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 12.1.
In some cases, a minor-version mismatch will not cause later errors: https://github.com/NVIDIA/apex/pull/323#discussion_r287021798. You can try commenting out this check (at your own risk).

CUDA Version is 12.2.

Expected Behavior

Install apex successfully

Environment

uname -a
Linux ps 6.2.0-36-generic #37~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 9 15:34:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

nvidia-smi
Fri Dec 22 00:15:43 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
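
A note on the versions in this report: the "CUDA Version: 12.2" printed by nvidia-smi is the highest CUDA version the driver supports, while the check in setup.py compares the toolkit that nvcc reports (release 11.5, found in /usr/bin) against the CUDA version PyTorch was built with (12.1). A minimal sketch of pulling those two numbers out for comparison follows. The helper names parse_nvcc_release and parse_torch_cuda are hypothetical and not part of apex.

```python
import re


def parse_nvcc_release(nvcc_output: str) -> tuple:
    """Extract (major, minor) from the text printed by `nvcc -V`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def parse_torch_cuda(version: str) -> tuple:
    """Turn a string such as '12.1' (the format of torch.version.cuda) into (12, 1)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


# Banner line copied from the log in this report.
NVCC_LOG = "Cuda compilation tools, release 11.5, V11.5.119"

toolkit = parse_nvcc_release(NVCC_LOG)
torch_cuda = parse_torch_cuda("12.1")  # value taken from the error message
print("nvcc toolkit:", toolkit, "PyTorch built with:", torch_cuda)
```

On a real machine the first argument would come from running nvcc -V for the CUDA toolkit that the build picks up, and the second from torch.version.cuda.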

foreverpiano commented 9 months ago

same issue

caseclose commented 9 months ago

Similar issue:

My GPU version is also CUDA 12.2. Installing apex directly results in the same error as mentioned above.

Then I switched to a conda virtual environment with CUDA version 11.3 and a matching Torch build (PyTorch 1.10). After that, apex installs successfully with:

pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

However, when running the code, an error occurs:

Traceback (most recent call last):
  File ".../VALOR/./train.py", line 88, in <module>
    main(args)
  File ".../VALOR/./train.py", line 55, in main
    model = VALOR.from_pretrained(opts,checkpoint)
  File ".../VALOR/model/modeling.py", line 109, in from_pretrained
    model = cls(opts, *inputs, **kwargs)
  File ".../VALOR/model/pretrain.py", line 67, in __init__
    super().__init__(opts)
  File ".../VALOR/model/modeling.py", line 328, in __init__
    self.load_ast_model(base_cfg,config)
  File ".../VALOR/model/modeling.py", line 609, in load_ast_model
    self.audio_encoder = TransformerEncoder(model_cfg_audio, mode='prenorm')
  File ".../VALOR/model/transformer.py", line 149, in __init__
    layer = TransformerLayer(config, mode)
  File ".../VALOR/model/transformer.py", line 62, in __init__
    self.layernorm1 = LayerNorm(config.hidden_size, eps=1e-12)
  File ".../anaconda3/envs/valor1/lib/python3.9/site-packages/apex/normalization/fused_layer_norm.py", line 268, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File ".../anaconda3/envs/valor1/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'fused_layer_norm_cuda'

The README says the '--cuda_ext' option installs fused_layer_norm_cuda, but that doesn't work.
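
The traceback shows apex importing fused_layer_norm_cuda as a top-level module at runtime, so one quick way to tell whether the CUDA extension was actually built and installed is to ask the import system for it before launching training. A minimal sketch using only the standard library follows. missing_extensions is a hypothetical helper, and the module name is the one from the traceback above.

```python
import importlib.util


def missing_extensions(module_names):
    """Return the names that the import system cannot locate."""
    return [
        name
        for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Module name taken from the ModuleNotFoundError in the traceback.
missing = missing_extensions(["fused_layer_norm_cuda"])
if missing:
    print("not importable:", ", ".join(missing))
else:
    print("all requested extensions are importable")
```

Run it inside the same environment as the training script. A non-empty result suggests the install finished without producing the compiled extension.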

Tsuki0125 commented 8 months ago

same issue:

  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'fused_layer_norm_cuda'

Zhangwq76 commented 6 months ago

I think you can remove the version check code in setup.py, then install with:

pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

adafok commented 6 months ago

I've encountered the same issue. @Zhangwq76 could you tell us which part of the check code we should remove?

Zhangwq76 commented 6 months ago

line 39, in check_cuda_torch_binary_vs_bare_metal. Comment out the "if" line together with the raise, since an "if" left with only comments as its body is a syntax error:

# if (bare_metal_version != torch_binary_version):
#     raise RuntimeError(
#         "Cuda extensions are being compiled with a version of Cuda that does "
#         "not match the version used to compile Pytorch binaries.  "
#         "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda)
#         + "In some cases, a minor-version mismatch will not cause later errors:  "
#         "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  "
#         "You can try commenting out this check (at your own risk)."
#     )
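
The error message only suggests that a minor-version mismatch may be harmless. In the original report nvcc is 11.5 while PyTorch was built with 12.1, so the major versions differ. As an alternative to deleting the check outright, here is a sketch of a softer policy that separates the two cases. classify_cuda_mismatch is a hypothetical helper, not code from apex's setup.py.

```python
def classify_cuda_mismatch(bare_metal: str, torch_binary: str) -> str:
    """Compare two 'major.minor' CUDA version strings.

    Returns 'match', 'minor-mismatch' (same major, different minor),
    or 'major-mismatch'.
    """
    bm_major, bm_minor = (int(part) for part in bare_metal.split(".")[:2])
    tb_major, tb_minor = (int(part) for part in torch_binary.split(".")[:2])
    if (bm_major, bm_minor) == (tb_major, tb_minor):
        return "match"
    if bm_major == tb_major:
        return "minor-mismatch"
    return "major-mismatch"


# Versions from the original report: nvcc 11.5, PyTorch built with 12.1.
status = classify_cuda_mismatch("11.5", "12.1")
if status == "major-mismatch":
    print("major CUDA versions differ; skipping the check is risky")
elif status == "minor-mismatch":
    print("minor CUDA versions differ; proceeding at your own risk")
```

A build script could keep raising on a major mismatch and only warn on a minor one, which stays closer to what the error message describes.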

yachty66 commented 4 months ago

+1

googio commented 3 months ago

I've got a quick fix for this at https://github.com/googio/apex, based on @Zhangwq76's solution:

pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/googio/apex

AEProgrammer commented 2 months ago

I've hit the same issue. Did you solve it? I use CUDA 12.2 with torch 2.1. I modified the version check code in setup.py and ran pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ to install apex. The install succeeded, but when I use Megatron I get an error saying 'fused_layer_norm_cuda' is not found.