momo1986 opened this issue 11 months ago
same issue
Similar issue:
My CUDA version is also 12.2, and installing apex directly results in the same error as mentioned above.
Then I switched to a conda virtual environment with CUDA 11.3 and PyTorch 1.10 (built for CUDA 11.3). After that,
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
installs apex successfully. However, when running the code, an error occurs:
Traceback (most recent call last):
File ".../VALOR/./train.py", line 88, in <module>
main(args)
File ".../VALOR/./train.py", line 55, in main
model = VALOR.from_pretrained(opts,checkpoint)
File ".../VALOR/model/modeling.py", line 109, in from_pretrained
model = cls(opts, *inputs, **kwargs)
File ".../VALOR/model/pretrain.py", line 67, in __init__
super().__init__(opts)
File ".../VALOR/model/modeling.py", line 328, in __init__
self.load_ast_model(base_cfg,config)
File ".../VALOR/model/modeling.py", line 609, in load_ast_model
self.audio_encoder = TransformerEncoder(model_cfg_audio, mode='prenorm')
File ".../VALOR/model/transformer.py", line 149, in __init__
layer = TransformerLayer(config, mode)
File ".../VALOR/model/transformer.py", line 62, in __init__
self.layernorm1 = LayerNorm(config.hidden_size, eps=1e-12)
File ".../anaconda3/envs/valor1/lib/python3.9/site-packages/apex/normalization/fused_layer_norm.py", line 268, in __init__
fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
File ".../anaconda3/envs/valor1/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'fused_layer_norm_cuda'
The README shows that we can use the '--cuda_ext' option to install fused_layer_norm_cuda, but that doesn't work.
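One way to confirm whether the CUDA extensions actually made it into the install is to try importing them directly. Below is a minimal diagnostic sketch (not part of apex; amp_C is another extension that --cuda_ext builds, and the script should be run from outside the cloned apex directory so the source checkout does not shadow the installed package):

```python
# Minimal diagnostic sketch: check whether apex's compiled CUDA extensions
# are importable. "fused_layer_norm_cuda" comes from the traceback above;
# "amp_C" is another module built by --cuda_ext. Run this outside the apex
# source tree so the pure-Python checkout does not shadow the install.
import importlib

for name in ("apex", "fused_layer_norm_cuda", "amp_C"):
    try:
        importlib.import_module(name)
        print(f"{name}: OK")
    except ImportError as err:
        print(f"{name}: MISSING ({err})")
```

If apex imports fine but fused_layer_norm_cuda is missing, the build fell back to the Python-only package and the --cuda_ext option never took effect.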
Same issue.
I think you can remove the check code in setup.py, then use
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
I've encountered the same issue. @Zhangwq76, could you tell us which part of the check code we should remove?
In setup.py, line 39, in check_cuda_torch_binary_vs_bare_metal, comment out the raise:
# raise RuntimeError(
# "Cuda extensions are being compiled with a version of Cuda that does "
# "not match the version used to compile Pytorch binaries. "
# "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda)
# + "In some cases, a minor-version mismatch will not cause later errors: "
# "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798. "
# "You can try commenting out this check (at your own risk)."
# )
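For anyone wary of deleting lines blindly, a rough sketch of that check with the hard failure downgraded to a warning is shown below. The get_cuda_bare_metal_version helper and the parse import follow recent apex revisions of setup.py; adjust the names to whatever your copy actually defines.

```python
# Illustrative sketch only, not the exact upstream code: keep the comparison
# between nvcc's CUDA version and the CUDA version PyTorch was built with,
# but warn instead of raising on a mismatch.
import warnings

import torch
from packaging.version import parse


def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
    _, bare_metal_version = get_cuda_bare_metal_version(cuda_dir)  # helper defined in apex's setup.py
    torch_binary_version = parse(torch.version.cuda)
    if (bare_metal_version.major, bare_metal_version.minor) != (
        torch_binary_version.major,
        torch_binary_version.minor,
    ):
        warnings.warn(
            f"nvcc reports CUDA {bare_metal_version}, but PyTorch was built with "
            f"CUDA {torch.version.cuda}; continuing anyway. A minor-version "
            "mismatch is usually harmless (see the linked PR discussion)."
        )
```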
+1
I've got a quick fix for this at https://github.com/googio/apex, based on @Zhangwq76's solution:
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/googio/apex
Same issue:
File "<frozen importlib._bootstrap>", line 994, in _gcd_import
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'fused_layer_norm_cuda'
I met the same issue, did you solve it? I use CUDA 12.2 with torch 2.1. I modified the version check code in setup.py and installed apex with
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
The install succeeded, but when I use Megatron I get an error like "No module named 'fused_layer_norm_cuda'".
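One thing worth checking in that situation (a hypothetical diagnostic, not something apex or Megatron ships): make sure the apex you import at runtime is the installed copy in site-packages and not the cloned source directory, because a checkout on the import path shadows the compiled extensions.

```python
# Hypothetical check: show which apex is actually loaded. If the path points
# into the cloned apex source tree rather than site-packages, the pure-Python
# checkout is shadowing the compiled install, and extension modules such as
# fused_layer_norm_cuda will not be found even though the build succeeded.
import apex
print(apex.__file__)
```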
Describe the Bug

Minimal Steps/Code to Reproduce the Bug
Running script: "python setup.py install --cpp_ext --cuda_ext"
The reported log:
"torch.__version__ = 2.1.2+cu121
Compiling cuda extensions with nvcc:
NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
from /usr/bin
Traceback (most recent call last):
File "/home/hwq/ray/adversarial_examples/apex/setup.py", line 178, in <module>
check_cuda_torch_binary_vs_bare_metal(CUDA_HOME)
File "/home/hwq/ray/adversarial_examples/apex/setup.py", line 40, in check_cuda_torch_binary_vs_bare_metal
raise RuntimeError(
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 12.1.
In some cases, a minor-version mismatch will not cause later errors: https://github.com/NVIDIA/apex/pull/323#discussion_r287021798. You can try commenting out this check (at your own risk)."
CUDA Version is 12.2.
Expected Behavior
Install apex successfully.

Environment
uname -a: Linux ps 6.2.0-36-generic #37~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 9 15:34:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
nvidia-smi (Fri Dec 22 00:15:43 2023): NVIDIA-SMI 535.129.03, Driver Version: 535.129.03, CUDA Version: 12.2
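Given the log above (nvcc 11.5 picked up from /usr/bin while torch is 2.1.2+cu121 and the driver reports CUDA 12.2), the build is apparently finding the distro's nvcc rather than a CUDA 12.x toolkit. A small pre-build sanity check, assuming a 12.x toolkit is installed somewhere like /usr/local/cuda-12.1 (example path only):

```python
# Sketch of a pre-build sanity check: print the CUDA toolkit that PyTorch's
# extension builder will use. If it resolves to the 11.5 toolkit behind
# /usr/bin/nvcc, export CUDA_HOME=/usr/local/cuda-12.1 (example path) in the
# shell before running the pip install command so the toolkit matches
# torch 2.1.2+cu121, and the version check in setup.py should then pass.
import torch
from torch.utils.cpp_extension import CUDA_HOME

print("torch built with CUDA:", torch.version.cuda)  # expected: 12.1
print("CUDA toolkit the build will use:", CUDA_HOME)
```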