Closed zsxkib closed 1 year ago
Btw I get the same error when I run:
python gen_video.py --output=lerp.mp4 --trunc=1 --seeds=0-31 --grid=4x2 \
--network=https://api.ngc.nvidia.com/v2/models/nvidia/research/stylegan3/versions/1/files/stylegan3-r-afhqv2-512x512.pkl
I managed to fix the error
TLDR: Change line 332 of gen_videos.py
FROM device = torch.device('cuda:1')
TO device = torch.device('cuda:0')
If you're running deep learning models on Google Colab and encounter a CUDA error stating "invalid device ordinal", this article provides an easy fix. The problem surfaced while generating videos using a pre-trained model. The root of the issue lies in the following line of code:
device = torch.device('cuda:1')
The resulting error log was as follows:
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
This error arises when the script requests a non-existent CUDA device. In this instance, the CUDA device 'cuda:1' is specified, which usually points to the second GPU device. However, Google Colab typically allocates a single GPU, so when the script attempts to access 'cuda:1' (i.e., a second GPU), it encounters an error due to the absence of the device.
The solution requires us to adjust the CUDA device reference to target the correct, available GPU. As Google Colab typically offers only one GPU, the device should be set to 'cuda:0' instead of 'cuda:1'. 'cuda:0' corresponds to the first (and usually the only) GPU available on Google Colab. Therefore, change line 332 of gen_videos.py
from:
device = torch.device('cuda:1')
to:
device = torch.device('cuda:0')
This straightforward adjustment directs the script to the available GPU, effectively resolving the "CUDA error: invalid device ordinal". It's crucial to correctly assign the 'cuda:n' identifier to match the intended GPU ordinal, especially when multiple GPUs are accessible.
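As a more general alternative to hard-coding the ordinal, the script could check how many GPUs are visible before choosing one. The sketch below is only an illustration: `resolve_device` is a hypothetical helper, not part of gen_videos.py, and the CPU fallback is an assumption that may not work with this repo's custom CUDA ops.

```python
def resolve_device(requested: int, available: int) -> str:
    """Return a device string that is valid for `available` visible GPUs.

    Hypothetical helper for illustration; not part of gen_videos.py.
    """
    if available <= 0:
        # No GPU visible at all: fall back to CPU (assumption, see note above).
        return "cpu"
    if 0 <= requested < available:
        # The requested ordinal exists, so use it as-is.
        return f"cuda:{requested}"
    # The requested ordinal is out of range (e.g. cuda:1 on a one-GPU Colab).
    return "cuda:0"


try:
    import torch
    available = torch.cuda.device_count()  # number of GPUs PyTorch can see
except ImportError:
    available = 0  # PyTorch not installed; treat as "no GPU"

device_name = resolve_device(requested=1, available=available)
print(device_name)
# In the script this would then become: device = torch.device(device_name)
```

On a single-GPU Colab runtime, `torch.cuda.device_count()` returns 1, so a request for ordinal 1 is redirected to 'cuda:0' instead of raising "invalid device ordinal".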
After making this change and rerunning the script, the model successfully generated videos using the pre-trained model without any CUDA errors.
In conclusion, correctly specifying the CUDA device is essential when running deep learning models to ensure that computations occur on the intended GPU. This small tweak should effectively resolve the issue in most Google Colab setups.
TLDR: Change line 332 of gen_videos.py
from device = torch.device('cuda:1')
to device = torch.device('cuda:0')
to resolve the "CUDA error: invalid device ordinal".
@zsxkib I also changed to device = torch.device('cuda:0') everywhere in the code, but new errors appeared. Have you encountered anything like this after changing to cuda:0?
Setting up PyTorch plugin "bias_act_plugin"... Failed!
Traceback (most recent call last):
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1740, in _run_ninja_build
    subprocess.run(
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/media/yulduz/hdd/Projects/PanoHead/gen_videos.py", line 370, in <module>
    generate_images() # pylint: disable=no-value-for-parameter
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/media/yulduz/hdd/Projects/PanoHead/gen_videos.py", line 359, in generate_images
    gen_interp_video(G=G, mp4=output, pose_cond = pose_cond, bitrate='100M', grid_dims=grid, num_keyframes=num_keyframes, w_frames=w_frames, seeds=seeds, shuffle_seed=shuffle_seed, psi=truncation_psi, truncation_cutoff=truncation_cutoff, cfg=cfg, image_mode=image_mode, gen_shapes=shapes, device=device)
  File "/media/yulduz/hdd/Projects/PanoHead/gen_videos.py", line 94, in gen_interp_video
    ws = G.mapping(z=zs, c=c, truncation_psi=psi, truncation_cutoff=truncation_cutoff)
  File "/media/yulduz/hdd/Projects/PanoHead/training/triplane.py", line 56, in mapping
    return self.backbone.mapping(z, c * self.rendering_kwargs.get('c_scale', 0), truncation_psi=truncation_psi, truncation_cutoff=truncation_cutoff, update_emas=update_emas)
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/yulduz/hdd/Projects/PanoHead/training/networks_stylegan2.py", line 252, in forward
    x = layer(x)
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/yulduz/hdd/Projects/PanoHead/training/networks_stylegan2.py", line 126, in forward
    x = bias_act.bias_act(x, b, act=self.activation)
  File "/media/yulduz/hdd/Projects/PanoHead/torch_utils/ops/bias_act.py", line 86, in bias_act
    if impl == 'cuda' and x.device.type == 'cuda' and _init():
  File "/media/yulduz/hdd/Projects/PanoHead/torch_utils/ops/bias_act.py", line 43, in _init
    _plugin = custom_ops.get_plugin(
  File "/media/yulduz/hdd/Projects/PanoHead/torch_utils/custom_ops.py", line 138, in get_plugin
    torch.utils.cpp_extension.load(name=module_name, build_directory=cached_build_dir,
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1144, in load
    return _jit_compile(
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1357, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1469, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1756, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'bias_act_plugin': [1/2] /usr/lib/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/include -isystem /home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/include/TH -isystem /home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/include/THC -isystem /usr/lib/cuda/include -isystem /home/yulduz/anaconda3/envs/panohead/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' --use_fast_math -std=c++14 -c /home/yulduz/.cache/torch_extensions/py39_cu111/bias_act_plugin/b46266ff65f9fa53c32108953a1c6f16-nvidia-geforce-rtx-3080/bias_act.cu -o bias_act.cuda.o
FAILED: bias_act.cuda.o
/usr/lib/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/include -isystem /home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/include/TH -isystem /home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/include/THC -isystem /usr/lib/cuda/include -isystem /home/yulduz/anaconda3/envs/panohead/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' --use_fast_math -std=c++14 -c /home/yulduz/.cache/torch_extensions/py39_cu111/bias_act_plugin/b46266ff65f9fa53c32108953a1c6f16-nvidia-geforce-rtx-3080/bias_act.cu -o bias_act.cuda.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
  435 | function(_Functor&& __f)
      | ^
/usr/include/c++/11/bits/std_function.h:435:145: note: ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
  530 | operator=(_Functor&& __f)
      | ^
/usr/include/c++/11/bits/std_function.h:530:146: note: ‘_ArgTypes’
ninja: build stopped: subcommand failed.
No, I have never seen this error before @Zvyozdo4ka
You could try @camenduru's https://github.com/camenduru/PanoHead-colab
@zsxkib Thank you so much for the Colab version. I'll try it.
I'm on Google Colab:
This all works. But when I run the following command I get a weird CUDA error. This is running on one T4 GPU:
P.S. I have all the files required