SizheAn / PanoHead

Code Repository for CVPR 2023 Paper "PanoHead: Geometry-Aware 3D Full-Head Synthesis in 360 degree"
Creative Commons Zero v1.0 Universal

How Do I Get This to Run on Colab? #8

Closed zsxkib closed 1 year ago

zsxkib commented 1 year ago

I'm on Google Colab:

%%shell
git clone https://github.com/zsxkib/replicate-pano-head.git

ls

pip install numpy click pillow scipy torch requests tqdm ninja matplotlib imageio imgui glfw pyopengl imageio-ffmpeg pyspng psutil mrcfile tensorboard torchvision

cd replicate-pano-head

mkdir "/content/replicate-pano-head/models/"

FILE_ID=1FqvQzICV1H4fbQaz8BiWxtiRYxJd4T8N
DEST_PATH="/content/replicate-pano-head/models/easy-khair-180-gpc0.8-trans10-025000.pkl"
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=${FILE_ID}' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=${FILE_ID}" -O ${DEST_PATH} && rm -rf /tmp/cookies.txt

This all works. But when I run the following command I get a weird CUDA error. This is running on one T4 GPU:

python /content/replicate-pano-head/gen_videos.py --network /content/replicate-pano-head/models/easy-khair-180-gpc0.8-trans10-025000.pkl --seeds 0
Loading networks from "/content/replicate-pano-head/models/easy-khair-180-gpc0.8-trans10-025000.pkl"...
Traceback (most recent call last):
  File "/content/replicate-pano-head/gen_videos.py", line 371, in <module>
    generate_images() # pylint: disable=no-value-for-parameter
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/content/replicate-pano-head/gen_videos.py", line 334, in generate_images
    G = legacy.load_network_pkl(f)['G_ema'].to(device) # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

P.S. I have all the files required

zsxkib commented 1 year ago

Btw I get the same error when I run:

python gen_video.py --output=lerp.mp4 --trunc=1 --seeds=0-31 --grid=4x2 \
        --network=https://api.ngc.nvidia.com/v2/models/nvidia/research/stylegan3/versions/1/files/stylegan3-r-afhqv2-512x512.pkl

zsxkib commented 1 year ago

I managed to fix the error

TLDR: Change line 332 of gen_videos.py FROM device = torch.device('cuda:1') TO device = torch.device('cuda:0')


Successfully Fixed "CUDA error: invalid device ordinal" in Google Colab

If you run a deep learning model on Google Colab and hit the CUDA error "invalid device ordinal", the fix below may help. In my case the error surfaced while generating videos with a pre-trained model, and the root of the issue was the following line of code:

device = torch.device('cuda:1')

The resulting error log was as follows:

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Diagnosing the Problem

This error arises when the script requests a CUDA device that does not exist. CUDA device ordinals are zero-based, so 'cuda:1' refers to the second GPU. Google Colab typically allocates a single GPU, which is 'cuda:0', so the attempt to move the model to 'cuda:1' fails because that device is absent.
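
One way to confirm this diagnosis is to compare the requested ordinal with the number of GPUs PyTorch can see. The sketch below is illustrative only: `ordinal_is_valid` is a hypothetical helper, not part of PanoHead, and the GPU count defaults to 0 if PyTorch cannot be imported.

```python
def ordinal_is_valid(device: str, gpu_count: int) -> bool:
    """Return True if a 'cuda:N' string names one of gpu_count visible GPUs."""
    if not device.startswith("cuda"):
        return True  # non-CUDA devices such as "cpu" are not checked here
    # A bare "cuda" means the current device; treat it as ordinal 0 for this check.
    _, _, index = device.partition(":")
    ordinal = int(index) if index else 0
    return 0 <= ordinal < gpu_count


try:
    import torch
    gpu_count = torch.cuda.device_count()  # number of GPUs visible to PyTorch
except ImportError:  # lets the sketch run where PyTorch is absent
    gpu_count = 0

print("visible GPUs:", gpu_count)
print("cuda:1 usable:", ordinal_is_valid("cuda:1", gpu_count))
```

On a runtime with one GPU, the second line should report that 'cuda:1' is not usable, which matches the traceback above.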

Solution and Implementation

The solution requires us to adjust the CUDA device reference to target the correct, available GPU. As Google Colab typically offers only one GPU, the device should be set to 'cuda:0' instead of 'cuda:1'. 'cuda:0' corresponds to the first (and usually the only) GPU available on Google Colab. Therefore, change line 332 of gen_videos.py from:

device = torch.device('cuda:1')

to:

device = torch.device('cuda:0')
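
Editing a file by hand is awkward in Colab, so the same change can be applied from a notebook cell. This is a minimal sketch, not part of the repository: `patch_device` and `patch_file` are hypothetical helpers, and the path in the commented example is the one from my Colab setup earlier in this thread.

```python
from pathlib import Path

OLD = "torch.device('cuda:1')"
NEW = "torch.device('cuda:0')"


def patch_device(source: str) -> str:
    """Return source with every hard-coded cuda:1 device replaced by cuda:0."""
    return source.replace(OLD, NEW)


def patch_file(path: str) -> int:
    """Rewrite the file in place; return how many occurrences were replaced."""
    file = Path(path)
    text = file.read_text()
    count = text.count(OLD)
    if count:
        file.write_text(patch_device(text))
    return count


# Example (run once, before calling gen_videos.py):
# patch_file("/content/replicate-pano-head/gen_videos.py")
```

A plain string replacement is used instead of a line number so the patch still applies if the line moves between versions of the script.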

This straightforward adjustment directs the script to the available GPU, effectively resolving the "CUDA error: invalid device ordinal". It's crucial to correctly assign the 'cuda:n' identifier to match the intended GPU ordinal, especially when multiple GPUs are accessible.
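
To avoid hard-coding an ordinal at all, the device can be chosen from the number of GPUs that are actually visible. The following is a sketch under assumptions: `pick_device` is a hypothetical helper, and the CPU fallback is only there so the function always returns something valid. Whether gen_videos.py runs on CPU is not confirmed in this thread.

```python
def pick_device(preferred: int, gpu_count: int) -> str:
    """Return a device string that is valid for gpu_count visible GPUs."""
    if gpu_count <= 0:
        return "cpu"  # no GPU visible at all
    if 0 <= preferred < gpu_count:
        return f"cuda:{preferred}"  # the requested GPU exists, so keep it
    return "cuda:0"  # requested GPU is missing; fall back to the first one


# Possible use in a script that already imports torch:
# device = torch.device(pick_device(1, torch.cuda.device_count()))

print(pick_device(1, 1))
print(pick_device(1, 2))
```

With one visible GPU, a request for ordinal 1 resolves to 'cuda:0'. With two, the request is honored.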

After making this change and rerunning the script, the model successfully generated videos using the pre-trained model without any CUDA errors.

In conclusion, correctly specifying the CUDA device is essential when running deep learning models to ensure that computations occur on the intended GPU. This small tweak should effectively resolve the issue in most Google Colab setups.

TLDR: Change line 332 of gen_videos.py from device = torch.device('cuda:1') to device = torch.device('cuda:0') to resolve the "CUDA error: invalid device ordinal".

Zvyozdo4ka commented 6 months ago

@zsxkib I also changed the device to torch.device('cuda:0') everywhere in the code, but then new errors appeared. Have you encountered something like this after changing to cuda:0?

Setting up PyTorch plugin "bias_act_plugin"... Failed!
Traceback (most recent call last):
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1740, in _run_ninja_build
    subprocess.run(
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/media/yulduz/hdd/Projects/PanoHead/gen_videos.py", line 370, in <module>
    generate_images() # pylint: disable=no-value-for-parameter
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/media/yulduz/hdd/Projects/PanoHead/gen_videos.py", line 359, in generate_images
    gen_interp_video(G=G, mp4=output, pose_cond = pose_cond, bitrate='100M', grid_dims=grid, num_keyframes=num_keyframes, w_frames=w_frames, seeds=seeds, shuffle_seed=shuffle_seed, psi=truncation_psi, truncation_cutoff=truncation_cutoff, cfg=cfg, image_mode=image_mode, gen_shapes=shapes, device=device)
  File "/media/yulduz/hdd/Projects/PanoHead/gen_videos.py", line 94, in gen_interp_video
    ws = G.mapping(z=zs, c=c, truncation_psi=psi, truncation_cutoff=truncation_cutoff)
  File "/media/yulduz/hdd/Projects/PanoHead/training/triplane.py", line 56, in mapping
    return self.backbone.mapping(z, c * self.rendering_kwargs.get('c_scale', 0), truncation_psi=truncation_psi, truncation_cutoff=truncation_cutoff, update_emas=update_emas)
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/yulduz/hdd/Projects/PanoHead/training/networks_stylegan2.py", line 252, in forward
    x = layer(x)
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/yulduz/hdd/Projects/PanoHead/training/networks_stylegan2.py", line 126, in forward
    x = bias_act.bias_act(x, b, act=self.activation)
  File "/media/yulduz/hdd/Projects/PanoHead/torch_utils/ops/bias_act.py", line 86, in bias_act
    if impl == 'cuda' and x.device.type == 'cuda' and _init():
  File "/media/yulduz/hdd/Projects/PanoHead/torch_utils/ops/bias_act.py", line 43, in _init
    _plugin = custom_ops.get_plugin(
  File "/media/yulduz/hdd/Projects/PanoHead/torch_utils/custom_ops.py", line 138, in get_plugin
    torch.utils.cpp_extension.load(name=module_name, build_directory=cached_build_dir,
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1144, in load
    return _jit_compile(
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1357, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1469, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1756, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'bias_act_plugin': [1/2] /usr/lib/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/include -isystem /home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/include/TH -isystem /home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/include/THC -isystem /usr/lib/cuda/include -isystem /home/yulduz/anaconda3/envs/panohead/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' --use_fast_math -std=c++14 -c /home/yulduz/.cache/torch_extensions/py39_cu111/bias_act_plugin/b46266ff65f9fa53c32108953a1c6f16-nvidia-geforce-rtx-3080/bias_act.cu -o bias_act.cuda.o 
FAILED: bias_act.cuda.o 
/usr/lib/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/include -isystem /home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/include/TH -isystem /home/yulduz/anaconda3/envs/panohead/lib/python3.9/site-packages/torch/include/THC -isystem /usr/lib/cuda/include -isystem /home/yulduz/anaconda3/envs/panohead/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' --use_fast_math -std=c++14 -c /home/yulduz/.cache/torch_extensions/py39_cu111/bias_act_plugin/b46266ff65f9fa53c32108953a1c6f16-nvidia-geforce-rtx-3080/bias_act.cu -o bias_act.cuda.o 
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
  435 |         function(_Functor&& __f)
      |                                                                                                                                                 ^ 
/usr/include/c++/11/bits/std_function.h:435:145: note:         ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
  530 |         operator=(_Functor&& __f)
      |                                                                                                                                                  ^ 
/usr/include/c++/11/bits/std_function.h:530:146: note:         ‘_ArgTypes’
ninja: build stopped: subcommand failed.
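
The compiler errors in this log come from /usr/include/c++/11/bits/std_function.h, and the extension cache path is tagged py39_cu111. GCC 11 headers combined with an older CUDA toolchain is a commonly reported cause of "parameter packs not expanded" failures, but nobody in this thread confirms it here, and the log does not show the version of the system nvcc. The sketch below only extracts those two hints from a build log. `toolchain_hints` is a hypothetical helper, and the sample text is copied from the log above.

```python
import re


def toolchain_hints(log: str) -> dict:
    """Pull the C++ header version and the PyTorch CUDA tag out of a build log."""
    gcc = re.search(r"/usr/include/c\+\+/(\d+)/", log)
    cuda = re.search(r"torch_extensions/py\d+_cu(\d+)/", log)
    return {
        "libstdcxx_major": int(gcc.group(1)) if gcc else None,
        "torch_cuda_tag": cuda.group(1) if cuda else None,
    }


sample = (
    "-c /home/yulduz/.cache/torch_extensions/py39_cu111/bias_act_plugin/"
    "b46266ff65f9fa53c32108953a1c6f16-nvidia-geforce-rtx-3080/bias_act.cu\n"
    "/usr/include/c++/11/bits/std_function.h:435:145: error: "
    "parameter packs not expanded\n"
)
print(toolchain_hints(sample))  # {'libstdcxx_major': 11, 'torch_cuda_tag': '111'}
```

The tag reflects the CUDA version PyTorch was built against (11.1 here). Comparing it with the output of `nvcc --version` and `gcc --version` would show whether the toolchains actually mismatch.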

zsxkib commented 6 months ago

No, I have never seen this error before, @Zvyozdo4ka.

zsxkib commented 6 months ago

You could try @camenduru's https://github.com/camenduru/PanoHead-colab

Zvyozdo4ka commented 6 months ago

@zsxkib Thank you so much for the Colab version. I will try it.