XPixelGroup / BasicSR

Open Source Image and Video Restoration Toolbox for Super-resolution, Denoise, Deblurring, etc. Currently, it includes EDSR, RCAN, SRResNet, SRGAN, ESRGAN, EDVR, BasicVSR, SwinIR, ECBSR, etc. Also supports StyleGAN2 and DFDNet.
https://basicsr.readthedocs.io/en/latest/
Apache License 2.0
6.9k stars 1.2k forks

Error building extensions #699

Open lightandshadow68 opened 3 months ago

lightandshadow68 commented 3 months ago

I'm attempting to install BasicSR via pip and conda.

After setting the environment variable to enable extension compilation, the build fails with the following error:

 The detected CUDA version (11.5) mismatches the version that was used to compile
  PyTorch (12.1). Please make sure to use the same CUDA versions.

However, when I check the installed driver and CUDA versions, I get the following:

(ml) ubuntu@ip-10-0-83-223:~$ nvidia-smi
Thu Aug 22 18:07:23 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:1E.0 Off |                    0 |
| N/A   26C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
(ml) ubuntu@ip-10-0-83-223:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

What's odd is that nvcc and nvidia-smi do not seem to agree on the version of CUDA installed. Or is nvcc referring to the toolkit version, which can differ from the actual CUDA API version?
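
For anyone comparing these numbers: nvidia-smi reports the highest CUDA version the installed driver supports, while nvcc reports the version of the CUDA toolkit found on PATH, so the two can legitimately differ. As far as I understand, the build error above compares the toolkit version (from nvcc) against the CUDA version PyTorch was compiled with. Below is a minimal Python sketch of that comparison. The helper names `parse_nvcc_release`, `same_major`, and `report` are hypothetical, and comparing only the major version is my assumption about what matters here, not a copy of PyTorch's own check.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Return the toolkit release (e.g. '11.5') from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def same_major(version_a, version_b):
    """Compare only the major component of two 'major.minor' version strings."""
    return version_a.split(".")[0] == version_b.split(".")[0]


def report():
    """Print the nvcc toolkit version next to the CUDA version PyTorch was built with."""
    import torch  # imported lazily so the parsing helpers work without PyTorch

    nvcc = shutil.which("nvcc")
    if nvcc is None:
        print("nvcc not found on PATH")
        return
    output = subprocess.run(
        [nvcc, "--version"], capture_output=True, text=True, check=True
    ).stdout
    toolkit = parse_nvcc_release(output)
    built_with = torch.version.cuda  # None for CPU-only PyTorch builds
    print(f"nvcc toolkit: {toolkit}, PyTorch built with CUDA: {built_with}")
    if toolkit and built_with and not same_major(toolkit, built_with):
        print("Major versions differ; extension builds are likely to fail.")


# report()  # uncomment on a machine with both PyTorch and nvcc installed
```

Calling `report()` inside the same environment used for the build should show whether the toolkit picked up from PATH is the one PyTorch expects.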

lightandshadow68 commented 3 months ago

Looks like this is due to manually installing NVIDIA drivers and the CUDA Toolkit on an EC2 instance created from a base AMI.

When I use an Ubuntu PyTorch AMI to create an EC2 instance, nvcc matches, but now I'm receiving an error when BasicSR references torchvision.transforms.functional_tensor:

  File "/home/ubuntu/ml/GFPGAN/overlap_fb_retouch.py", line 6, in <module>
    from gfpgan import GFPGANer
  File "/home/ubuntu/ml/GFPGAN/gfpgan/__init__.py", line 2, in <module>
    from .archs import *
  File "/home/ubuntu/ml/GFPGAN/gfpgan/archs/__init__.py", line 2, in <module>
    from basicsr.utils import scandir
  File "/opt/conda/lib/python3.10/site-packages/basicsr/__init__.py", line 4, in <module>
    from .data import *
  File "/opt/conda/lib/python3.10/site-packages/basicsr/data/__init__.py", line 22, in <module>
    _dataset_modules = [importlib.import_module(f'basicsr.data.{file_name}') for file_name in dataset_filenames]
  File "/opt/conda/lib/python3.10/site-packages/basicsr/data/__init__.py", line 22, in <listcomp>
    _dataset_modules = [importlib.import_module(f'basicsr.data.{file_name}') for file_name in dataset_filenames]
  File "/opt/conda/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/opt/conda/lib/python3.10/site-packages/basicsr/data/realesrgan_dataset.py", line 11, in <module>
    from basicsr.data.degradations import circular_lowpass_kernel, random_mixed_kernels
  File "/opt/conda/lib/python3.10/site-packages/basicsr/data/degradations.py", line 8, in <module>
    from torchvision.transforms.functional_tensor import rgb_to_grayscale
ModuleNotFoundError: No module named 'torchvision.transforms.functional_tensor'

Seems related to: https://github.com/TencentARC/GFPGAN/issues/539
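
A possible workaround, sketched in Python under a few assumptions: the installed torchvision no longer ships `torchvision.transforms.functional_tensor`, `rgb_to_grayscale` is still available from `torchvision.transforms.functional`, and that function is the only name BasicSR needs from the missing module (it is the only one the traceback shows). The helpers `ensure_module` and `patch_functional_tensor` are hypothetical names, not part of BasicSR or torchvision. The idea is to register a small stand-in module before `basicsr` is imported, rather than editing the installed package.

```python
import importlib
import sys
import types


def ensure_module(name, **attrs):
    """Return the module `name`; if it cannot be imported, register a stand-in.

    The stand-in is an empty module carrying only the attributes passed in
    `attrs`. Note that this also hides import failures that have other causes.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        shim = types.ModuleType(name)
        for attr_name, value in attrs.items():
            setattr(shim, attr_name, value)
        sys.modules[name] = shim
        return shim


def patch_functional_tensor():
    """Make `torchvision.transforms.functional_tensor` importable again.

    Assumes `rgb_to_grayscale` exists in `torchvision.transforms.functional`.
    """
    from torchvision.transforms.functional import rgb_to_grayscale

    return ensure_module(
        "torchvision.transforms.functional_tensor",
        rgb_to_grayscale=rgb_to_grayscale,
    )


# Call this before the first `import basicsr` / `from gfpgan import GFPGANer`:
# patch_functional_tensor()
```

On a torchvision version that still has the module, `ensure_module` simply returns the real one, so the patch should be harmless there. The alternative is to change the import in `basicsr/data/degradations.py` directly, which does not survive a reinstall.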