PetervanLunteren / EcoAssist

Simplify camera trap image analysis with ML species recognition models built around the MegaDetector model
MIT License

CUDA error: CUBLAS_STATUS_NOT_SUPPORTED #16

Closed animikhaich closed 1 year ago

animikhaich commented 1 year ago

After a standard installation, I tried to run the test steps as outlined here.

I encountered the error shown in the screenshot attached to this issue.

As suggested in issues #9 and #6, I verified that md_v5a.0.0.pt and md_v5b.0.0.pt exist in .EcoAssist_files/pretrained_models.

The stdout.txt log dump is given below:

EXECUTED: start_deploy({})

EXECUTED: deploy_model({'path_to_image_folder': '/home/ani/Downloads/test-images', 'selected_options': ['--output_relative_filenames', '--recursive'], 'data_type': 'img'})

EXECUTED: switch_yolov5_git_to({'model_type': 'old models'})

command:

["'/home/ani/.EcoAssist_files/miniforge/envs/ecoassistcondaenv/bin/python' '/home/ani/.EcoAssist_files/cameratraps/detection/run_detector_batch.py' '/home/ani/.EcoAssist_files/pretrained_models/md_v5a.0.0.pt' '--output_relative_filenames' '--recursive' '/home/ani/Downloads/test-images' '/home/ani/Downloads/test-images/image_recognition_file.json'"]

Fusing layers... 
5 image files found in the input directory
PyTorch reports 1 available CUDA devices
GPU available: True
Using PyTorch version 1.10.1
Traceback (most recent call last):
  File "/home/ani/.EcoAssist_files/cameratraps/detection/run_detector_batch.py", line 816, in <module>
    main()
  File "/home/ani/.EcoAssist_files/cameratraps/detection/run_detector_batch.py", line 785, in main
    results = load_and_run_detector_batch(model_file=args.detector_file,
  File "/home/ani/.EcoAssist_files/cameratraps/detection/run_detector_batch.py", line 402, in load_and_run_detector_batch
    detector = load_detector(model_file)
  File "/home/ani/.EcoAssist_files/cameratraps/detection/run_detector.py", line 289, in load_detector
    detector = PTDetector(model_file, force_cpu, USE_MODEL_NATIVE_CLASSES)        
  File "/home/ani/.EcoAssist_files/cameratraps/detection/pytorch_detector.py", line 50, in __init__
    self.model = PTDetector._load_model(model_path, self.device)
  File "/home/ani/.EcoAssist_files/cameratraps/detection/pytorch_detector.py", line 62, in _load_model
    model = checkpoint['model'].float().fuse().eval()  # FP32 model
  File "/home/ani/.EcoAssist_files/yolov5/models/yolo.py", line 231, in fuse
    m.conv = fuse_conv_and_bn(m.conv, m.bn)  # update conv
  File "/home/ani/.EcoAssist_files/yolov5/utils/torch_utils.py", line 205, in fuse_conv_and_bn
    fusedconv.weight.copy_(torch.mm(w_bn, w_conv).view(fusedconv.weight.shape))
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
ERROR:
local variable 'elapsed_time' referenced before assignment

DETAILS:
Traceback (most recent call last):
  File "EcoAssist/EcoAssist_GUI.py", line 1232, in start_deploy
    deploy_model(var_choose_folder.get(), additional_img_options, data_type = "img")
  File "EcoAssist/EcoAssist_GUI.py", line 1072, in deploy_model
    progress_stats['text'] = create_md_progress_lbl(elapsed_time = elapsed_time,
UnboundLocalError: local variable 'elapsed_time' referenced before assignment

My System Information:

ani@Arc
-------
OS: Kubuntu 23.04 x86_64
Host: MS-7D43 1.0
Kernel: 6.2.0-26-generic
Uptime: 32 mins
Packages: 2819 (dpkg), 13 (snap)
Shell: bash 5.2.15
Resolution: 2560x1080
DE: Plasma 5.27.4
WM: KWin
Theme: [Plasma], Breeze [GTK2/3]
Icons: [Plasma], Breeze-openSUSE Dark Icons [GTK2/3]
Terminal: konsole
CPU: 12th Gen Intel i7-12700F (16) @ 4.800GHz
GPU: NVIDIA GeForce RTX 3090 Ti
Memory: 3879MiB / 31931MiB

Nvidia Driver:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  Off |
|  0%   49C    P8    19W / 450W |    415MiB / 24564MiB |     17%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Default CUDA Version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
agentmorris commented 1 year ago

Hmmm... I haven't seen this before. I'm about 80% sure this is not really a dimensionality issue but a CUDA version mismatch: for some reason the system CUDA environment is being used instead of the Python environment's. A couple of things to try, starting with the easiest:

  1. In the shell from which you are launching EcoAssist, try running:

    export LD_LIBRARY_PATH=''

    ...prior to starting EcoAssist. I'm 61% sure this will fix the problem, and if that's the case, we have an easy fix, and I get to grumble about how I wish CUDA installs wouldn't mess with LD_LIBRARY_PATH.

  2. It would help with debugging if we could take EcoAssist out of the loop, just to remove a level of indirection. If the person who owns the environment is up for it, it would be great to go through the MegaDetector setup instructions. If we can repro the issue there, we'll have a simpler time debugging.

  3. I don't really recommend that the environment owner do this, but FWIW, I think uninstalling CUDA entirely from the system will fix the issue. In principle I'd like to do this as a debugging step, but it's a big hammer to wield if the user is using the system CUDA for other things.

  4. I don't think we'll go past (2) just yet, but if (1) doesn't work, and we can repro the problem in a standalone Python environment (i.e., outside of EcoAssist), we can try upgrading PyTorch in that environment to match the system CUDA version. If that works, we've at least verified that it really was a CUDA version mismatch, and then we can decide what to do about it.
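The mismatch hypothesis in (1) and (4) can be checked directly before changing anything. A minimal shell sketch (paths and version strings will differ per machine; `python` here stands in for the interpreter inside the EcoAssist conda environment):

```shell
# Compare the CUDA toolkit PyTorch was built against with the system toolkit,
# and inspect LD_LIBRARY_PATH, which is what can make the dynamic loader pick
# the system cuBLAS over the copy shipped with the conda environment.
torch_cuda=$(python -c "import torch; print(torch.version.cuda)" 2>/dev/null || echo "unknown")
system_cuda=$(nvcc --version 2>/dev/null | grep -o 'release [0-9.]*' || echo "unknown")
echo "PyTorch built against CUDA: ${torch_cuda}"
echo "System CUDA toolkit:        ${system_cuda}"
echo "LD_LIBRARY_PATH:            ${LD_LIBRARY_PATH:-<empty>}"
```

If the two versions disagree and LD_LIBRARY_PATH points into the system CUDA install, the version-mismatch explanation fits.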

PetervanLunteren commented 1 year ago

@agentmorris Thanks for your response!

@animikhaich Regarding option 1: the easiest way to run export LD_LIBRARY_PATH='' prior to opening EcoAssist is to add that line somewhere before the python command on line 109 in /home/ani/.EcoAssist_files/EcoAssist/open.command.
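For reference, the change could look like the sketch below. Only the export line is the actual fix; the surrounding comments are placeholders, not the real contents of open.command:

```shell
# Hypothetical excerpt of open.command after the edit.

# Clear LD_LIBRARY_PATH so the dynamic loader resolves cuBLAS from the conda
# environment's PyTorch install rather than from the system-wide CUDA toolkit.
export LD_LIBRARY_PATH=''

# ...the existing python command that launches EcoAssist follows here...
```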

animikhaich commented 1 year ago

Thanks @agentmorris and @PetervanLunteren. Option 1 resolved it!