CUDA Error when running detector

davidwhealey commented 2 months ago

On a new conda install of the detector environment, I get a CUBLAS_STATUS_INVALID_VALUE error when trying to run the detector.

python run_detector.py ~/models/md_v5a.0.0.pt --image_file ~/tmp/test_imgs.JPG --output_dir ~/tmp/test_img

Traceback (most recent call last):
  File "run_detector.py", line 676, in <module>
    main()
  File "run_detector.py", line 664, in main
    load_and_run_detector(model_file=args.detector_file,
  File "run_detector.py", line 371, in load_and_run_detector
    detector = load_detector(model_file)
  File "run_detector.py", line 342, in load_detector
    detector = PTDetector(model_file, force_cpu, USE_MODEL_NATIVE_CLASSES)
  File "/home/david/repos/MegaDetector/detection/pytorch_detector.py", line 122, in __init__
    self.model = PTDetector._load_model(model_path, self.device)
  File "/home/david/repos/MegaDetector/detection/pytorch_detector.py", line 155, in _load_model
    model = checkpoint['model'].float().fuse().eval()
  File "/home/david/repos/yolov5/models/yolo.py", line 231, in fuse
    m.conv = fuse_conv_and_bn(m.conv, m.bn)  # update conv
  File "/home/david/repos/yolov5/utils/torch_utils.py", line 205, in fuse_conv_and_bn
    fusedconv.weight.copy_(torch.mm(w_bn, w_conv).view(fusedconv.weight.shape))
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

My environment is as follows:

Collecting environment information...
PyTorch version: 1.10.1
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.15 | packaged by conda-forge | (default, Nov 22 2022, 08:46:39) [GCC 10.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-177-generic-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.3.58
GPU models and configuration:
GPU 0: NVIDIA GeForce GTX 1080 Ti
GPU 1: NVIDIA GeForce GTX 1080 Ti

Nvidia driver version: 465.19.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.4
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] torch==1.10.1
[pip3] torchvision==0.11.2
[conda] blas 2.121 mkl conda-forge
[conda] blas-devel 3.9.0 21_linux64_mkl conda-forge
[conda] cudatoolkit 11.3.1 hb98b00a_13 conda-forge
[conda] libblas 3.9.0 21_linux64_mkl conda-forge
[conda] libcblas 3.9.0 21_linux64_mkl conda-forge
[conda] liblapack 3.9.0 21_linux64_mkl conda-forge
[conda] liblapacke 3.9.0 21_linux64_mkl conda-forge
[conda] mkl 2024.0.0 ha957f24_49657 conda-forge
[conda] mkl-devel 2024.0.0 ha770c72_49657 conda-forge
[conda] mkl-include 2024.0.0 ha957f24_49657 conda-forge
[conda] numpy 1.23.5 py38h7042d01_0 conda-forge
[conda] pytorch 1.10.1 py3.8_cuda11.3_cudnn8.2.0_0 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchvision 0.11.2 py38_cu113 pytorch

agentmorris commented 2 months ago

Oops, that's not good.

I know this is like the most generic IT answer, but... can you try upgrading your NVIDIA driver? That's a pretty old driver version; it actually predates MDv5 training (the Internet says it's from early 2021).

If that doesn't work, reply here and I'll suggest some more ideas, but I'm pretty optimistic that a driver upgrade will help.

davidwhealey commented 2 months ago

Any other ideas? I had previously upgraded to 550.78, which gave me a "Failed to initialize NVML: Driver/library version mismatch" error.

I subsequently removed all NVIDIA drivers and reinstalled them via a brand new install of the CUDA toolkit 11.3, which is how I got the current version. This was two days ago. The drivers are old, but that's what the CUDA toolkit runfile installed (see here), so I'm assuming they're compatible.

agentmorris commented 2 months ago

Sigh, even the most ardent Linux enthusiasts have to admit that driver management is a horrendous pain on Linux... :)

Sorry to send you down this path again, but I think your upgrade to 550.78 was the right call, and I would recommend keeping that driver and fixing the library mismatch, rather than rolling back. One common cause of that particular error ("Driver/library version mismatch") is upgrading the driver without rebooting; I get that error every time I upgrade my NVIDIA driver on Ubuntu systems, until I reboot. Any chance you didn't reboot after the driver upgrade?

FWIW, PyTorch installs CUDA within the Python environment, so in general the version of CUDA that you have installed on your system doesn't matter. In fact, sometimes weird stuff can happen because PyTorch tools do pick up your system CUDA when you don't want them to. So if you've upgraded the driver and rebooted and you're still getting errors, here's another thing to try: before running any PyTorch/MD stuff, in the console where you're going to run MD, run:

export LD_LIBRARY_PATH=

...to make sure there are no CUDA libraries on your system library path. This is not usually necessary, and I remain optimistic that updating the driver and rebooting will fix the issue. Fingers crossed.
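If you'd rather look before clearing the variable, something like the following rough sketch lists the LD_LIBRARY_PATH entries that look CUDA-related. The function name `cuda_like_entries` and the substring match on "cuda" are assumptions made for illustration; system libraries that live in generic directories (for example a cuDNN under /usr/lib) would not be caught by this heuristic.

```python
import os
from typing import List, Optional


def cuda_like_entries(ld_library_path: Optional[str] = None) -> List[str]:
    """Return LD_LIBRARY_PATH entries whose path mentions CUDA.

    The case-insensitive substring match on "cuda" is a heuristic assumed
    for this sketch; it is not how the dynamic linker resolves libraries.
    """
    if ld_library_path is None:
        # Fall back to the value in the current process environment.
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    # Entries are colon-separated; drop empty entries.
    entries = [entry for entry in ld_library_path.split(":") if entry]
    return [entry for entry in entries if "cuda" in entry.lower()]


if __name__ == "__main__":
    suspects = cuda_like_entries()
    if suspects:
        print("LD_LIBRARY_PATH entries that may shadow PyTorch's CUDA libraries:")
        for entry in suspects:
            print("  " + entry)
    else:
        print("No CUDA-looking entries found on LD_LIBRARY_PATH.")
```

Run it in the same console where you would run MD, since LD_LIBRARY_PATH is per-shell.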

davidwhealey commented 2 months ago

Okay, I did succeed at updating the NVIDIA driver, but that didn't solve it. Removing the LD_LIBRARY_PATH variable did solve it, though. Apparently it was using the system CUDA anyway. Thanks for your help!
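For anyone who lands here with the same error, a quick way to check whether GPU matrix multiplication works, without loading a model, is a small float32 torch.mm on the GPU, which is the call that failed in the traceback above. This is only a minimal sketch: the function name `check_gpu_matmul` and the status strings are made up for illustration, and it assumes a small random matmul is enough to hit the same cuBLAS failure.

```python
def check_gpu_matmul(size: int = 64) -> str:
    """Try a small float32 matrix multiply on the GPU and report the outcome."""
    try:
        import torch
    except ImportError:
        return "torch-not-installed"

    if not torch.cuda.is_available():
        return "cuda-not-available"

    try:
        a = torch.rand(size, size, device="cuda")
        b = torch.rand(size, size, device="cuda")
        torch.mm(a, b)
        # CUDA calls are asynchronous; synchronize so any error surfaces here.
        torch.cuda.synchronize()
    except RuntimeError as err:
        return "matmul-failed: " + str(err)

    return "ok (PyTorch built against CUDA " + str(torch.version.cuda) + ")"


if __name__ == "__main__":
    print(check_gpu_matmul())
```

Running it once with LD_LIBRARY_PATH set and once with it cleared should show whether the system CUDA libraries are what breaks the call.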

agentmorris commented 2 months ago

Huzzah! Glad it worked out, happy MegaDetect'ing.

agentmorris / MegaDetector

CUDA Error when running detector #125