BUG: pytorch not recognizing CUDA 12.4/12.5 drivers

MouseLand / Kilosort

Fast spike sorting with drift correction

https://kilosort.readthedocs.io/en/latest/

GNU General Public License v3.0


BUG: pytorch not recognizing CUDA 12.4/12.5 drivers #733

Closed ryan-budde closed 4 months ago

ryan-budde commented 4 months ago

Describe the issue:

What version of torch/pytorch do y'all use?

I am working on getting KS4 set up on a Slurm cluster with an A100 whose drivers are for CUDA 12.4/12.5 (I cannot easily change these, and they should be backward compatible with 11.8). I'm using the 11.8 toolkit and I've followed the dev KS4 install instructions (python 3.9, pytorch-cuda=11.8, etc.). Everything looks right, but torch.cuda.is_available() always fails. torch.version.cuda shows 11.8 as expected, nvidia-smi shows my A100, and nvcc --version shows 11.8.
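For anyone hitting the same symptoms, the checks described above can be gathered in one place. This is a minimal sketch, not part of Kilosort: the function name `cuda_report` and the dictionary keys are made up for illustration, and it only uses standard PyTorch attributes (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`, `torch.cuda.device_count()`, `torch.cuda.get_device_name()`). It tolerates a missing or broken torch install so that it can be run anywhere.

```python
import importlib.util


def cuda_report():
    """Collect the PyTorch/CUDA facts discussed in this issue into a dict.

    Hypothetical helper for illustration; the key names are arbitrary.
    """
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    try:
        import torch
    except Exception as exc:  # a broken install can fail at import time
        report["import_error"] = repr(exc)
        return report

    report["torch_version"] = torch.__version__        # e.g. "2.3.1+cu118"
    report["built_for_cuda"] = torch.version.cuda      # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `built_for_cuda` shows a version but `cuda_available` is False, the build itself is CUDA-enabled and the problem more likely lies in the driver, the node, or the process environment than in the choice of torch version.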

I'm trying to work out whether this is my fault, the cluster's fault, or pytorch's fault, and I want to test against a known-working version of pytorch (I ask because the install instructions don't specify one).

Reproduce the bug:

n/a

Error message:

n/a

Version information:

My torch says 2.3.1+cu118

jacobpennington commented 4 months ago

I have been using pytorch version 2.2.1. I know I've used at least a few other versions >2.0 over the course of development but don't remember the exact numbers.

From a quick Google search, I see a number of people with similar pytorch issues related to 12.4 and 12.5. It looks like some of them were able to fix it by uninstalling their existing cuda and pytorch installation (or starting a fresh environment), then installing the 12.1 toolkit instead, i.e. following the instructions in our README but changing pytorch-cuda=11.8 to pytorch-cuda=12.1.

ryan-budde commented 4 months ago

Thanks! Question then: are there important, specific reasons that you specify CUDA 11.8 and Python 3.9? I can see that numpy<2.0 was due to a known bug. Are there known bugs with CUDA > 11.8 and Python > 3.9? Or are these simply the versions you used in development, and so are known to work?

jacobpennington commented 4 months ago

Those are just the versions used in development. As noted in the README, python 3.10 should work as well (and we include that version in our testing). Anything outside of 3.9 and 3.10 might work, but we don't specifically test those right now, so there might be some new or deprecated functions that cause errors. There is also a note about determining the correct versions for pytorch and cuda, with a link that might be helpful:

If pytorch installation still fails, follow the instructions here to determine what version of pytorch to install. The Anaconda install is strongly recommended on Windows, and then choose the CUDA version that is supported by your GPU (newer GPUs may need newer CUDA versions > 10.2)

ryan-budde commented 4 months ago

Update: the terminal recognizes everything, so this is the fault of VS Code / Jupyter and not a KS issue. I'm currently working on running the KS example in a simple .py script.
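The thread does not say what differed between the terminal and VS Code / Jupyter. One common cause (an assumption here, not confirmed above) is that the notebook kernel runs a different interpreter or environment than the shell, or runs on a node with different environment variables. A minimal sketch for comparing the two, using only the standard library; `runtime_fingerprint` is a hypothetical helper name:

```python
import os
import sys


def runtime_fingerprint():
    """Describe which Python is running and a few GPU-relevant variables.

    Hypothetical helper: run it once in the terminal and once in the
    notebook kernel, then compare the two outputs.
    """
    return {
        "executable": sys.executable,   # path of the running interpreter
        "prefix": sys.prefix,           # root of the active environment
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        # Environment variables that may differ between shell and kernel;
        # each is None when unset.
        "CONDA_DEFAULT_ENV": os.environ.get("CONDA_DEFAULT_ENV"),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH"),
    }


if __name__ == "__main__":
    for key, value in runtime_fingerprint().items():
        print(f"{key}: {value}")
```

If `executable` or `prefix` differs between the two runs, the notebook is not using the environment where torch was installed. If they match but the environment variables differ, the kernel was probably launched outside the shell session (or Slurm job) that has GPU access.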