MouseLand / Kilosort

Fast spike sorting with drift correction for up to a thousand channels
https://kilosort.readthedocs.io/en/latest/
GNU General Public License v3.0

BUG: CUDA out of memory #677

Closed: adam-hockley closed this issue 3 months ago

adam-hockley commented 4 months ago

Describe the issue:

Hello, I'm getting a CUDA out of memory error during the drift correction, when using the kilosort gui.

Trying to run a 64 channel, ~8gb recording on an RTX3060 (12gb VRAM).

Thanks for any help!

Reproduce the bug:

No response

Error message:

Traceback (most recent call last):
  File "C:\Users\ANL\anaconda3\envs\SI_env_fresh\lib\site-packages\kilosort\gui\sorter.py", line 70, in run
    ops, bfile, st0 = compute_drift_correction(
  File "C:\Users\ANL\anaconda3\envs\SI_env_fresh\lib\site-packages\kilosort\run_kilosort.py", line 350, in compute_drift_correction
    ops, st = datashift.run(ops, bfile, device=device, progress_bar=progress_bar)
  File "C:\Users\ANL\anaconda3\envs\SI_env_fresh\lib\site-packages\kilosort\datashift.py", line 192, in run
    st, _, ops = spikedetect.run(ops, bfile, device=device, progress_bar=progress_bar)
  File "C:\Users\ANL\anaconda3\envs\SI_env_fresh\lib\site-packages\kilosort\spikedetect.py", line 255, in run
    xy, imax, amp, adist = template_match(X, ops, iC, iC2, weigh, device=device)
  File "C:\Users\ANL\anaconda3\envs\SI_env_fresh\lib\site-packages\kilosort\spikedetect.py", line 140, in template_match
    A = torch.einsum('ijk, jklm-> iklm', weigh, B[iC,:, nb*t:nb*(t+1)])
  File "C:\Users\ANL\anaconda3\envs\SI_env_fresh\lib\site-packages\torch\functional.py", line 385, in einsum
    return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.37 GiB. GPU

Version information:

CUDA 11.8, Kilosort 4.0.5, Python 3.9.19

Context for the issue:

No response

Experiment information:

No response

ananmoran commented 4 months ago

Hi, I have similar issues with CUDA. Until it's fixed by the KS team, and if you are familiar with Python, try adding torch.cuda.empty_cache() before and after GPU-heavy functions. This will release unused GPU memory. I hope it helps. Anan
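
For anyone who wants to try this, here is a minimal sketch of the pattern Anan describes; the gpu_heavy_step function below is a hypothetical stand-in for a GPU-heavy Kilosort stage, not part of the Kilosort API:

import torch

def gpu_heavy_step(x):
    # Hypothetical stand-in for a GPU-heavy Kilosort call such as spikedetect.run.
    return (x @ x.T).sum()

device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.randn(4096, 4096, device=device)

torch.cuda.empty_cache()   # return cached-but-unused blocks to the driver before the heavy step
y = gpu_heavy_step(x)
del x                      # drop references so the cache has something to release
torch.cuda.empty_cache()   # release again afterwards

Note that empty_cache() only frees memory PyTorch has cached but no longer uses; it cannot reclaim tensors that are still referenced, so it may not help if a single allocation is simply too large.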

jacobpennington commented 4 months ago

@ananmoran It is very unlikely that this is related to your issue for this size of recording.

@adam-hockley What does your probe layout look like, and what settings did you change if any?

adam-hockley commented 4 months ago

The probe is a 1x64 linear array.

It was made in the Kilosort GUI with:
y-coords: np.linspace(0, 1260, num=64)
x-coords: np.linspace(50, 50, num=64) (also had the error with x = 0 instead of 50)
chan map: np.linspace(0, 63, num=64)
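
For reference, the same geometry written out as a Kilosort-4 style probe dictionary; this is a hedged sketch, and the key names (chanMap, xc, yc, kcoords, n_chan) are taken from the Kilosort docs, so double-check them against your installed version:

import numpy as np

n = 64
probe = {
    'chanMap': np.arange(n, dtype=np.int32),                # 0..63, same mapping as np.linspace(0, 63, num=64)
    'xc': np.full(n, 50.0, dtype=np.float32),               # single column of contacts at x = 50
    'yc': np.linspace(0, 1260, num=n).astype(np.float32),   # 20 um spacing along the shank
    'kcoords': np.zeros(n, dtype=np.float32),               # one shank
    'n_chan': n,
}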

jacobpennington commented 4 months ago

Okay, two follow-ups then:

1) Can you please paste in the rest of the output you got while sorting, so I can see if anything else looks off?
2) Can you try sorting without drift correction, by setting nblocks = 0? That probe is just barely at the minimum recommended sampling density for good drift estimates, so it's possible that's introducing an artifact that leads to this issue.
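
If the GUI makes this awkward, here is a hedged sketch of option 2 from the Python API, reusing the probe dictionary sketched earlier; the settings keys and run_kilosort arguments are assumed from the Kilosort 4 docs, and the file path is a placeholder:

from kilosort import run_kilosort

settings = {
    'n_chan_bin': 64,   # total number of channels in the binary file
    'nblocks': 0,       # 0 skips drift correction entirely
}
# 'probe' is a probe dictionary like the one sketched earlier;
# 'recording.bin' is a placeholder path, not from this thread.
results = run_kilosort(settings=settings, probe=probe, filename='recording.bin')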

adam-hockley commented 4 months ago

Here's the whole output for the error during drift correction.

I also get a CUDA out of memory error later on if I skip drift correction; that output is pasted further down.


Preprocessing filters computed in 0.43s; total 0.43s

computing drift

Re-computing universal templates from data.

C:\Users\ANL\anaconda3\envs\SI_env_fresh\lib\site-packages\threadpoolctl.py:1223: RuntimeWarning: Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at the same time. Both libraries are known to be incompatible and this can cause random crashes or deadlocks on Linux when loaded in the same Python program. Using threadpoolctl may cause crashes or deadlocks. For more information and possible workarounds, please see https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md

warnings.warn(msg, RuntimeWarning)

C:\Users\ANL\anaconda3\envs\SI_env_fresh\lib\site-packages\kilosort\spikedetect.py:242: UserWarning: NaNs and/or zeroes present in weights for spikedetect.run, may need to adjust min_template_size and/or dminx for best results.

          If you're using a probe with multiple shanks, see 
          https://kilosort.readthedocs.io/en/latest/multi_shank.html

warnings.warn(msg, UserWarning)

0%| | 0/1032 [00:00<?, ?it/s]

0%| | 0/1032 [00:01<?, ?it/s]

Traceback (most recent call last):
  File "C:\Users\ANL\anaconda3\envs\SI_env_fresh\lib\site-packages\kilosort\gui\sorter.py", line 70, in run
    ops, bfile, st0 = compute_drift_correction(
  File "C:\Users\ANL\anaconda3\envs\SI_env_fresh\lib\site-packages\kilosort\run_kilosort.py", line 350, in compute_drift_correction
    ops, st = datashift.run(ops, bfile, device=device, progress_bar=progress_bar)
  File "C:\Users\ANL\anaconda3\envs\SI_env_fresh\lib\site-packages\kilosort\datashift.py", line 192, in run
    st, _, ops = spikedetect.run(ops, bfile, device=device, progress_bar=progress_bar)
  File "C:\Users\ANL\anaconda3\envs\SI_env_fresh\lib\site-packages\kilosort\spikedetect.py", line 255, in run
    xy, imax, amp, adist = template_match(X, ops, iC, iC2, weigh, device=device)
  File "C:\Users\ANL\anaconda3\envs\SI_env_fresh\lib\site-packages\kilosort\spikedetect.py", line 140, in template_match
    A = torch.einsum('ijk, jklm-> iklm', weigh, B[iC,:, nb*t:nb*(t+1)])
  File "C:\Users\ANL\anaconda3\envs\SI_env_fresh\lib\site-packages\torch\functional.py", line 385, in einsum
    return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.37 GiB. GPU


Preprocessing filters computed in 0.41s; total 0.41s

computing drift

nblocks = 0, skipping drift correction

drift computed in 0.00s; total 0.41s

Extracting spikes using templates

Re-computing universal templates from data.

C:\Users\ANL\anaconda3\envs\SI_env_fresh\lib\site-packages\kilosort\spikedetect.py:242: UserWarning: NaNs and/or zeroes present in weights for spikedetect.run, may need to adjust min_template_size and/or dminx for best results.

          If you're using a probe with multiple shanks, see 
          https://kilosort.readthedocs.io/en/latest/multi_shank.html

warnings.warn(msg, UserWarning)

0%| | 0/1032 [00:00<?, ?it/s]

0%| | 0/1032 [00:00<?, ?it/s]

Traceback (most recent call last):
  File "C:\Users\ANL\anaconda3\envs\SI_env_fresh\lib\site-packages\kilosort\gui\sorter.py", line 82, in run
    st, tF, Wall0, clu0 = detect_spikes(ops, self.device, bfile, tic0=tic0,
  File "C:\Users\ANL\anaconda3\envs\SI_env_fresh\lib\site-packages\kilosort\run_kilosort.py", line 398, in detect_spikes
    st0, tF, ops = spikedetect.run(ops, bfile, device=device, progress_bar=progress_bar)
  File "C:\Users\ANL\anaconda3\envs\SI_env_fresh\lib\site-packages\kilosort\spikedetect.py", line 255, in run
    xy, imax, amp, adist = template_match(X, ops, iC, iC2, weigh, device=device)
  File "C:\Users\ANL\anaconda3\envs\SI_env_fresh\lib\site-packages\kilosort\spikedetect.py", line 150, in template_match
    Amax = torch.max(Aa[iC2], 0)[0]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.83 GiB. GPU

jacobpennington commented 4 months ago

Hmm... thanks. The warning about nans/zeros is definitely not expected for a single-shank linear probe. Would you be willing to share the data so I can debug this?

adam-hockley commented 4 months ago

Sure, what's the best way to share it?

jacobpennington commented 4 months ago

Putting it on a google drive has been working well for others, if that's an option for you. Then you can post the link in a reply here, or you can send it to my email at jacob.p.neuro@gmail.com if you don't want it visible publicly.

adam-hockley commented 4 months ago

Thanks, here's the link. It's a 64-channel i16 (int16) file:

https://drive.google.com/file/d/1_ngJKdbuHlN1a4KJpAspv899O7bTYxmt/view?usp=sharing
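
For anyone following along, a raw i16 file like this is typically just a flat stream of int16 samples that gets reshaped to samples x channels; a hedged sketch with an assumed filename and the usual samples-major layout:

import numpy as np

n_chan = 64
data = np.memmap('recording.bin', dtype=np.int16, mode='r')   # placeholder filename
data = data.reshape(-1, n_chan)                               # one row per sample, one column per channel
print(data.shape)                                             # (n_samples, 64)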

jacobpennington commented 4 months ago

Got it, thanks. Is it 30kHz sampling rate?

adam-hockley commented 4 months ago

It's 24414.0625 Hz (TDT).
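
Since Kilosort's defaults assume 30 kHz, a TDT rate like this has to be passed explicitly; the key names below are assumed from the Kilosort 4 default settings:

settings = {
    'n_chan_bin': 64,
    'fs': 24414.0625,   # TDT sampling rate instead of the 30 kHz default
}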

jacobpennington commented 4 months ago

Well, the good news is I was able to sort the data without any errors using the default settings. The bad news is I am also using a card with 12gb of vram (and have sorted larger data sets with less vram in the past), so I'm not sure why you're seeing that error and I'm not.

A couple things to check:

1) Are there multiple video cards in that machine? If so, make sure the right one is selected in the "PyTorch device" dropdown menu.
2) Are there other applications running that might be using up the VRAM? An easy way to check is entering nvidia-smi in a terminal (without Kilosort running), which should give an output like this:

(kilosort4) PS C:\code\Kilosort> nvidia-smi
Fri Apr 26 12:32:58 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 552.22                 Driver Version: 552.22         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 Ti   WDDM  |   00000000:01:00.0  On |                  N/A |
|  0%   42C    P0             31W /  285W |    1303MiB /  12282MiB |      6%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
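
The same checks can also be run from Python with standard PyTorch calls (nothing Kilosort-specific):

import torch

print(torch.cuda.is_available())                  # should be True
print(torch.cuda.device_count())                  # how many GPUs PyTorch can see
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))   # confirm which index is the RTX 3060
    free, total = torch.cuda.mem_get_info(0)      # free / total VRAM in bytes on device 0
    print(f"{free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
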
adam-hockley commented 4 months ago

Thanks for the help Jacob. Both of those tests looked normal: there's only one discrete GPU, it was being used by CUDA, and nothing was running in the background. I've tested a new environment, and with Kilosort 4.0.6 I don't have any issues!

igfebbo commented 3 months ago

Hi @jacobpennington , I am having the same issue as Adam. I am also using a 64 channel probe. I have tried all of the above mentioned fixes to no avail. Would I also be able to send you some test data and see if you are able to run it?

jacobpennington commented 3 months ago

Hi @igfebbo , yes that would be fine.

igfebbo commented 3 months ago

Thank you very much! I am using a 64 channel H9 probe. Let me know if you need any other information. Here is the link:

https://drive.google.com/file/d/1DpwB0iItG3VtcXFZykVgUpWOSLE4a9k9/view?usp=drive_link

jacobpennington commented 3 months ago

I'm unable to access it @igfebbo, I sent an access request through google from jacob.p.neuro@gmail.com.

igfebbo commented 3 months ago

I have just shared it.


igfebbo commented 3 months ago

Sorry about that!


jacobpennington commented 3 months ago

@igfebbo What type of probe are you using? Can you share the probe file, or paste in the output of "Print Probe" in the GUI? I'm not seeing any immediate issues, so it might be related to the probe geometry.

igfebbo commented 3 months ago

I am using a Cambridge neurotech, H9 probe. Here is the probe file: https://drive.google.com/file/d/1SakRB01PjjKae8OkNIZCXQXyM2XXlc8a/view?usp=sharing

30kHz sampling rate

jacobpennington commented 3 months ago

Thanks. I still didn't see any memory problems, it never used more than about 2GB of video memory at any one time. I know it's annoying, but did you try simply restarting the machine and sorting again? You would get the same error if other processes were using up the video memory.

igfebbo commented 3 months ago

Thank you for running it. We rebooted the server and got the same error.

I am also now getting this error: Non-native QFileDialog supports only local files Non-native QFileDialog supports only local files

jacobpennington commented 3 months ago

Okay. I'll look into that, but you should be able to get around it by just typing or copy-pasting the file location into the text area instead of clicking on "choose file."

igfebbo commented 3 months ago

Thanks, Jacob. We've noticed that we're getting a warning early on that might be informative... For reference, we're running a 2080 Super, driver 525.147.05, and CUDA v12.0. This is a multi-user Debian 12 machine, but no one else is accessing the GPU when we run our tests.

When we load the file and probe map we sent you, memory usage on the GPU goes from 0 to 1.3GB. We then see this warning:

/home/randy/.local/lib/python3.9/site-packages/kilosort/io.py:497: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /opt/conda/conda-bld/pytorch_1712608885084/work/torch/csrc/utils/tensor_numpy.cpp:206.)
  X[:, self.nt : self.nt+nsamp] = torch.from_numpy(data).to(self.device).float()

When we click Run, memory usage jumps to about 7GB and then we see the error message that the GPU couldn't allocate an additional 2GB of RAM (which makes sense as this is an 8GB GPU).
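
If it helps narrow down where the jump from ~1.3 GB to ~7 GB happens, one option is to print PyTorch's own memory counters at a few points from a Python console; these are standard torch.cuda calls, not Kilosort-specific, and the tags are just examples:

import torch

def report(tag):
    # allocated = memory held by live tensors; reserved = what PyTorch's caching allocator keeps
    alloc = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    free, total = torch.cuda.mem_get_info()
    print(f"{tag}: allocated {alloc:.2f} GB, reserved {reserved:.2f} GB, "
          f"{free / 1e9:.2f} GB free of {total / 1e9:.2f} GB on the device")

report('after loading the file and probe')   # call at the points of interest
report('right before starting the sort')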

jacobpennington commented 3 months ago

The warning isn't meaningful; it's something we're aware of. However, we did suppress that warning a few versions ago, around the same time that we fixed some bugs that could cause memory problems. Can you please install the latest version of Kilosort and try again? Or let me know if you're already using the latest version and still seeing that warning.
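
A quick way to confirm which release is actually installed in the active environment, using only the standard library:

from importlib.metadata import version

print(version('kilosort'))   # should be 4.0.6 or later at the time of this thread
print(version('torch'))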

igfebbo commented 3 months ago

Ok, we got it running now. We did not realize that pip was defaulting to 4.0 rather than the latest version (4.0.6 as of yesterday).