MouseLand / Kilosort

Fast spike sorting with drift correction for up to a thousand channels
https://kilosort.readthedocs.io/en/latest/
GNU General Public License v3.0

BUG: Crash of Kilosort and/or computer during first clustering #766

Open · ttmysd70c6f5 opened this issue 2 months ago

ttmysd70c6f5 commented 2 months ago

Describe the issue:

Hi, I have a problem where Kilosort4 and/or the computer crashes while running the first clustering. I observed that CPU, RAM, and GPU usage was very high (close to 100%) while Kilosort4 was running.

My data were recorded with a Neuropixels 1.0 probe (384 channels), and each file is around 60~120 GB. However, Kilosort4 completed without issue when I ran it on a smaller, 2 GB dataset (downloaded from http://www.kilosort.org/downloads/ZFM-02370_mini.imec0.ap.bin). So I suspect my dataset is too big for Kilosort4, although it should not be.
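For context, that file size by itself is in the expected range for Neuropixels 1.0: a rough back-of-envelope calculation (the recording duration below is only an illustrative assumption) shows how quickly raw int16 data accumulates:

# Rough data-rate estimate for a Neuropixels 1.0 AP-band recording.
n_channels = 384
sample_rate_hz = 30_000
bytes_per_sample = 2          # int16

mb_per_second = n_channels * sample_rate_hz * bytes_per_sample / 1e6   # ~23 MB/s
hours = 1.5                   # illustrative recording length
total_gb = mb_per_second * hours * 3600 / 1e3
print(f"~{mb_per_second:.0f} MB/s, ~{total_gb:.0f} GB for {hours} h")  # ~23 MB/s, ~124 GB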

It is worth noting that my colleagues using similar hardware and datasets are not experiencing this issue. However, I still encountered crashes when I duplicated their conda environment on my hardware and ran spike sorting on my data.

I have already tried to (1) use an SSD, (2) reinstall conda and recreate the conda environment, (3) apply the "Clear PyTorch Cache" option, (4) try a different dataset with a similar recording duration, and (5) check the CUDA version, but none of these resolved the issue.

I have attached the Kilosort4 settings from the run that crashed. Kilosort4 created no log file in the output directory.

I greatly appreciate your help in figuring out the solution for this issue. Thank you!

Reproduce the bug:

No response

Error message:

No response

Version information:

My hardware is as follows:

CPU: Intel Core i9-13900K, 3.00 GHz
OS: Windows 10, 64-bit
RAM: 128 GB
GPU: NVIDIA GeForce RTX 4090

Kilosort: 4.0.5 and 4.0.16
CUDA toolkit: 11.8
NVIDIA driver

Kilosort settings:

settings = {
    'data_file_path': WindowsPath('F:/Data/Kilosort4/test/20240208_082558_merged.probe1.dat'),
    'results_dir': WindowsPath('F:/Data/Kilosort4/test/kilosort4'),
    'probe': '... (use print probe)',
    'probe_name': 'channelmap_probe1_240208_082558.mat',
    'data_dtype': 'int16', 'n_chan_bin': 384, 'fs': 30000.0, 'batch_size': 60000,
    'nblocks': 1, 'Th_universal': 9.0, 'Th_learned': 8.0, 'tmin': 0.0, 'tmax': inf,
    'nt': 61, 'shift': None, 'scale': None, 'artifact_threshold': inf, 'nskip': 25,
    'whitening_range': 32, 'highpass_cutoff': 300.0, 'binning_depth': 5.0,
    'sig_interp': 20.0, 'drift_smoothing': [0.5, 0.5, 0.5], 'nt0min': None,
    'dmin': None, 'dminx': 32.0, 'min_template_size': 10.0, 'template_sizes': 5,
    'nearest_chans': 10, 'nearest_templates': 100, 'max_channel_distance': None,
    'templates_from_data': True, 'n_templates': 6, 'n_pcs': 6, 'Th_single_ch': 6.0,
    'acg_threshold': 0.2, 'ccg_threshold': 0.25, 'cluster_downsampling': 20,
    'x_centers': None, 'duplicate_spike_ms': 0.25,
}
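For reference, a run with these same settings can also be launched from the Python API rather than the GUI, which sometimes surfaces a traceback in the terminal that the GUI swallows. This is only a minimal sketch assuming the standard kilosort.run_kilosort entry point; exact keyword arguments may vary slightly across 4.0.x releases, so check the docstring of your installed version:

from kilosort import run_kilosort

# Only the required/changed settings need to be passed; everything else uses defaults.
settings = {
    'filename': 'F:/Data/Kilosort4/test/20240208_082558_merged.probe1.dat',
    'n_chan_bin': 384,
    'fs': 30000.0,
    'batch_size': 60000,
    'nblocks': 1,
}

# probe_name points at the same channel map used in the GUI run above.
results = run_kilosort(
    settings=settings,
    probe_name='channelmap_probe1_240208_082558.mat',
    results_dir='F:/Data/Kilosort4/test/kilosort4',
    data_dtype='int16',
)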

jacobpennington commented 2 months ago

@ttmysd70c6f5 It says you used Kilosort versions 4.0.5 and 4.0.16. Did it crash in the same place when using the latter version? That should have generated a log file at 'F:/Data/Kilosort4/test/kilosort4/kilosort4.log'.

ttmysd70c6f5 commented 2 months ago

Hi @jacobpennington, thank you for your response, and I am sorry for my late reply. The crash happened in the same place regardless of the version. I have attached the log file for the crash with Kilosort 4.0.16. However, the crash happened in a different place when I ran Kilosort on the same data on a different computer with more RAM (192 GB); somehow, no log file was produced for that crash. I observed that the first clustering progressed further on the computer with more RAM. kilosort4.log

jacobpennington commented 2 months ago

Okay, thanks. If you're using the GUI, can you also attach a screenshot of what Kilosort4 looks like after loading the data?

ttmysd70c6f5 commented 2 months ago

Sure, here is a screenshot of the GUI. The loaded data is different from the dataset used for the log I shared before, but I encountered the same crash with this data too. [Screenshot of the Kilosort4 GUI after loading the data.]

katjaReinhard commented 2 months ago

Hi all, I'm encountering a similar issue. I have a GeForce RTX 3060 graphics card, an SSD, and plenty of RAM and CPU cores. My Kilosort4 installation worked well for an NP1 recording about 4500 s long, but it now always crashes at the same point ("Extracting spikes using cluster waveforms") for a ~6500 s recording. I've attached the log file. This should be far from maxing out my computer's capabilities. Any advice is appreciated! kilosort4.log

jacobpennington commented 2 months ago

@katjaReinhard Can you please check that you're attaching the correct log file? That one is identical to the one uploaded above. Maybe try renaming it before uploading; it's possible GitHub isn't handling the name clash correctly.

katjaReinhard commented 1 month ago

Thanks for the heads-up! I've attached it again as kilosort4_KR. kilosort4_KR.log

jacobpennington commented 1 month ago

Thanks. Just double-checking: neither of you saw any error message in the GUI or the terminal you were using to run Kilosort? I can see there's no error in the log file; I'm just wondering if there was any information elsewhere that didn't make it into the log for some reason. Otherwise, what happened exactly when it crashed? Did the GUI just close on its own?

katjaReinhard commented 1 month ago

In my case the GUI closed and the only message in the terminal was "killed"; otherwise nothing. As you can see in the log, we're loading a raw file after some preprocessing. I haven't yet tried running Kilosort directly on the .bin file, but we didn't have problems with the raw files for other experiments.

ttmysd70c6f5 commented 1 month ago

In my case, no message showed up in the terminal or the GUI except for this warning message:

C:\Users\User\anaconda3\envs\kilosort\lib\site-packages\threadpoolctl.py:1214: RuntimeWarning: 
Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md

  warnings.warn(msg, RuntimeWarning)

though I guess this message is about something different. When the crash happened, the GUI just closed on its own. It was similar when the whole computer crashed: it just shut down and restarted.

katjaReinhard commented 1 month ago

We are still getting the "killed" error, but now also a "CUDA error" when Kilosort crashes. nvidia-smi reports "GPU detected critical Xid error". It seems we have some issues with our graphics card, at least for long recordings. I'm not sure that's the only issue, though, since we didn't get the CUDA error before.

jacobpennington commented 1 month ago

Interesting... that type of crash does make me think it's related to the hardware rather than Kilosort, but it's definitely something to check on. Are either of you able to share your data so I can try to reproduce the issue on my machine?

katjaReinhard commented 1 month ago

Thank you for brainstorming with us! I'll have to check some options for sharing data. In any case, the possible graphics card failure really bothered me, so I'm now writing GPU and CPU details to a log file while running Kilosort (see the sketch below), and I'm quite surprised to see that GPU memory usage is consistently around 6 GB (out of 12) while my CPU RAM is constantly close to the limit (25-30 GB; I have 32 GB). From the guidelines I understood that CPU RAM shouldn't be that critical and that 32 GB should be enough for these kinds of recordings (<2 h). Are there any settings in Kilosort that I could adjust to reduce the CPU RAM usage? I think this is actually what killed Kilosort (I'll know for sure once I have the final logs). I've already increased swap to 8 GB, but I'm wondering if there's anything else I can do.
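The thread doesn't include the actual monitoring script; a minimal sketch of this kind of resource logging, assuming psutil is installed and nvidia-smi is on the PATH (the file name and sampling interval are arbitrary), could look like this, run in a separate terminal while Kilosort is sorting:

import subprocess
import time

import psutil

LOG_PATH = "resource_usage.log"  # hypothetical output file

def gpu_memory_used_mib():
    # Query used GPU memory via nvidia-smi (one value per GPU, in MiB).
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [int(x) for x in out.stdout.split()]

with open(LOG_PATH, "a") as f:
    while True:
        ram_gb = psutil.virtual_memory().used / 1e9
        gpu_mib = gpu_memory_used_mib()
        f.write(f"{time.strftime('%H:%M:%S')}  RAM used: {ram_gb:.1f} GB  GPU used: {gpu_mib} MiB\n")
        f.flush()
        time.sleep(10)  # sample every 10 seconds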

katjaReinhard commented 1 month ago

@jacobpennington After installing Kilosort4 from scratch on two computers running Linux as well as on one Windows computer, I can confirm from the log files that the issue we encounter is CPU RAM being maxed out. We have 32 GB of RAM; it hovers around 20-30 GB for most of the process and then, usually around the Extracting Spikes step or at the latest at the first clustering step, it exceeds 32 GB and crashes. We have an RTX 3060 12 GB graphics card with a correctly installed NVIDIA driver and CUDA. The graphics card is recognized by Kilosort and uses around 5-7 GB throughout the process. Given the computer specs suggested on the Kilosort site, this should not be a problem for a <2 h recording with one probe. I also have the impression that other issues reported here might refer to the same problem. Do you have any idea why Kilosort is using more CPU RAM than anticipated? This is really a big issue for us, since we're sitting on a bunch of data that we have no way to process and analyse right now.

jacobpennington commented 1 month ago

Yes, using more than 32 GB is definitely not expected for a recording of that size. Some steps of the pipeline primarily use the GPU, and others primarily use the CPU, so presumably something about your data, probe, or settings is causing one of those steps to use much more memory than usual. Unfortunately, it's hard for me to debug that from the log alone.

Options to narrow things down would be:

ttmysd70c6f5 commented 1 month ago

I can share my data with you. How can I send my data to you?

jacobpennington commented 1 month ago

A Google Drive link has worked best. You can post the link here, or e-mail it to me at jacob.p.neuro@gmail.com.

ttmysd70c6f5 commented 1 month ago

I shared a Google Drive folder with the data and the channel map file. Here is the link: https://drive.google.com/drive/folders/1wwjdNpaIlEsRug1y0UUSkaYnu8MM9_K1?usp=sharing

katjaReinhard commented 1 month ago

@jacobpennington We just found out that the reason Kilosort used so much RAM in our hands is that the preprocessing step we did in spikeinterface saved the data as float64, which was not apparent at all. In any case, we changed the format in which the spikeinterface output is saved, and Kilosort now runs on our data. Apologies for taking up your time - we can only hope someone else with a similar problem will be able to figure it out thanks to this report :)
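For anyone hitting the same symptom: float64 samples are 8 bytes each versus 2 bytes for int16, so a float64 copy of a 384-channel recording is four times larger, and Kilosort's working memory grows with it. The exact fix used above isn't shown; a rough sketch of checking and forcing the dtype in spikeinterface, assuming a recent version that provides the astype preprocessing step (paths and filter settings below are placeholders), might look like:

import spikeinterface.extractors as se
import spikeinterface.preprocessing as spre

# Placeholder path/stream for illustration.
recording = se.read_spikeglx("path/to/recording_folder", stream_id="imec0.ap")
print(recording.get_dtype())  # if this prints float64, the saved binary is 4x larger than int16

# Filtering returns float traces, so cast back to int16 before saving the copy fed to Kilosort.
rec_filtered = spre.highpass_filter(recording, freq_min=300)
rec_int16 = spre.astype(rec_filtered, dtype="int16")
rec_int16.save(folder="path/to/preprocessed_int16", format="binary")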

jacobpennington commented 1 month ago

@katjaReinhard Great, thanks for the information!

Just a reminder for anyone else coming across this issue: before submitting a bug report, please try running KS4 on its own, without spikeinterface (or any other third-party package), as an initial debugging step.