Closed: Lathomas42 closed this 3 months ago.
@Lathomas42 Are you able to share the data you're seeing this issue with? Also, can you please share the error message you're getting without making that change?
I want to look into this more before making that change, because that is not how reserved memory works. Clearing the reserved cache on each iteration can slow down clustering substantially, because it forces PyTorch to request new memory from the driver for every new tensor instead of reusing memory that is already reserved. Reserved memory does not mean a tensor is still occupying that memory. More likely this is pointing to a memory fragmentation issue that can be fixed without a performance hit.
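For reference, the allocated/reserved distinction can be seen with a minimal standalone PyTorch snippet (not Kilosort code; the tensor size is arbitrary):

```python
import torch

x = torch.zeros(256, 1024, 1024, device='cuda')    # ~1 GiB of float32
del x

# The tensor is gone, so allocated memory drops, but PyTorch keeps the block
# cached (reserved) so the next allocation can reuse it without another
# request to the CUDA driver:
print(torch.cuda.memory_allocated() // 2**20, 'MiB allocated')   # ~0
print(torch.cuda.memory_reserved() // 2**20, 'MiB reserved')     # ~1024

y = torch.zeros(256, 1024, 1024, device='cuda')     # reuses the cached block
```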
@jacobpennington Sounds good, I figured that was the case; it only slowed down my clustering by a small percentage. However, I totally understand your desire to fix it properly. I spent a long time trying to track down exactly where this issue comes from using torch's memory profiling tools, and have no idea. I was more or less putting this here so that people who hit these bugs can add these lines of code and get their data sorted. I know this held me up for a week or so, not being able to sort this file.
My error message is the same as #746; it usually crashes on some line in the cluster function, such as `vexp = 2 * Xg @ Xc.T - (Xc**2).sum(1)`.
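For context, a rough sketch of what that line computes (the shapes are made up for illustration; `Xg` presumably holds spike features and `Xc` cluster centroids):

```python
import torch

n_spikes, n_features, n_clusters = 100_000, 61, 200        # assumed sizes
Xg = torch.randn(n_spikes, n_features, device='cuda')
Xc = torch.randn(n_clusters, n_features, device='cuda')

# vexp[i, j] = 2 * <Xg[i], Xc[j]> - ||Xc[j]||^2
#            = ||Xg[i]||^2 - ||Xg[i] - Xc[j]||^2,
# i.e. a similarity that is largest for the nearest centroid. The result is a
# dense (n_spikes, n_clusters) matrix, so with real spike counts this is a
# large single allocation that must come from one contiguous free block.
vexp = 2 * Xg @ Xc.T - (Xc ** 2).sum(1)
```

Because the caching allocator needs one free block big enough for that matrix, a fragmented cache can fail here even when the total free memory looks sufficient.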
I'm going to close this because I've added the change as an optional feature in v4.0.15 (using the `clear_cache` argument of `run_kilosort`, or through the GUI). I will continue looking into where the memory fragmentation could be coming from, but that should address the issue in the meantime.
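For anyone landing here, usage would look something like this (the settings and path are placeholders; probe and other arguments are omitted):

```python
from kilosort import run_kilosort

# Placeholder settings for illustration only.
settings = {'filename': '/path/to/recording.bin', 'n_chan_bin': 385}

# clear_cache=True asks Kilosort to empty PyTorch's CUDA cache during
# clustering (per the discussion above), trading some speed for less
# memory fragmentation.
results = run_kilosort(settings=settings, clear_cache=True)
```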
This seems to be a bug in torch: when you reference a fragment of a large torch tensor (`Xg` in `cluster`), the whole tensor remains in memory as reserved memory even after the original variable goes out of scope, as long as a portion of it is copied or referenced. You can force torch to release this memory by calling `empty_cache` (a minimal sketch of this behavior follows the specs below). I am not sure if this is specific to my setup; my system specs are:
- GPU: 1080 Ti
- OS: Ubuntu 20.04
- CUDA: 11.8
- Torch: 2.3.1+cu118
- Kilosort: 0.1.dev1248+gc664741
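A minimal standalone illustration of that behavior (tensor sizes are arbitrary, not Kilosort's):

```python
import torch

big = torch.zeros(512, 1024, 1024, device='cuda')   # ~2 GiB of float32
small = big[:1]                                      # a view: shares big's storage
del big

# The view keeps the entire 2 GiB storage alive, not just its own slice:
print(torch.cuda.memory_allocated() // 2**20, 'MiB allocated')   # still ~2048

small = small.clone()                                # copying breaks the tie to the old storage
print(torch.cuda.memory_allocated() // 2**20, 'MiB allocated')   # now only a few MiB
print(torch.cuda.memory_reserved() // 2**20, 'MiB reserved')     # cache still holds ~2048

torch.cuda.empty_cache()                             # hand the cached blocks back to the driver
print(torch.cuda.memory_reserved() // 2**20, 'MiB reserved')     # drops back near the allocated amount
```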
The impact of this change is easily viewable by adding it after the cluster call.
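A hedged sketch of the kind of diagnostic that makes the impact visible (the helper below is for illustration only, not part of Kilosort):

```python
import torch

def report_gpu_memory(tag):
    # Print current allocated vs. reserved GPU memory.
    alloc = torch.cuda.memory_allocated() / 2**30
    resv = torch.cuda.memory_reserved() / 2**30
    print(f'{tag}: allocated={alloc:.2f} GiB, reserved={resv:.2f} GiB')

# e.g. immediately after the clustering step:
report_gpu_memory('before empty_cache')
torch.cuda.empty_cache()
report_gpu_memory('after empty_cache')
```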
I think this is related to bugs:

- #746
- #670
- #743
After this change I can sort a file that would fail 100% of the time otherwise; when I revert it, the sort fails again. My GPU memory consumption is actually drastically lower with this change.