flatironinstitute / CaImAn

Computational toolbox for large scale Calcium Imaging Analysis, including movie handling, motion correction, source extraction, spike deconvolution and result visualization.
https://caiman.readthedocs.io
GNU General Public License v2.0

Kernel dying during motion correction #1338

Closed Mraymon5 closed 6 months ago

Mraymon5 commented 6 months ago

Your setup:

  1. Operating System: Ubuntu 22.04.4
  2. Hardware type (x86, ARM..) and RAM: Intel® Core™ i7-14700K × 28, 64GiB ram
  3. Python Version (e.g. 3.9): 3.9
  4. Caiman version (e.g. 1.9.12): 1.10.3
  5. Which demo exhibits the problem (if applicable): demo_pipeline_cnmfE.py
  6. How you installed Caiman (pure conda, conda + compile, colab, ..): Conda
  7. Details:

I recently started working on a new Ubuntu machine, and installed python, caiman, and spyder, which I was familiar with using on my previous Windows machine.

I have a modified version of demo_pipeline_cnmfE, and when I tried to run it on my data I got an error once I started running the motion correction chunk. The first sign that something was wrong was that the console started churning out this warning:

    81873 [movies.py: load():1332][247231] Your tif file is saved a single pagefile. Performance will be affected

The first number seems to have something to do with the number of frames in the .tif, and a new warning is produced at seemingly arbitrary intervals [81873, 82048, 82079, 82414, 82548, ...]. The second number (1332) is the same every time, and the third number moves around somewhat randomly in the range 247212-247235.

Anyway, Spyder produces hundreds (thousands?) of these warnings, then eventually changes to a different warning:

/home/[user]/anaconda3/envs/caiman/lib/python3.9/site-packages/spyder/plugins/ipythonconsole/scripts/conda-activate.sh: line 18: 247106 Killed                  $CONDA_ENV_PYTHON -m spyder_kernels.console -f $SPYDER_KERNEL_SPEC

Restarting kernel...

Which is only printed once, because then the kernel is restarted.

At first I was worried that one of my modifications to the code was causing the problem, so I then ran the default demo_pipeline_cnmfE code. If I run it on the included video files, it gets through the motion correction just fine. However, if I change it to work on MY video (changing nothing else), then the same string of errors is produced once motion correction starts.

Mraymon5 commented 6 months ago

The TIF file is very large (13 gb; 71370 frames, 390x361 px). I've run a downsampled version of it successfully in the past (on the previous Windows machine).

In the past when I've run videos that were too large for my machine, though, I got a Memory error, not whatever this is.

pgunn commented 6 months ago

I think the reason you get that from spyder is just that the log is written from spyder's point of view - it's not really a warning about spyder itself; the ipython kernel launched to do the job is dying (presumably killed by the Linux kernel). I imagine that if you run dmesg you'd see a message about the OOM killer needing to kill your process.
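If it helps, a quick way to check for that from Python is something like the sketch below; it just shells out to dmesg and filters the output (reading the kernel log may need elevated privileges on some systems, and `dmesg | grep -i oom` in a terminal does the same thing).

    import subprocess

    # Dump the kernel ring buffer and look for OOM-killer activity.
    log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    for line in log.splitlines():
        if "oom" in line.lower() or "out of memory" in line.lower():
            print(line)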

One thing you might try is converting your video to hdf5 format; I think this is less likely to run into memory issues. Another option would be to split it into multiple tiff files, but the former is more likely to get you everything you need, I think.
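For reference, a minimal sketch of doing the conversion with caiman itself is below; the filenames are placeholders, and note that cm.load() reads the whole movie into RAM, so on a very large tif you may want to convert it piecewise or use another tool.

    import caiman as cm

    # Load the tif and re-save it as HDF5 (filenames are placeholders).
    # cm.load() pulls the whole movie into memory; loading a subset of
    # frames via the subindices argument is one way to convert a very
    # large file in pieces.
    m = cm.load('your_movie.tif')
    m.save('your_movie.hdf5')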

Gigantic data files can be a pain either way.

kushalkolar commented 6 months ago

A 13GB raw tiff file with 64GB of RAM is probably pushing it; I would strongly recommend trying the online algorithm. If you really need the offline algorithm, I would reduce n_processes to a handful.

   81873 [movies.py:                load():1332][247231] Your tif file is saved a single pagefile. Performance will be affected

This is just a warning that the file is saved as a single-page tiff rather than a multi-page tiff.
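For the offline route, the knob is the n_processes argument passed when the cluster is set up; something like the following, where the value is only illustrative:

    import caiman as cm

    # Fewer workers -> smaller peak memory footprint (each worker holds
    # its own chunk of the movie), at the cost of speed. The default
    # uses (almost) all cores.
    c, dview, n_processes = cm.cluster.setup_cluster(
        backend='multiprocessing',
        n_processes=4,
        single_thread=False)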

Mraymon5 commented 6 months ago

I've played around with the problem a bit more, with interesting results:

I tried converting the file to hdf5 with ImageJ, but ran into an "insufficient memory" error there as well, so I'll have to keep messing around with that. I think ImageJ caps its own memory use, so hopefully that's the problem there. I still haven't tried the online algorithm, so that's still on the docket. I did try downsampling the file just to try to get it to run, with mixed results. I first tried it at ~7gb, which I would have thought should be small enough, but I still got the exact same issues as with 13gb. I finally cut it way down to ~4gb, which had a couple of effects: I stopped getting the warning about the .tif being single-page as opposed to multi-page, and the motion correction completed successfully. However, when I ran the "compute summary images" chunk, the kernel self-destructed again.

I'm going to try a few things:
- Figuring out the hdf5 conversion, to see if that helps
- Trying the online algorithm
- Allocating way more virtual memory on my machine

Is there anything else that I'm missing/ ought to try?

kushalkolar commented 6 months ago

Reduce n_processes


Mraymon5 commented 6 months ago

I did try that also; I even went as far as setting the process # to 1, which didn't seem to change anything.

Mraymon5 commented 6 months ago

I think I've figured out the issue: for some reason my swap file size was set to only 2gb. I dialed that up to 120gb, which seems to have helped quite a bit.
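For reference, a quick way to sanity-check RAM and swap from Python before a run is something like the following (psutil is pulled in as a caiman dependency; `free -h` in a terminal shows the same information):

    import psutil

    # Report total/available RAM and total/free swap, in GiB.
    vm = psutil.virtual_memory()
    sw = psutil.swap_memory()
    print(f"RAM : {vm.total / 2**30:.1f} GiB total, {vm.available / 2**30:.1f} GiB available")
    print(f"Swap: {sw.total / 2**30:.1f} GiB total, {sw.free / 2**30:.1f} GiB free")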

Mraymon5 commented 6 months ago

Okay, I thought that fixed the problem, but now I'm getting some extremely weird behavior. The crashes no longer occur during motion correction; they now usually happen during the "compute summary images" step. The nature of the crash has also changed: instead of the kernel resetting, the whole program just terminates; both Spyder and the python terminal close.

I've tried running it while keeping a close eye on the system monitor, and that has not made things any clearer. I see this strange behavior where Python will get up to about 50% use of the RAM and then drop off, or it will hold steady at 50% even while it's apparently still working hard. At some point the RAM use will start to creep up over that 50% line, and once that happens it rises pretty quickly to 100%, and that's when python closes. Most of the time it seems to not want to use the swap memory at all. Intermittently, with no obvious rhyme or reason, it WILL use the swap, and then everything works fine. I was able to successfully process one session of imaging data that way, but I can't consistently replicate that.

Anecdotally, I've also noticed that the .mmap files that used to be saved in the same folder as the source video are now saving in ~/caiman_data/temp; I'm not sure if that has anything to do with this, or is just an update in caiman's behavior.

pgunn commented 6 months ago

The temporary files now being in caiman_data/temp is a tidiness improvement in the most recent release. There will be more of these to come in the future (including per-run directories), but that won't cause any differences in memory usage.

When your system runs out of memory for user processes, a number of things can happen, but on Linux usually part of the kernel called the oom_killer will try to figure out the best program to kill to keep the system stable.

I'll look at the summary images code tomorrow to see if I can figure out what's going on. I may reach out to ask for more info. Is there any chance I can get a copy of your modified notebook (without clearing cells)? Particularly after you tried setting the number of processes to 1. I'd like to give it a look.

Cheers, Pat

pgunn commented 6 months ago

To be clear, you're seeing the crash during caiman.summary_images.correlation_pnr()? Or later? Do you see the same result if you run the script outside of spyder?

(there is a newer version of the CLI demos that you might or might not be using; it takes JSON config files and has fixes so the demos work again from the command prompt)

Mraymon5 commented 6 months ago

Okay, I ran some tests running the code line-by-line, and it definitely crashes at caiman.summary_images.correlation_pnr(). However, if I skip that chunk and go straight to CNMF, it still crashes in the same way at caiman.source_extraction.CNMF().

There's this weird behavior where python starts to ramp up RAM usage, hits 50%, and stops. To indulge in a little anthropomorphizing, it looks like it really wants to not use more than 50% of RAM, and REALLY wants to not use swap. Like, when it hits ~50%, it will pull back (maybe starts writing things to the mmap?). This behavior is also evident in the motion correction step, but for some reason it manages to pull it off there: [screenshot: MemUseMotionCorrect]

But it fails at the more memory-intensive steps. So, when I run summary_images.correlation_pnr(), I see that same trace: [screenshots: MemUseCNMF_1, MemUseCNMF_2]

It holds at 50% for a while, as if it really doesn't want to go higher, but eventually the task gets too heavy, memory use climbs past 50% to 100%, and once it's forced to start seriously using swap, it just crashes instead.

I also wanted to see if it was specifically a problem with MY memory and OS, so I just opened up a ton of stuff in ImageJ, and did not see that same behavior. As I opened more stuff, RAM use just climbed steadily to 100%, then started using swap, and never crashed.

I tried changing swappiness in the terminal before running python, and that actually worked, once. I was able to run Caiman all the way through one file, and didn't see any of that weird memory behavior; it didn't stall at 50% RAM, it just went straight to 100%, then started using swap, and never crashed. Then I started running a different file, and the same behavior came right back, and changing swappiness didn't do anything to alter it.

Mraymon5 commented 6 months ago

Here's the script that I modified. It also references edits that I made to Visualization, so I've attached that also. Neither of them are doing anything particularly fancy; it's all small quality-of-life stuff.

Caiman_Adapted.txt Edited_Visualization.txt

pgunn commented 6 months ago

And when it crashes that way, if you do dmesg, does it show the oom killer doing it, maybe with a stack trace?

Have you tried running it not under spyder, meaning really from the CLI?

Mraymon5 commented 6 months ago

When it crashes, it has been fully closing both Spyder and the python terminal, which I think may be interfering with the dmesg logging; I've got two events that may be associated with crashes (below), but I've run the pipeline through to a crash a few times now, and most of the time nothing is added to the dmesg log when I reload the python terminal. I'm not super comfortable in python, so I've avoided trying to run things outside of spyder, but I'll work on that now.

Here are the maybe-crash-related messages from dmesg:

[241508.058449] workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
[245714.136866] perf: interrupt took too long (6204 > 6157), lowering kernel.perf_event_max_sample_rate to 32000

Mraymon5 commented 6 months ago

Okay, I've tried running it directly in the terminal now, and the behavior is exactly the same as running it in spyder: it gets through motion correction alright, then starts correlation_pnr(), then crashes. Watching the system monitor, the memory use behavior is exactly the same also: motion correction hits but never passes 50% ram, and then correlation_pnr() pulls RAM to 50%, plateaus for a while, rises slowly to 100%, and then crashes. No messages were added to dmesg when I reloaded the terminal.

Maddeningly, I was able to process two imaging sessions without any hiccups this morning (including one that had previously drawn all the typical crashes), and then on the 3rd one the crashes started again. I am not aware of changing anything at all to kick off that behavior.

pgunn commented 6 months ago

Thanks; spyder's ram usage and handling of threads sometimes can produce worse results, but it looks like that isn't the case here.

I think the best options at this point are:
A) Switch to the online algorithm
B) Downscale your data
C) Find beefier hardware to run it on

There are tradeoffs with each, but route A is probably the best; online has more recent development and may produce better results, and it's designed to be more RAM-efficient.
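For route A, a minimal sketch of the online (OnACID) path is below; the parameter values are only illustrative and not tuned for your data, and the filename is a placeholder - the online cnmfE demo has a full working configuration.

    from caiman.source_extraction import cnmf

    # Illustrative parameters only: frame rate, gSig, etc. need to
    # match your recording; see the online cnmfE demo for a complete
    # setup.
    params_dict = {
        'fnames': ['your_movie.hdf5'],  # placeholder filename
        'fr': 30,                       # imaging frame rate
        'decay_time': 0.4,
        'gSig': (3, 3),
        'method_init': 'corr_pnr',      # CNMF-E style initialization
        'motion_correct': True,
        'init_batch': 200,              # frames used to initialize the model
    }
    opts = cnmf.params.CNMFParams(params_dict=params_dict)
    cnm = cnmf.online_cnmf.OnACID(params=opts)
    cnm.fit_online()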

Mraymon5 commented 6 months ago

I tried another debugging test by running a snippet in python that just used up memory, to see if the issue was exclusive to caiman, and while I didn't see any of the behavior of lingering around 50% RAM (I'm guessing that has something to do with the mmap implementation in caiman?), once my RAM was filled, the swap engaged, and python crashed. So caiman itself doesn't seem to be the culprit, and I'll just have to figure out why my swap file is only working intermittently.
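Roughly the kind of test I mean (not the exact snippet I used): allocate big arrays in a loop until either a MemoryError is raised or the process gets killed.

    import numpy as np

    # Keep allocating ~1 GiB arrays until allocation fails or the
    # process is killed by the OS; a clean MemoryError means Python
    # handled it, an abrupt kill points at the OOM killer / swap issue.
    blocks = []
    try:
        while True:
            blocks.append(np.ones((1024, 1024, 128), dtype=np.float64))  # ~1 GiB
            print(f"allocated ~{len(blocks)} GiB")
    except MemoryError:
        print("hit MemoryError instead of being killed")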

Mraymon5 commented 6 months ago

After a bit more testing, it looks like Python was crashing any time it had to use virtual memory. That wasn't specific to caiman, and in fact didn't seem to be specific to any particular version of python.

Ultimately I ended up wiping my OS and doing a fresh install of Ubuntu 24.04 instead, and that seems to have fixed the problem. Fingers crossed; I've been able to run a good number of sessions without issue.

pgunn commented 6 months ago

I'm glad things are working for you; Python should not generally crash over needing to use virtual memory (unless it needs more than is possible); a larger swap partition can help of course.

Remember to give the online algorithm a try if you can.

Cheers, Pat