flatironinstitute / CaImAn

Computational toolbox for large scale Calcium Imaging Analysis, including movie handling, motion correction, source extraction, spike deconvolution and result visualization.
https://caiman.readthedocs.io
GNU General Public License v2.0

Out of memory with demo_pipeline_cnmfE.py #1016

Open restrepd opened 1 year ago

restrepd commented 1 year ago

For better support, please use the template below to submit your issue. When your issue gets resolved please remember to close it.

Sometimes errors while running CNMF occur during parallel processing, which prevents the log from providing a meaningful error message. Please reproduce your error with dview=None set.

If you need to upgrade CaImAn follow the instructions given in the documentation.

*You can get the CaImAn version by creating a params object and then typing params.data['caiman_version']. (If the field doesn't exist, type N/A and consider upgrading.)
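
A minimal sketch of that version check (assuming the field is present on your install):

```python
from caiman.source_extraction.cnmf.params import CNMFParams

opts = CNMFParams()
# field referenced by the issue template; may be missing on older installs
print(opts.data.get('caiman_version', 'N/A'))
```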

We have no problem running the demo data ("data_endoscope.tif")

We also have the same problem every once in a while running demo_pipeline.py

Thanks!

dmesg output:

[3286529.735885] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-1000.slice/session-c3.scope,task=python,pid=275594,uid=1000
[3286529.735937] Out of memory: Killed process 275594 (python) total-vm:154079620kB, anon-rss:57028920kB, file-rss:3092kB, shmem-rss:0kB, UID:1000 pgtables:222160kB oom_score_adj:0
[3286532.242727] oom_reaper: reaped process 275594 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:8kB

kushalkolar commented 1 year ago

how big is your movie and how much RAM does the computer have?

restrepd commented 1 year ago

The miniscope .avi movie is 9.6 GB, 600x600 pixels x 53610 images.

grep MemTotal /proc/meminfo
MemTotal: 263733696 kB

By the way, using htop we found that part of the problem was residual memory use left over after CaImAn processing with demo_pipeline.py. We cleared these leftover processes with sudo pkill -KILL -u {username}. Now processing of 2P data with demo_pipeline.py works well, and we always check for residual memory usage.

However, we still have the problem when processing the 9.6 GB miniscope .avi file with demo_pipeline_cnmfE.py. When we run it we monitor memory usage with htop, and usage keeps increasing until it goes above 264 GB, at which point the program crashes.

If we crop the miniscope .avi file to 400x400x53610, processing with demo_pipeline_cnmfE.py works well.
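
For reference, a minimal sketch of that cropping step (the file name and crop coordinates here are illustrative; we assume caiman.load can read the .avi directly):

```python
import caiman as cm

# load the miniscope movie as (frames, height, width); note that for a 9.6 GB
# .avi the decoded array itself needs a large amount of RAM
m = cm.load('miniscope_movie.avi')

# keep a 400x400 pixel region of interest and save it for demo_pipeline_cnmfE.py
m_cropped = m[:, 100:500, 100:500]
m_cropped.save('miniscope_movie_cropped.tif')
```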

Incidentally, in this troubleshooting we are also monitoring GPU usage with nvtop. There is no GPU usage. I wonder if I have a problem with tensorflow.
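
A quick sanity check for that, independent of CaImAn (standard tensorflow calls only):

```python
import tensorflow as tf

# an empty list here means tensorflow is running CPU-only
# (e.g. CUDA/cuDNN libraries not found), which would match the lack of nvtop activity
print(tf.config.list_physical_devices('GPU'))
print(tf.test.is_built_with_cuda())
```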

Best regards,

Diego


pgunn commented 1 year ago

Where in the pipeline does this happen? In particular, I wonder if it happens before the conversion to memmap files or not.

restrepd commented 1 year ago

It has happened both before and after the conversion to memmap files, in different instances.


andparra commented 1 year ago

We ended up getting this to work on another workstation. The details below may give people a rough idea of what it takes to avoid breaking files up:

Tell us a bit about your setup:

Operating system (Linux/macOS/Windows): Ubuntu 22.04.1 LTS

Python version (3.x): Python 3.10.6

Working environment (Python IDE/Jupyter Notebook/other): Python IDE (same problem with Jupyter notebook)

Which of the demo scripts you're using for your analysis (if applicable): demo_pipeline_cnmfE.py

CaImAn version*: caiman 1.9.11

CaImAn installation process (pip install ./pip install -e ./conda): pip install -e .

Tensorflow recognized our GPU (A6000, 48 GB) outside of CaImAn, but inside CaImAn the GPU was never used. We did get the slightly modified demo_pipeline_cnmfE.py to run successfully by adding a second swap partition in Ubuntu (the first was a small 2 GB swap partition on the OS drive), placed on an NVMe drive completely separate from the OS drive. The purpose was to see how much swap this .avi file needs.

Once RAM usage reached 98%, the swap partition kicked in and steadily grew to 335 GB! The file took roughly an hour to finish. We did not monitor the code to see which part of the pipeline drove the memory use, although we will attempt to document this.

kushalkolar commented 1 year ago

A 10 GB .avi file will be represented as a much larger binary array in RAM. In any case, if RAM usage is still high after memmap creation, it will be faster to reduce the number of threads you're using for CNMF: swapping, even on very fast PCIe 4.0 NVMe SSDs in RAID0, is much slower than RAM, so it is usually better to just use fewer threads.
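
For a rough sense of scale, and a minimal sketch of capping the worker count when setting up the cluster (assuming the decoded movie is held as float32; the backend name varies between CaImAn versions):

```python
import caiman as cm

# back-of-envelope size of the decoded 600 x 600 x 53610 movie as float32
n_frames, h, w = 53610, 600, 600
print(f"{n_frames * h * w * 4 / 1e9:.0f} GB")  # ~77 GB vs. the 9.6 GB compressed .avi

# fewer workers means fewer simultaneous in-memory copies of patch data
c, dview, n_processes = cm.cluster.setup_cluster(
    backend='multiprocessing',  # may be called 'local' in older CaImAn demos
    n_processes=4,              # illustrative; lower it if RAM is tight
    single_thread=False)
```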

RaymondWJang commented 10 months ago

I'm having the exact same issue. My setup is:

  1. Operating System (Linux, MacOS, Windows): Debian GNU/Linux 12 (bookworm)
  2. Hardware type (x86, ARM..) and RAM: x86_64, 62.7GB
  3. Python Version (e.g. 3.9): 3.11.6
  4. Caiman version (e.g. 1.9.12): 1.9.16
  5. Which demo exhibits the problem (if applicable): demo_pipeline_cnmfE.py
  6. How you installed Caiman (pure conda, conda + compile, colab, ..): mamba
  7. Details: It's a set of 24 videos, 1000x1000 pixels with about 24000 frames each. I'm running demo_pipeline_cnmfE.py, bar some video-specific motion-correction and CNMF initialization parameters. I tried various n_processes values, ranging from 23 down to 1; 1 was taking way too long, so I landed on 5 instead. I can post the entire config sets if requested, as well.

My particular silent oom crash occurs during cnmf.fit() -> compute_W(): https://github.com/flatironinstitute/CaImAn/blob/9b0b79ca61f20ce93259b9833e1fe18e26d4e086/caiman/source_extraction/cnmf/initialization.py#L2021

Here, the script just dies in complete silence; I kept coming back to find the whole process had vanished. I did a line-by-line pdb session with htop open and watched the memory overflow, the process get killed, and the memory get purged.

Also, there is an if-else block right above it that attempts to do something about memory management: https://github.com/flatironinstitute/CaImAn/blob/9b0b79ca61f20ce93259b9833e1fe18e26d4e086/caiman/source_extraction/cnmf/initialization.py#L1984 but the data_fits_in_memory variable never gets set anywhere else, and it defaults to True in the parameter declaration, so the data_fits_in_memory == False branch is never reached. I'm not entirely sure what the history is here, but when I forced the parameter to False to run that branch, the same silent OOM crash occurred on line 2021 again.

The main issue for me is that it raises absolutely no alarms at the process level unless you check dmesg. Maybe there could be some kind of pre-check for whether the system has sufficient memory for the upcoming step?

I'm happy to draft a PR if someone could point me to a relevant part of the code.
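
Something like this rough sketch is what I have in mind (using psutil, with a naive one-copy float32 estimate of the data size; a real check would also have to account for patches, the number of processes, and the W matrix):

```python
import numpy as np
import psutil

def enough_memory(dims, n_frames, dtype=np.float32, safety_factor=2.0):
    """Rough pre-check: will the movie (plus some slack) fit in available RAM?"""
    needed = int(np.prod(dims)) * n_frames * np.dtype(dtype).itemsize * safety_factor
    available = psutil.virtual_memory().available
    if needed > available:
        print(f"Estimated {needed / 1e9:.1f} GB needed but only "
              f"{available / 1e9:.1f} GB available; consider more patches, "
              f"downsampling, or the online (OnACID) pipeline.")
        return False
    return True

# my case: 1000x1000 pixel frames, ~24000 frames per video
enough_memory((1000, 1000), 24000)
```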

EricThomson commented 10 months ago

This is a really good idea: figure out something more principled than "give it a shot and run the online algorithm when it fails". That is, we should have a principled way to actually calculate the RAM needs for CNMF and CNMFE based on your file size and parameter settings (patches, subsampling, etc.), and report whether you will need to run the online algorithm instead. My guess is this isn't an entirely trivial calculation, otherwise we'd have already done it.
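
Just to sketch the kind of calculation I mean (very hand-wavy; the overhead factor and the patch handling here are guesses, and the CNMF-E ring model, overlaps, and per-process copies would all need to be modeled properly):

```python
def rough_ram_gb(n_frames, d1, d2, ssub=1, tsub=1,
                 n_processes=1, patch_px=None, overhead=3.0):
    """Very rough upper bound in GB, assuming float32 data."""
    frames = n_frames // tsub
    if patch_px is None:
        # no patches: at least one full copy of the (downsampled) movie in RAM
        pixels = (d1 // ssub) * (d2 // ssub)
        copies = 1
    else:
        # with patches: each worker holds roughly its own patch (overlap ignored)
        pixels = patch_px * patch_px
        copies = n_processes
    return frames * pixels * 4 * copies * overhead / 1e9

# the 600 x 600 x 53610 movie from this thread, no patches: ~230 GB
print(rough_ram_gb(53610, 600, 600))
# same movie, 5 workers on 80-pixel patches: ~20 GB
print(rough_ram_gb(53610, 600, 600, n_processes=5, patch_px=80))
```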

I'm really busy getting ready for the workshop at SFN this week, but it is something I'm interested in pursuing (or at least having a good answer for why this is too hard to provide across platforms while taking into account the dependency on the number of CPU cores and so on). I'm not sure if you have ideas about this, @pgunn.

pgunn commented 10 months ago

I believe data_fits_in_memory was meant to be a way for the user to signal, as an initialisation option to CNMF, that they know it won't fit. It'd be interesting (but difficult) to make it automatic. It's generally challenging for software to notice beforehand if it's going to run out of RAM - very little software out there does so.