flatironinstitute / CaImAn

Computational toolbox for large scale Calcium Imaging Analysis, including movie handling, motion correction, source extraction, spike deconvolution and result visualization.
https://caiman.readthedocs.io
GNU General Public License v2.0

CNMF-E Online kills kernel #678

Closed mtugsbayar closed 4 years ago

mtugsbayar commented 4 years ago

Tell us a bit about your setup:

  1. Operating system (Linux/macOS/Windows): Linux
  2. Python version (3.x): 3.7.1
  3. Working environment (Python IDE/Jupyter Notebook/other): Jupyter Notebook
  4. Which of the demo scripts you're using for your analysis (if applicable): demo_online_CNMF-E
  5. CaImAn version*: 1.7
  6. CaImAn installation process (pip install ./pip install -e ./conda): pip install -e .

The kernel invariably dies on reaching cnm_online.fit_online(), regardless of whether I use the provided demo videos or my own.

The last log line before the kernel dies is always: 1501827 [online_cnmf.py:fit_online():1130] [11453] Now processing file /home/ubuntu/caiman_data/example_movies/v1_bear.tiff

The 2p OnACID demo works normally. Erasing and reconfiguring the environment doesn't solve the problem. I previously had a problem with launching Jupyter which I resolved by pip installing environment_manager.

epnev commented 4 years ago

@mtugsbayar Are there any logging messages before the kernel crash?

epnev commented 4 years ago

Sorry, just saw you had this info already included. Will take a look.

pgunn commented 4 years ago

Can you run the dmesg command after such a crash and see if anything interesting that looks relevant to the crash is present?

mtugsbayar commented 4 years ago

I ran dmesg --level=err,warn after a crash and got this:

[ 1.181351] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[ 1.195173] #5
[ 1.198447] #6
[ 1.199852] #7
[ 1.305654] PCCT header not found.
[ 1.374939] acpi PNP0A03:00: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.
[ 1.613363] ACPI: Enabled 16 GPEs in block 00 to 0F
[ 2.064043] ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 11
[ 2.181169] i8042: Warning: Keylock active
[ 7.753781] systemd[1]: [/lib/systemd/system/ibacm.socket:33] Failed to parse address value, ignoring: rdma 4
[ 7.800534] systemd[1]: [/etc/systemd/system/snap-core-8268.mount:10] Unknown lvalue 'LazyUnmount' in section 'Mount'
[ 7.808238] systemd[1]: [/etc/systemd/system/snap-amazon\x2dssm\x2dagent-1480.mount:10] Unknown lvalue 'LazyUnmount' in section 'Mount'
[ 7.816878] systemd[1]: [/etc/systemd/system/snap-core-8213.mount:10] Unknown lvalue 'LazyUnmount' in section 'Mount'
[ 8.407353] nvidia: loading out-of-tree module taints kernel.
[ 8.407357] nvidia: module license 'NVIDIA' taints kernel.
[ 8.407358] Disabling lock debugging due to kernel taint
[ 8.422360] ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 10
[ 8.466710] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 418.87.01 Wed Sep 25 06:00:38 UTC 2019
[ 15.055563] cgroup: new mount options do not match the existing superblock, will be ignored
[ 16.289360] NVRM: Persistence mode is deprecated and will be removed in a future release. Please use nvidia-persistenced instead.

I also ran jupyter with --debug. The only interesting thing I saw before the kernel restart was this:

OpenCV: FFMPEG: tag 0x34363248/'H264' is not supported with codec id 27 and format 'mp4 / MP4 (MPEG-4 Part 14)'
OpenCV: FFMPEG: fallback to use tag 0x31637661/'avc1'
[D 18:57:33.286 NotebookApp] activity on 0ae7ff88-b1ed-4aec-8ebc-b61ac1cc3468: stream

epnev commented 4 years ago

@mtugsbayar can you paste the value of opts.params.online right before you call the fit_online function?

mtugsbayar commented 4 years ago

opts.params.online returns an error. cnm.params.online returns:

'N_samples_exceptionality': 12, 'batch_update_suff_stat': False, 'dist_shape_update': False, 'ds_factor': 1, 'epochs': 1, 'expected_comps': 500, 'full_XXt': False, 'init_batch': 1000, 'init_method': 'bare', 'iters_shape': 5, 'max_comp_update_shape': inf, 'max_num_added': 5, 'max_shifts_online': 10, 'min_SNR': 1.5, 'min_num_trial': 5, 'minibatch_shape': 100, 'minibatch_suff_stat': 5, 'motion_correct': True, 'movie_name_online': '/home/ubuntu/caiman_data/example_movies/online_movie.mp4', 'normalize': False, 'n_refit': 0, 'num_times_comp_updated': inf, 'opencv_codec': 'H264', 'path_to_model': '/home/ubuntu/caiman_data/model/cnn_model_online.h5', 'rval_thr': 0.85, 'save_online_movie': False, 'show_movie': False, 'simultaneously': False, 'sniper_mode': False, 'test_both': False, 'thresh_CNN_noisy': 0.5, 'thresh_fitness_delta': -50, 'thresh_fitness_raw': -60.97977932734429, 'thresh_overlap': 0.5, 'update_freq': 200, 'update_num_comps': True, 'use_corr_img': False, 'use_dense': True, 'use_peak_max': True, 'W_update_factor': 1

epnev commented 4 years ago

@mtugsbayar I checked the notebook and I think I know what's going on. The parameter setting in the notebook is such that a movie will get created and displayed during the online processing. For some reason this seems to break your kernel based on the error log. Change the setting such that you have (with the variable naming used in the notebook)

online_opts.online['show_movie'] = False

in the params before you call the fit_online method. That should do it.

mtugsbayar commented 4 years ago

Thank you so much! That does fix the kernel death problem.

I'm running into another problem with the demo movie. It seems to stop iterating prematurely, finding only the neurons in the top-left quadrant of the movie. Should I discuss it here, or should I open another issue?

The error log is:

StopIteration Traceback (most recent call last)
~/CaImAn/caiman/base/movies.py in load_iter(file_name, subindices, var_name_hdf5)
   2098         cap.release()
-> 2099         raise StopIteration
   2100     elif extension in ('.hdf5', '.h5'):

StopIteration:

The above exception was the direct cause of the following exception:

RuntimeError Traceback (most recent call last)
in
      1 cnm_online = cnmf.online_cnmf.OnACID(params=online_opts, dview=dview)
----> 2 cnm_online.fit_online()

~/CaImAn/caiman/source_extraction/cnmf/online_cnmf.py in fit_online(self, **kwargs)
   1137 while True: # process each file
   1138     try:
-> 1139         frame = next(Y_)
   1140         frame_count += 1
   1141         t_frame_start = time()

RuntimeError: generator raised StopIteration

epnev commented 4 years ago

@mtugsbayar I haven't seen that one before, but perhaps it's related to this: https://stackoverflow.com/questions/51700960/runtimeerror-generator-raised-stopiteration-every-time-i-try-to-run-app. If you want, try the solution there; otherwise I'll try to reproduce this tomorrow.

epnev commented 4 years ago

@mtugsbayar Do you get this error on the file provided in the demo or your own file (or both)?

mtugsbayar commented 4 years ago

I got this error on the provided demo file repeatedly.

I tried it on our data as well. It didn't crash, but it's been 4 hours and my video is still stuck on online processing. The log said it was processing my file three times.

mtugsbayar commented 4 years ago

Update: it actually works on custom data, just takes a very long time. Are there RAM requirements that I should be aware of?

pgunn commented 4 years ago

How much RAM do you have?

mtugsbayar commented 4 years ago

32GB. I have a GPU as well, but I don't think CaImAn uses that yet?

epnev commented 4 years ago

No, that should be enough. I suspect an issue with the iterator and Python 3.7. Will have to check.

@j-friedrich Have you encountered this?

j-friedrich commented 4 years ago

Didn’t run into it myself, but managed to reproduce the error for python 3.7 and avi files.

Found this at https://docs.python.org/3/whatsnew/3.7.html: PEP 479 (https://www.python.org/dev/peps/pep-0479) is enabled for all code in Python 3.7, meaning that StopIteration (https://docs.python.org/3/library/exceptions.html#StopIteration) exceptions raised directly or indirectly in coroutines and generators are transformed into RuntimeError (https://docs.python.org/3/library/exceptions.html#RuntimeError) exceptions.

I.e., the StopIteration in line 1224 of online_cnmf.py needs to be replaced by RuntimeError. For backward compatibility, it also needs to be replaced in the load_iter function of movies.py (lines 2076, 2097, 2099).

That solved the error for me.
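To illustrate the PEP 479 behavior described above, here is a minimal sketch that does not use CaImAn at all. The names `legacy_frames` and `describe_exhaustion` are hypothetical and exist only for this example; the generator merely imitates the pattern of raising StopIteration explicitly to signal the end of a movie.

```python
def legacy_frames(frames):
    """Hypothetical old-style generator that signals exhaustion by raising."""
    for frame in frames:
        yield frame
    # Under PEP 479 (always on from Python 3.7) this is converted
    # into "RuntimeError: generator raised StopIteration".
    raise StopIteration


def describe_exhaustion(iterator):
    """Drain an iterator and report which exception ended the iteration."""
    try:
        while True:
            next(iterator)
    except StopIteration:
        return "StopIteration"
    except RuntimeError as err:
        return f"RuntimeError: {err}"


# A plain iterator ends with StopIteration; the legacy generator does not.
print(describe_exhaustion(iter([1, 2, 3])))
print(describe_exhaustion(legacy_frames([1, 2, 3])))
```

On Python 3.7 or later the second call reports a RuntimeError, which matches the traceback posted earlier in this thread.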

epnev commented 4 years ago

@j-friedrich That's a bit weird. In my case I had to change the movies.py file and use return instead of raising a StopIteration exception. However, when I modified the online_cnmf.py file I ran into a different error.
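As a rough sketch of the "use return instead of raising StopIteration" approach, assuming a simplified stand-in for the real reader: `load_frames` and its `stop_after` parameter are hypothetical and are not the actual load_iter code in movies.py.

```python
def load_frames(frames, stop_after=None):
    """Hypothetical frame generator that ends with return rather than raise."""
    for index, frame in enumerate(frames):
        if stop_after is not None and index >= stop_after:
            # A bare return finishes the generator cleanly; the caller's
            # next() then sees an ordinary StopIteration on any Python 3.
            return
        yield frame


collected = list(load_frames(["f0", "f1", "f2"], stop_after=2))
print(collected)  # ['f0', 'f1']
```

Because the generator ends through return, PEP 479 never comes into play and a consumer that catches StopIteration keeps working unchanged.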

j-friedrich commented 4 years ago

Oops, I hadn't checked this properly: changing the explicitly raised exceptions to RuntimeError works for avi files but breaks tif files, where a StopIteration is still raised implicitly.

Changing only the except clause in online_cnmf.py to except (StopIteration, RuntimeError): works for avi and tif files on both Python 3.6 and Python 3.7.
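A minimal sketch of that consumer-side change, assuming a simplified loop: `count_frames` and `explicit_source` are hypothetical names for illustration and do not reproduce the actual fit_online code.

```python
def explicit_source(n):
    """Hypothetical reader that raises StopIteration explicitly (avi-like case)."""
    for i in range(n):
        yield i
    raise StopIteration  # becomes RuntimeError on Python 3.7+


def count_frames(frame_iter):
    """Consume frames until the source is exhausted, accepting either signal."""
    frame_count = 0
    while True:
        try:
            next(frame_iter)
        except (StopIteration, RuntimeError):
            # StopIteration: the iterator ended implicitly (tif-like case).
            # RuntimeError: PEP 479 converted an explicit StopIteration.
            break
        frame_count += 1
    return frame_count


print(count_frames(iter(range(4))))      # implicit end of iteration
print(count_frames(explicit_source(3)))  # explicit raise inside a generator
```

One trade-off of this approach is that the broad RuntimeError catch can also hide unrelated runtime errors raised while reading a frame.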

epnev commented 4 years ago

@mtugsbayar A fix is on the dev branch if you want to try it out. Will merge into master later today.

epnev commented 4 years ago

The fix is now on master. Please update your code and let us know whether it worked for you.

mtugsbayar commented 4 years ago

Hello! Sorry for the delay! The demo works fine now on the intended movie, and it seems reasonably fast compared to before.

Also, for reference, I think my original kernel death issue was caused by faulty X11 forwarding. I'm running the demo on a remote machine, so when show_movie tries to forward the movie to my local computer and Xming is not initialized, it kills my kernel.
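For anyone running the notebook remotely, a rough guard along these lines could decide whether to enable show_movie. This is only a sketch: `display_available` is a hypothetical helper, and checking the DISPLAY environment variable is a heuristic. With ssh X11 forwarding, DISPLAY can be set even when the local X server (e.g. Xming) is not running, so it would not catch every case.

```python
import os


def display_available(env=None):
    """Heuristic: report whether a non-empty DISPLAY variable is set."""
    env = os.environ if env is None else env
    return bool(env.get("DISPLAY"))


show_movie = display_available()
print(f"show_movie would be set to {show_movie}")
# In the notebook (variable naming as used there), this would become:
# online_opts.online['show_movie'] = show_movie
```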