Closed mtugsbayar closed 4 years ago
@mtugsbayar Are there any logging messages before the kernel crash?
Sorry, just saw you had this info already included. Will take a look.
Can you run dmesg after such a crash and check whether it shows anything relevant to the crash?
I ran dmesg --level=err,warn after a crash and got this:
[ 1.181351] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[ 1.195173] #5
[ 1.198447] #6
[ 1.199852] #7
[ 1.305654] PCCT header not found.
[ 1.374939] acpi PNP0A03:00: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.
[ 1.613363] ACPI: Enabled 16 GPEs in block 00 to 0F
[ 2.064043] ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 11
[ 2.181169] i8042: Warning: Keylock active
[ 7.753781] systemd[1]: [/lib/systemd/system/ibacm.socket:33] Failed to parse address value, ignoring: rdma 4
[ 7.800534] systemd[1]: [/etc/systemd/system/snap-core-8268.mount:10] Unknown lvalue 'LazyUnmount' in section 'Mount'
[ 7.808238] systemd[1]: [/etc/systemd/system/snap-amazon\x2dssm\x2dagent-1480.mount:10] Unknown lvalue 'LazyUnmount' in section 'Mount'
[ 7.816878] systemd[1]: [/etc/systemd/system/snap-core-8213.mount:10] Unknown lvalue 'LazyUnmount' in section 'Mount'
[ 8.407353] nvidia: loading out-of-tree module taints kernel.
[ 8.407357] nvidia: module license 'NVIDIA' taints kernel.
[ 8.407358] Disabling lock debugging due to kernel taint
[ 8.422360] ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 10
[ 8.466710] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 418.87.01 Wed Sep 25 06:00:38 UTC 2019
[ 15.055563] cgroup: new mount options do not match the existing superblock, will be ignored
[ 16.289360] NVRM: Persistence mode is deprecated and will be removed in a future release. Please use nvidia-persistenced instead.
I also ran jupyter with --debug. The only interesting thing I saw before the kernel restart was this:
OpenCV: FFMPEG: tag 0x34363248/'H264' is not supported with codec id 27 and format 'mp4 / MP4 (MPEG-4 Part 14)'
OpenCV: FFMPEG: fallback to use tag 0x31637661/'avc1'
[D 18:57:33.286 NotebookApp] activity on 0ae7ff88-b1ed-4aec-8ebc-b61ac1cc3468: stream
@mtugsbayar Can you paste the value of opts.params.online right before you call the fit_online function?
opts.params.online returns an error. cnm.params.online returns:
'N_samples_exceptionality': 12, 'batch_update_suff_stat': False, 'dist_shape_update': False, 'ds_factor': 1, 'epochs': 1, 'expected_comps': 500, 'full_XXt': False, 'init_batch': 1000, 'init_method': 'bare', 'iters_shape': 5, 'max_comp_update_shape': inf, 'max_num_added': 5, 'max_shifts_online': 10, 'min_SNR': 1.5, 'min_num_trial': 5, 'minibatch_shape': 100, 'minibatch_suff_stat': 5, 'motion_correct': True, 'movie_name_online': '/home/ubuntu/caiman_data/example_movies/online_movie.mp4', 'normalize': False, 'n_refit': 0, 'num_times_comp_updated': inf, 'opencv_codec': 'H264', 'path_to_model': '/home/ubuntu/caiman_data/model/cnn_model_online.h5', 'rval_thr': 0.85, 'save_online_movie': False, 'show_movie': False, 'simultaneously': False, 'sniper_mode': False, 'test_both': False, 'thresh_CNN_noisy': 0.5, 'thresh_fitness_delta': -50, 'thresh_fitness_raw': -60.97977932734429, 'thresh_overlap': 0.5, 'update_freq': 200, 'update_num_comps': True, 'use_corr_img': False, 'use_dense': True, 'use_peak_max': True, 'W_update_factor': 1
@mtugsbayar I checked the notebook and I think I know what's going on. The parameter setting in the notebook is such that a movie will get created and displayed during the online processing. For some reason this seems to break your kernel based on the error log. Change the setting such that you have (with the variable naming used in the notebook)
online_opts.online['show_movie'] = False
in the params before you call the fit_online method. That should do it.
Thank you so much! That does fix the kernel death problem.
I'm running into another problem with the demo movie. It seems to stop iterating prematurely, finding only the neurons on the top left quadrant of the movie. Should I discuss it here or should I open another issue?
The error log is:
StopIteration Traceback (most recent call last)
~/CaImAn/caiman/base/movies.py in load_iter(file_name, subindices, var_name_hdf5)
2098 cap.release()
-> 2099 raise StopIteration
2100 elif extension in ('.hdf5', '.h5'):
StopIteration:
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
@mtugsbayar Haven't seen that one before but perhaps it's related to this: https://stackoverflow.com/questions/51700960/runtimeerror-generator-raised-stopiteration-every-time-i-try-to-run-app If you want, try the solution there otherwise I'll try to reproduce this tomorrow.
@mtugsbayar Do you get this error on the file provided in the demo or your own file (or both)?
I got this error on the provided demo file repeatedly.
I tried it on our data as well. It didn't crash, but it's been 4 hours and my video is still stuck on online processing. The log said it was processing my file three times.
Update: it actually works on custom data, just takes a very long time. Are there RAM requirements that I should be aware of?
How much RAM do you have?
32GB. I have a GPU as well, but I don't think CaImAn uses that yet?
No, that should be enough. I suspect some issue with the iterator and Python 3.7. Will have to check.
@j-friedrich Have you encountered this?
Didn’t run into it myself, but managed to reproduce the error for python 3.7 and avi files.
Found this on https://docs.python.org/3/whatsnew/3.7.html: PEP 479 (https://www.python.org/dev/peps/pep-0479) is enabled for all code in Python 3.7, meaning that StopIteration exceptions raised directly or indirectly in coroutines and generators are transformed into RuntimeError exceptions.
I.e. the StopIteration in line 1224 of online_cnmf.py needs to be replaced by RuntimeError. For backward compatibility it also needs to be replaced in the load_iter function of movies.py (lines 2076, 2097, 2099).
That solved the error for me.
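To illustrate the PEP 479 behaviour quoted above, here is a minimal sketch that is independent of CaImAn. The generator and helper names (frames_raising, frames_returning, drain) are made up for this example; on Python 3.7+ an explicit raise StopIteration inside a generator surfaces as RuntimeError, while a plain return ends the generator cleanly.

```python
def frames_raising(n):
    """Hypothetical generator that ends by raising StopIteration explicitly."""
    for i in range(n):
        yield i
    # Under PEP 479 (always on since Python 3.7) this is converted
    # into a RuntimeError when it leaves the generator body.
    raise StopIteration


def frames_returning(n):
    """Same generator, but ending with return, which works on all versions."""
    for i in range(n):
        yield i
    return


def drain(gen):
    """Consume a generator and report how it ended."""
    items = []
    try:
        for item in gen:
            items.append(item)
    except RuntimeError as exc:
        # The original StopIteration is attached as the direct cause.
        return items, "RuntimeError", type(exc.__cause__).__name__
    return items, "clean", None


print(drain(frames_raising(3)))
print(drain(frames_returning(3)))
```

All frames are still yielded in both cases; only the way the generator terminates differs, which matches the "direct cause" wording in the traceback posted earlier in this thread.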
@j-friedrich That's a bit weird. In my case I had to change the movies.py file and use return instead of raising a StopIteration exception. However, when I modified the online_cnmf.py file I ran into a different error.
Oops, I hadn't checked this: changing the explicitly raised exceptions to RuntimeError works for avis but breaks tif files, where a StopIteration is still raised implicitly.
Changing only the except clause in online_cnmf.py to except (StopIteration, RuntimeError): works for avis and tifs on py3.6 and py3.7.
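A rough sketch of why catching both exception types covers both file kinds. This is not the actual online_cnmf.py code; read_all and legacy_generator are hypothetical names, and the plain list iterator merely stands in for a source that ends with an implicit StopIteration.

```python
def legacy_generator(values):
    """Hypothetical stand-in for a loader that raises StopIteration explicitly."""
    for v in values:
        yield v
    raise StopIteration  # becomes RuntimeError on Python 3.7+


def read_all(frame_iter):
    """Pull frames until the source is exhausted, however it signals the end."""
    frames = []
    while True:
        try:
            frame = next(frame_iter)
        except (StopIteration, RuntimeError):
            # StopIteration: normal exhaustion of an iterator.
            # RuntimeError: a generator that raised StopIteration (PEP 479).
            break
        frames.append(frame)
    return frames


print(read_all(iter([1, 2, 3])))               # implicit StopIteration
print(read_all(legacy_generator([1, 2, 3])))   # explicit raise inside a generator
```

One trade-off to keep in mind: catching RuntimeError this broadly would also swallow unrelated RuntimeErrors raised while reading a frame, so ending the generator with return is the cleaner long-term option.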
@mtugsbayar A fix is on the dev branch if you want to try it out. Will merge into master later today.
The fix is now on master. Please update your code and let us know whether it worked for you.
Hello! Sorry for the delay! The demo works fine now on the intended movie, and it seems reasonably fast compared to before.
Also for reference, I think my original kernel death issue was caused by faulty X11 forwarding. I'm running the demo on a remote machine, so when show_movie tries to forward the movie to my local computer and Xming is not initialized, it kills my kernel.
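For anyone else running on a remote machine, a small guard along these lines may help. It is only a sketch: display_available is a hypothetical helper, not part of CaImAn, and it only checks whether a display variable is set. It cannot detect a forwarded DISPLAY whose local X server (e.g. Xming) is not running, which is the exact case described above, so setting show_movie to False by hand remains the safe choice there.

```python
import os
import sys


def display_available(environ=None, platform=None):
    """Best-effort check for a graphical display (hypothetical helper).

    On Linux, a window can only be opened if DISPLAY (X11) or
    WAYLAND_DISPLAY is set. Other platforms are assumed to have a display.
    """
    environ = os.environ if environ is None else environ
    platform = sys.platform if platform is None else platform
    if platform.startswith("linux"):
        return bool(environ.get("DISPLAY") or environ.get("WAYLAND_DISPLAY"))
    return True


# Plain dict standing in for the notebook's online_opts.online settings.
online_settings = {"show_movie": True}
online_settings["show_movie"] = display_available()
print(online_settings)
```

In the notebook, the same value could be assigned to online_opts.online['show_movie'] before calling fit_online.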
Tell us a bit about your setup:
Installation (pip install . / pip install -e . / conda): pip install -e .
Kernel invariably dies after reaching cnm_online.fit_online(). This is regardless of whether I'm using provided demo videos or my own.
Last line before kernel dies is always: 1501827 [online_cnmf.py:fit_online():1130] [11453] Now processing file /home/ubuntu/caiman_data/example_movies/v1_bear.tiff
The 2p OnACID demo works normally. Erasing and reconfiguring the environment doesn't solve the problem. I previously had a problem with launching Jupyter which I resolved by pip installing environment_manager.