flatironinstitute / CaImAn

Computational toolbox for large scale Calcium Imaging Analysis, including movie handling, motion correction, source extraction, spike deconvolution and result visualization.
https://caiman.readthedocs.io
GNU General Public License v2.0

Cryptic errors when CNMFE params are incompatible #1185

Open · kushalkolar opened this issue 11 months ago

kushalkolar commented 11 months ago

Troubleshooting the mesmerize-core CI, I ran into this; reproduced on the CNMFE demo notebook by setting gSig = (10, 10) and gSiz = (41, 41), with everything else the same. These params are quite far from ideal for this movie, but it's weird that a pickling error is thrown:

The reason is that gSiz > rf, but the resulting error is very cryptic. PR incoming...
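A sketch of the kind of early check that would surface this (validate_patch_params is a hypothetical helper; the actual check added in the PR may differ):

# Hypothetical early sanity check; not the actual code from the PR.
def validate_patch_params(gSiz, rf):
    if rf is not None and max(gSiz) > rf:
        raise ValueError(
            f"gSiz={gSiz} exceeds the patch half-size rf={rf}; "
            "increase rf or decrease gSiz so a patch can contain a neuron."
        )

validate_patch_params(gSiz=(41, 41), rf=40)  # raises a clear ValueError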

Side note: dview.map_async(cnmf_patches, args_in).get(4294967) passes the magic number 4294967; curious, is that a chunk size?

---------------------------------------------------------------------------
MaybeEncodingError                        Traceback (most recent call last)
Cell In[32], line 2
      1 cnm = cnmf.CNMF(n_processes=n_processes, dview=dview, Ain=Ain, params=opts)
----> 2 cnm.fit(images)

File ~/repos/CaImAn/caiman/source_extraction/cnmf/cnmf.py:594, in CNMF.fit(self, images, indices)
    589 if not isinstance(images, np.memmap):
    590     raise Exception(
    591         'You need to provide a memory mapped file as input if you use patches!!')
    593 self.estimates.A, self.estimates.C, self.estimates.YrA, self.estimates.b, self.estimates.f, \
--> 594     self.estimates.sn, self.estimates.optional_outputs = run_CNMF_patches(
    595         images.filename, self.dims + (T,), self.params,
    596         dview=self.dview, memory_fact=self.params.get('patch', 'memory_fact'),
    597         gnb=self.params.get('init', 'nb'), border_pix=self.params.get('patch', 'border_pix'),
    598         low_rank_background=self.params.get('patch', 'low_rank_background'),
    599         del_duplicates=self.params.get('patch', 'del_duplicates'),
    600         indices=indices)
    602 self.estimates.bl, self.estimates.c1, self.estimates.g, self.estimates.neurons_sn = None, None, None, None
    603 logging.info("merging")

File ~/repos/CaImAn/caiman/source_extraction/cnmf/map_reduce.py:250, in run_CNMF_patches(file_name, shape, params, gnb, dview, memory_fact, border_pix, low_rank_background, del_duplicates, indices)
    248 if dview is not None:
    249     if 'multiprocessing' in str(type(dview)):
--> 250         file_res = dview.map_async(cnmf_patches, args_in).get(4294967)
    251     else:
    252         try:

File /usr/lib/python3.11/multiprocessing/pool.py:774, in ApplyResult.get(self, timeout)
    772     return self._value
    773 else:
--> 774     raise self._value

MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f2db6da91d0>'. Reason: 'PicklingError("Can't pickle <class '_flapack.error'>: import of module '_flapack' failed")'
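The masking mechanism, for reference: the real exception raised inside the worker can't be pickled back to the parent process (here because importing _flapack fails during unpickling), so multiprocessing replaces it with MaybeEncodingError. A minimal standalone repro of the same masking (not CaImAn code; the locally-defined exception class stands in for _flapack.error):

from multiprocessing import Pool

def boom(_):
    # a class defined inside a function can't be pickled by reference,
    # standing in for the _flapack.error that fails to import
    class LocalError(Exception):
        pass
    raise LocalError("the real, informative error")

if __name__ == "__main__":
    with Pool(2) as pool:
        # raises MaybeEncodingError with Reason: PicklingError(...),
        # hiding the LocalError that actually occurred in the worker
        pool.map_async(boom, range(2)).get()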
kushalkolar commented 11 months ago

And there's another one I'm trying to figure out!

Traceback (most recent call last):
  File "/home/kushal/repos/mesmerize-core/mesmerize_core/algorithms/cnmfe.py", line 89, in run_algo
    cnm = cnm.fit(images)
          ^^^^^^^^^^^^^^^
  File "/home/kushal/repos/CaImAn/caiman/source_extraction/cnmf/cnmf.py", line 594, in fit
    self.estimates.sn, self.estimates.optional_outputs = run_CNMF_patches(
                                                         ^^^^^^^^^^^^^^^^^
  File "/home/kushal/repos/CaImAn/caiman/source_extraction/cnmf/map_reduce.py", line 476, in run_CNMF_patches
    B_tot = np.array(B_tot, dtype=np.float32)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: float() argument must be a string or a real number, not 'coo_matrix'

EDIT: So this happens if nb=0 is not explicitly specified. Weird because our tests were passing in March without this being explicitly specified.

EricThomson commented 11 months ago

I have often wondered about that 4_294_967. It would be good to document that line of code, in particular that magic number. I finally took the time to research it, looking things up here: https://docs.python.org/3/library/multiprocessing.html and https://superfastpython.com/multiprocessing-pool-map_async/

For people reading this (e.g., me) in the future: map_async() is a non-blocking function (as opposed to map()). get() returns the result, and its argument says how long to wait in seconds before raising a multiprocessing.TimeoutError. Setting it to something that large (basically 50 days!) effectively says go forever. The thing is, we could just omit the argument and get the same behavior. Or we could be more explicit and pass timeout=XX so it is clearer.
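A minimal standalone sketch of those semantics (standard library only, not CaImAn code):

from multiprocessing import Pool, TimeoutError

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(2) as pool:
        result = pool.map_async(square, range(4))  # returns immediately
        try:
            # blocks for up to 5 seconds, then raises TimeoutError;
            # get() with no argument (timeout=None) blocks forever
            print(result.get(timeout=5))
        except TimeoutError:
            print("workers took longer than 5 s")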

Anyway, glad you brought this up, because that has always confused me but I never took the time to actually research it; I assumed it was some memory management thing :smile:

pgunn commented 11 months ago

Probably worth removing the argument, now that we understand what it is.

kushalkolar commented 11 months ago

Ah ok! It's timeout, not chunk_size! :smile:

Maybe the reason they used 4294967 is that it's 2^32 milliseconds expressed in seconds, i.e. the largest timeout a 32-bit millisecond counter could hold before overflowing?
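The arithmetic, for reference:

2**32 / 1000     # 4294967.296 seconds
4294967 / 86400  # ~49.7 days, i.e. the "basically 50 days" above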

Maybe that timeout exists for a reason; perhaps there's actually some delay in getting the output from map_async, and the arg was put there to make sure the result is eventually retrieved.

I've sometimes heard of people who've had CNMF hang indefinitely, maybe this could've been the source?

Anyways, maybe we want to keep that arg but make it something reasonable, like 1 minute? Or a few minutes to be safe? run_CNMF_patches is deeply nested, so I kinda don't want to benchmark that thing to figure out what the real get() times are :)

EricThomson commented 11 months ago

I kinda don't want to benchmark that thing to figure out what the real get() times are

🤣 I will be the benchmarking hero.

jk, nope. It will always be a guessing game; that's probably what they did: pick the most unreasonably long time so it will never get a complaint :smile: Unless Pat does some AWS magic profiling with it.

kushalkolar commented 11 months ago

I have found an issue where it indeed does hang here, but it's from 2018 so not sure how applicable it is: https://github.com/flatironinstitute/CaImAn/issues/301

Note: that person's real issue was bizarre (Python 2 and 3 mixed; how is that even possible?), but maybe get() will just hang in weird situations.

Anyways, multiprocessing raises a TimeoutError if the timeout is exceeded, so we could do something reasonable like 5 minutes and leave it at that?
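Concretely, something like this in map_reduce.py (a sketch only; the 5-minute value and the message wording are illustrative):

import multiprocessing

PATCH_TIMEOUT_S = 300  # illustrative; actual value still to be decided

try:
    file_res = dview.map_async(cnmf_patches, args_in).get(timeout=PATCH_TIMEOUT_S)
except multiprocessing.TimeoutError:
    raise RuntimeError(
        f"Patch processing did not finish within {PATCH_TIMEOUT_S} s; "
        "a worker may be stuck or may have died silently."
    )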

pgunn commented 11 months ago

I don't think a shorter timeout really solves anything (probably); it'd just give us a chance to spit out some kind of "I don't know what happened but things broke in this here code" message. It's probably always a bug if we pick a timeout and it goes over.

kushalkolar commented 11 months ago

Yea, rather than hanging, it'd basically be an indicator of "this is taking much longer than it should, something is probably wrong", versus the user having to do a manual keyboard interrupt because they noticed it was running for too long.

kushalkolar commented 11 months ago

So the gSiz > rf case is caught in #1186, but we still need to decide on a timeout.

ethanbb commented 1 month ago

Looking back at the error:

TypeError: float() argument must be a string or a real number, not 'coo_matrix'

It seems this happens when nb != 0 AND low_rank_background is False (the default is True):

# in map_reduce.py
if count_bgr == 0:
    # corresponds to nb == 0
    # OK
    ...
elif low_rank_background is None:
    # OK
    ...
elif low_rank_background:
    # OK
    ...
else:
    ...
    # the error happens here (lines 456-457); B_tot starts out as a scipy.sparse.csc_matrix:
    B_tot /= nB
    B_tot = np.array(B_tot, dtype=np.float32)

Maybe scipy.sparse used to support that, I don't know. But to fix it, assuming a dense array is actually necessary there, we should just replace that last line with B_tot = B_tot.toarray().astype(np.float32).
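A standalone repro of the failure and the proposed fix (shapes are arbitrary; this is not CaImAn code):

import numpy as np
from scipy import sparse

# stand-in for B_tot; any sparse matrix reproduces the failure
B_tot = sparse.coo_matrix(np.ones((4, 3)))
B_tot = B_tot / 2.0  # scalar stand-in for the nB division; result stays sparse

# np.array(B_tot, dtype=np.float32) raises:
#   TypeError: float() argument must be a string or a real number, not 'coo_matrix'
B_tot = B_tot.toarray().astype(np.float32)  # the proposed fix
print(B_tot.shape, B_tot.dtype)  # (4, 3) float32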