kushalkolar opened 1 year ago
And there's another one I'm trying to figure out!
```
Traceback (most recent call last):
  File "/home/kushal/repos/mesmerize-core/mesmerize_core/algorithms/cnmfe.py", line 89, in run_algo
    cnm = cnm.fit(images)
          ^^^^^^^^^^^^^^^
  File "/home/kushal/repos/CaImAn/caiman/source_extraction/cnmf/cnmf.py", line 594, in fit
    self.estimates.sn, self.estimates.optional_outputs = run_CNMF_patches(
                                                         ^^^^^^^^^^^^^^^^^
  File "/home/kushal/repos/CaImAn/caiman/source_extraction/cnmf/map_reduce.py", line 476, in run_CNMF_patches
    B_tot = np.array(B_tot, dtype=np.float32)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: float() argument must be a string or a real number, not 'coo_matrix'
```
EDIT: So this happens if `nb=0` is not explicitly specified. Weird, because our tests were passing in March without this being explicitly specified.
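For concreteness, a minimal sketch of what "explicitly specified" means here, using CaImAn's `CNMFParams`; the other values are illustrative assumptions, not taken from this issue:

```python
from caiman.source_extraction.cnmf.params import CNMFParams

# Illustrative CNMF-E parameters; the point is that nb=0 is passed
# explicitly instead of being left to a default.
params = CNMFParams(params_dict={
    'method_init': 'corr_pnr',   # CNMF-E initialization
    'nb': 0,                     # no global background components
    'center_psf': True,          # typical for 1p / CNMF-E data
    'low_rank_background': None,
})
```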
I have often wondered about that `4_294_967`. It would be good to document that line of code, in particular that magic number. I finally took the time to research it -- looking up stuff here:
https://docs.python.org/3/library/multiprocessing.html
https://superfastpython.com/multiprocessing-pool-map_async/
For people reading this (e.g., me) in the future: `map_async()` is a non-blocking function (as opposed to `map()`). `get()` is a method that returns the value, and its argument says how long to wait in seconds before throwing a `multiprocessing.TimeoutError`. By setting it to something that large (basically 50 days!) it is effectively saying go forever. The thing is, we could just not put any argument in there and it would behave the same. Or we could be more explicit and use `timeout=XX` so it is clearer.
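To make that concrete, a minimal standalone sketch (toy function, nothing CaImAn-specific):

```python
import multiprocessing

def square(x):
    return x * x

if __name__ == '__main__':
    with multiprocessing.Pool(4) as pool:
        # map() blocks until all results are ready;
        # map_async() returns an AsyncResult immediately.
        result = pool.map_async(square, range(10))
        # ... other work could happen here while the workers run ...
        # get() blocks up to `timeout` seconds, then raises
        # multiprocessing.TimeoutError; with no argument it waits forever.
        values = result.get(timeout=60)
    print(values)
```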
Anyway, glad you brought this up, because that has always confused me but I never took the time to actually research it; I assumed it was some memory management thing :smile:
Probably worth removing the argument, now that we understand what it is.
Ah ok! It's timeout, not chunk_size! :smile:
Maybe the reason they used 2^32 - 1 is that this is the maximum timeout that can be specified, because 2^32 would overflow?
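For what it's worth, the number isn't literally 2^32 - 1; it's suspiciously close to 2^32 milliseconds written in whole seconds. That's pure speculation on my part -- a quick check:

```python
print(2**32 - 1)           # 4294967295 -- not the value in the code
print(2**32 // 1000)       # 4294967 -- 2^32 ms in whole seconds, the value used
print(4_294_967 / 86_400)  # ~49.7 -- the timeout in days ("basically 50 days")
```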
Maybe that timeout exists for a reason; perhaps there's actually some delay with getting the output from `map_async` and that arg was put there to make sure it's gotten. I've sometimes heard of people who've had CNMF hang indefinitely; maybe this could've been the source?
Anyways, maybe we want to keep that arg but make it something reasonable, like 1 minute? Or a few minutes to be safe? `run_CNMF_patches` is in a deep nest, so I kinda don't want to benchmark that thing to figure out what the real `get()` times are :)
> I kinda don't want to benchmark that thing to figure out what the real get() times are
🤣 I will be the benchmarking hero.
jk, nope, it will always be a guessing game. That's probably what they did: pick the most unreasonably long time so it will never get a complaint :smile: Unless Pat does some AWS magic profiling with it.
I have found an issue where it indeed does hang here, but it's from 2018 so not sure how applicable it is: https://github.com/flatironinstitute/CaImAn/issues/301
Note: this person's real issue was bizarre (Python 2 and 3 mixed, how is that even possible?). But maybe `get()` will just hang in weird situations.
Anyways, multiprocessing will throw a `TimeoutError` if the timeout is exceeded, so we can do something reasonable like 5 mins and leave it at that?
I don't think a shorter timeout really solves anything (probably); it'd just give us a chance to spit out some kind of "I don't know what happened but things broke in this here code" message. It's probably always a bug if we pick a timeout and it goes over.
Yeah, rather than hanging, it'd basically be an indicator of "this is taking much longer than it should, something is probably wrong", versus the user having to do a manual keyboard interrupt because they noticed it was going for too long.
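A sketch of what that could look like around the existing call (assuming `dview` behaves like a `multiprocessing.Pool` here -- an ipyparallel view raises its own `TimeoutError` type -- and the 300 s value, the variable name, and the message are placeholders, not something agreed in this thread):

```python
import multiprocessing

PATCH_TIMEOUT = 300  # placeholder: 5 minutes, per the discussion above

try:
    # same call as in map_reduce.py, with a bounded timeout
    patch_results = dview.map_async(cnmf_patches, args_in).get(PATCH_TIMEOUT)
except multiprocessing.TimeoutError:
    raise RuntimeError(
        f"Patch processing exceeded {PATCH_TIMEOUT} s. This is taking much "
        "longer than it should; something is probably wrong with the "
        "parameters or a worker is stuck."
    )
```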
So the `gSiz > rf` issue is caught in #1186, but we still need to decide a timeout.
Looking back at the error:
```
TypeError: float() argument must be a string or a real number, not 'coo_matrix'
```
It seems this happens if `nb != 0` AND `low_rank_background` is false (default is true):
```python
# in map_reduce.py
if count_bgr == 0:
    # corresponds to nb == 0
    # OK
    ...
elif low_rank_background is None:
    # OK
    ...
elif low_rank_background:
    # OK
    ...
else:
    ...
    # error happens here (lines 456-457); B_tot starts as a scipy.sparse.csc_matrix:
    B_tot /= nB
    B_tot = np.array(B_tot, dtype=np.float32)
```
Maybe scipy.sparse used to support that, I don't know. But to fix it, assuming a dense array is actually necessary there, we should just replace that last line with `B_tot = B_tot.toarray().astype(np.float32)`.
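A self-contained repro of the failure and the proposed fix (in the environments where I've seen this, `np.array()` on a sparse matrix ends up calling `float()` on it, hence the message above; exact behavior may vary with numpy/scipy versions):

```python
import numpy as np
import scipy.sparse

B_tot = scipy.sparse.coo_matrix(np.ones((4, 3), dtype=np.float32))

# What map_reduce.py does now -- np.array() does not densify sparse
# matrices, so with dtype=np.float32 it falls back to float() and fails:
try:
    np.array(B_tot, dtype=np.float32)
except TypeError as e:
    print(e)  # float() argument must be a string or a real number, not 'coo_matrix'

# The proposed replacement: densify explicitly, then cast.
B_dense = B_tot.toarray().astype(np.float32)
print(B_dense.shape, B_dense.dtype)  # (4, 3) float32
```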
Troubleshooting mescore CI and ran into this; reproduced on the CNMFE demo notebook by setting `gSig = (10, 10)` and `gSiz = (41, 41)`, everything else the same. These params are quite far from ideal for this movie, but it's weird that a pickling error is thrown. The reason is that `gSiz > rf`, but this type of error is very cryptic. PR incoming...
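For reference, the kind of early check that would turn the cryptic pickling error into a readable one (a hypothetical sketch; the actual check is in the PR mentioned in this thread and its exact condition may differ):

```python
# Hypothetical guard, not the actual PR code. The thread identifies
# gSiz > rf as the failing condition: the neuron size must fit in a patch.
def check_patch_params(gSiz, rf):
    if rf is not None and max(gSiz) > rf:
        raise ValueError(
            f"gSiz={gSiz} exceeds the patch half-size rf={rf}; "
            "increase rf or decrease gSiz."
        )

check_patch_params(gSiz=(41, 41), rf=20)  # raises ValueError up front
```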
Side note, `dview.map_async(cnmf_patches, args_in).get(4294967)` is retrieving chunks of size 2^32 - 1; curious why that's the chunk size?