Aharoni-Lab / miniscope-io

Data formatting, reading, and writing from miniscopes
https://miniscope-io.readthedocs.io
GNU Affero General Public License v3.0

make streamdaq run good #27

Open · sneakers-the-rat opened this issue 2 months ago

sneakers-the-rat commented 2 months ago

i profiled the processes and it looks like we've got lots of bottlenecks.

problem: the test in https://github.com/Aharoni-Lab/miniscope-io/pull/26 only processes ~100 frames, so it should be basically instantaneous. instead it takes ~15s. that won't cut it for realtime usage: ~15s of processing for roughly 5s worth of data is about 3x slower than realtime.

bottlenecks:

this is the first time i've taken a look at this method, but it's extremely expensive in a bunch of ways:

so i'm not really sure what that _format_frame method is doing, but it's making the daq much slower than realtime. maybe the place to start is for whoever wrote that to describe what's going on there so we can refactor it?

@phildong @MarcelMB @t-sasatani any insight?

t-sasatani commented 2 months ago

This is interesting and looks pretty critical. I think I'm responsible for most of this. The code around the queues hasn't changed much since I first quickly wrote it for 1 FPS transfer, and to be honest I don't remember much about it.

sneakers-the-rat commented 2 months ago

I think ideally for tests we want small data, but if we want to do long-running tests we can synthesize data (so, in time, it would be good to have a format that can generate data as well as parse it!).

sneakers-the-rat commented 2 months ago

OK i started working on this here: https://github.com/Aharoni-Lab/miniscope-io/tree/perf-streamdaq

it just seems like we have a sorta bad division of labor that's forcing a bit of inconsistency.

Here's how the SDCard.read method works:

Currently the stream_daq pipeline is like:

So basically what needs to happen is

but there's enough magic happening in the middle that i can't quite get it working right now.

there should basically be only one additional process, and all it should be doing is grabbing the bitstream from the device and shoving it in as large of a queue as can fit in memory so that we don't lose any buffers from the device. the rest doesn't really benefit from multiprocessing (transferring data between processes is expensive).
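
roughly, something like this (a minimal sketch of what i mean, not the actual stream_daq code; `read_buffer_from_device` and `parse_and_format` are hypothetical stand-ins for the real device read and the merged buffer-to-frame step):

```python
import multiprocessing as mp
import time


def read_buffer_from_device() -> bytes:
    # hypothetical stand-in for the real (blocking) device read
    time.sleep(0.05)
    return b"\x00" * 512


def parse_and_format(raw: bytes) -> bytes:
    # hypothetical stand-in for the merged buffer -> frame step
    return raw


def _capture(q, stop):
    # the only job of the extra process: pull raw buffers off the device and
    # park them in the queue so nothing from the device gets dropped
    while not stop.is_set():
        q.put(read_buffer_from_device())


if __name__ == "__main__":
    # make the queue as deep as memory allows so the capture side never
    # blocks on put(); everything downstream stays in the main process
    q = mp.Queue(maxsize=10_000)
    stop = mp.Event()
    proc = mp.Process(target=_capture, args=(q, stop), daemon=True)
    proc.start()
    try:
        for _ in range(100):  # e.g. a short test run
            frame = parse_and_format(q.get())
    finally:
        stop.set()
        proc.join()
```

the point being that the queue is the only cross-process hop, and it's deep enough that the device-facing side never has to wait on the rest of the pipeline.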

in fact, since each of the queues has a very small maxsize and the last step in the pipeline is the limiting step, every other step will hang on its put and get calls, which probably explains the large number of missed buffers here: the current design means that not only do we process at ~1/3 realtime, we can also only acquire from the device at ~1/3 realtime.
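
a toy illustration of that backpressure (not project code): with a bounded queue, put() stalls as soon as the consumer side falls behind, so whatever feeds the queue ends up rate-limited by the slowest downstream stage.

```python
import multiprocessing as mp
import queue

q = mp.Queue(maxsize=5)  # small maxsize, like the current pipeline queues
for i in range(5):
    q.put(i)             # queue is now full

try:
    q.put("one more", timeout=0.1)
except queue.Full:
    # the producer is throttled by the consumer: in the real pipeline this is
    # where acquisition stalls and device buffers get missed
    print("put() blocked: upstream now runs at the speed of the slowest stage")
```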

t-sasatani commented 2 months ago

Thanks for looking into this! I'm starting to recall how this worked as well.

t-sasatani commented 2 months ago

My bad, I should have made a branch from main for the commit I just made and merged (a962855). I merged it anyway, but please feel free to just revert it.

sneakers-the-rat commented 2 months ago

> We can't always expect a complete metadata header in our project.

Hell yes!!!! Let's make something robust and nimble. I definitely appreciate how careful the current code is. Let's try to refine the transmission process with the right balance of robustness on the transmission and reception sides.

> This test data is around the speed where the analog processing circuit starts showing errors so I think both causes are contributing to the corruption.

Also hell yes!!! Let's try to isolate each source of problems by making each one robust, independently tested, and optimized!

phildong commented 2 months ago

Just throwing in whatever I can remember on this when I was working on the codebase:

  1. I remember there was no reason buffer_to_frame and format_frame should be two separate processes/function calls; I marked it as a future refactoring task in my head, which never happened :). So yeah, totally agree with @sneakers-the-rat that some merging should happen here.
  2. The only reason I added all the Bits/BitArray stuff is that, when I was working on this, some versions of the image sensor MCU firmware reversed the bit order within every single 32-bit word (like LSB-first, but at the word level), and I just couldn't figure out an easy way to do that with native python/numpy (one possible numpy approach is sketched below). However, this seems to no longer be true according to https://github.com/Aharoni-Lab/miniscope-io/pull/17. If that's the case, I believe we can refactor and completely remove the dependency on bitarray.
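
For reference, one possible way to do that per-word bit reversal with plain numpy (a sketch only: it assumes "reversed" means the whole 32-bit sequence is mirrored and that the buffer length is a multiple of 4 bytes; the exact convention depends on the firmware):

```python
import numpy as np


def reverse_bits_per_word(raw: bytes, word_size: int = 4) -> bytes:
    """Mirror the bit order within each `word_size`-byte word of a buffer."""
    words = np.frombuffer(raw, dtype=np.uint8).reshape(-1, word_size)
    bits = np.unpackbits(words, axis=1)  # one row of word_size * 8 bits per word
    return np.packbits(bits[:, ::-1], axis=1).tobytes()


# e.g. big-endian 0x00000001 mirrors to 0x80000000
assert reverse_bits_per_word(b"\x00\x00\x00\x01") == b"\x80\x00\x00\x00"
```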

It's been a while and I barely remember any details, but let me know how I can help and I can dig in!

t-sasatani commented 2 months ago

I was playing around with the branch @sneakers-the-rat left and updated the handling around bits/bytes. Now it passes the tests and takes about 2 seconds to process a 5-second video on my PC, which is acceptable.

t-sasatani commented 2 months ago

@sneakers-the-rat Do we want to start a PR from this branch already, or do you want to add something first?

t-sasatani commented 2 months ago

Actually, I'll just start one later because we need this update soon even if it's not perfect.