Aharoni-Lab / miniscope-io

Data formatting, reading, and writing from miniscopes
https://miniscope-io.readthedocs.io
GNU Affero General Public License v3.0

make streamdaq run good #27

Open · sneakers-the-rat opened this issue 2 months ago

sneakers-the-rat commented 2 months ago

i profiled the processes and it looks like we've got lots of bottlenecks.

problem: the test in https://github.com/Aharoni-Lab/miniscope-io/pull/26 only processes ~100 frames, so it should be basically instantaneous. instead it takes ~15s. that won't cut it for realtime usage: ~15s of processing for roughly 5s worth of data is about 3x slower than realtime.

bottlenecks:

this is the first time i've taken a look at this method, but it's extremely expensive in a bunch of ways:

so i'm not really sure what that _format_frame method is doing, but it's making the daq much slower than realtime. maybe the place to start is for whoever wrote that to describe what's going on there so we can refactor it?

@phildong @MarcelMB @t-sasatani any insight?

t-sasatani commented 2 months ago

This is interesting and looks pretty critical. I think I'm responsible for most of this. The code around the queues hasn't changed much since I first quickly wrote it for 1 FPS transfer, and to be honest I don't remember much about it.

sneakers-the-rat commented 2 months ago

I think ideally for tests we want small data, but if we want to do long-running tests we can synthesize data (so, in time, it would be good to have a format that can generate data as well as parse it!).

sneakers-the-rat commented 2 months ago

OK i started working on this here: https://github.com/Aharoni-Lab/miniscope-io/tree/perf-streamdaq

it just seems like we have a sorta bad division of labor that's forcing a bit of inconsistency.

Here's how the SDCard.read method works:

Currently the stream_daq pipeline is like:

So basically what needs to happen is

but there's enough magic happening in the middle that i can't quite get it working right now.

there should basically be only one additional process, and all it should be doing is grabbing the bitstream from the device and shoving it in as large of a queue as can fit in memory so that we don't lose any buffers from the device. the rest doesn't really benefit from multiprocessing (transferring data between processes is expensive).
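
roughly, something like this (a minimal sketch of what i mean, not the actual stream_daq code; `read_buffer_from_device` and `parse_and_format` are hypothetical stand-ins for the real device read and the merged buffer-to-frame step):

```python
import multiprocessing as mp
import time


def read_buffer_from_device() -> bytes:
    # hypothetical stand-in for the real (blocking) device read
    time.sleep(0.05)
    return b"\x00" * 512


def parse_and_format(raw: bytes) -> bytes:
    # hypothetical stand-in for the merged buffer -> frame step
    return raw


def _capture(q, stop):
    # the only job of the extra process: pull raw buffers off the device and
    # park them in the queue so nothing from the device gets dropped
    while not stop.is_set():
        q.put(read_buffer_from_device())


if __name__ == "__main__":
    # make the queue as deep as memory allows so the capture side never
    # blocks on put(); everything downstream stays in the main process
    q = mp.Queue(maxsize=10_000)
    stop = mp.Event()
    proc = mp.Process(target=_capture, args=(q, stop), daemon=True)
    proc.start()
    try:
        for _ in range(100):  # e.g. a short test run
            frame = parse_and_format(q.get())
    finally:
        stop.set()
        proc.join()
```

the point being that the queue is the only cross-process hop, and it's deep enough that the device-facing side never has to wait on the rest of the pipeline.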

in fact, since each of the queues has a very small maxsize and the last step in the pipeline is the limiting step, every other step will hang on its put and get calls, which probably explains the large number of missed buffers here: the current design means that not only do we process at ~1/3 realtime, we can also only acquire from the device at ~1/3 realtime.
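
a toy illustration of that backpressure (not project code): with a bounded queue, put() stalls as soon as the consumer side falls behind, so whatever feeds the queue ends up rate-limited by the slowest downstream stage.

```python
import multiprocessing as mp
import queue

q = mp.Queue(maxsize=5)  # small maxsize, like the current pipeline queues
for i in range(5):
    q.put(i)             # queue is now full

try:
    q.put("one more", timeout=0.1)
except queue.Full:
    # the producer is throttled by the consumer: in the real pipeline this is
    # where acquisition stalls and device buffers get missed
    print("put() blocked: upstream now runs at the speed of the slowest stage")
```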

t-sasatani commented 2 months ago

Thanks for looking into this! I'm starting to recall how this worked as well.

t-sasatani commented 2 months ago

My bad, I should have made a branch from main for the commit I just made and merged (a962855). I merged it anyway, but please feel free to just revert it.

sneakers-the-rat commented 2 months ago

> We can't always expect a complete metadata header in our project.

Hell yes!!!! Let's make something robust and nimble. I definitely appreciate how careful the current code is. Let's try to refine the transmission process with the right balance of robustness on the transmission and reception sides.

> This test data is around the speed where the analog processing circuit starts showing errors so I think both causes are contributing to the corruption.

Also hell yes!!! Let's try to isolate each source of problems by making each one robust, independently tested, and optimized!

phildong commented 2 months ago

Just throwing in whatever I can remember on this when I was working on the codebase:

  1. I remember there was no reason buffer_to_frame and format_frame should be two separate processes/function calls; I marked it as a future refactoring task in my head, which never happened :). So yeah, totally agree with @sneakers-the-rat that some merging should happen here.
  2. The only reason I added all the Bits/BitArray stuff is that, when I was working on this, some versions of the image sensor MCU firmware reversed the bit order within every single 32-bit word (like LSB-first, but at the word level), and I just couldn't figure out an easy way to do that with native python/numpy (one possible numpy approach is sketched below). However, this seems to no longer be true according to https://github.com/Aharoni-Lab/miniscope-io/pull/17. If that's the case, I believe we can refactor and completely remove the dependency on bitarray.
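
For reference, one possible way to do that per-word bit reversal with plain numpy (a sketch only: it assumes "reversed" means the whole 32-bit sequence is mirrored and that the buffer length is a multiple of 4 bytes; the exact convention depends on the firmware):

```python
import numpy as np


def reverse_bits_per_word(raw: bytes, word_size: int = 4) -> bytes:
    """Mirror the bit order within each `word_size`-byte word of a buffer."""
    words = np.frombuffer(raw, dtype=np.uint8).reshape(-1, word_size)
    bits = np.unpackbits(words, axis=1)  # one row of word_size * 8 bits per word
    return np.packbits(bits[:, ::-1], axis=1).tobytes()


# e.g. big-endian 0x00000001 mirrors to 0x80000000
assert reverse_bits_per_word(b"\x00\x00\x00\x01") == b"\x80\x00\x00\x00"
```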

It's been a while and I barely remember any details, but let me know how I can help and I can dig in!

t-sasatani commented 2 months ago

I was playing around with the branch @sneakers-the-rat left and updated the handling around bits/bytes. Now it passes the tests and takes about 2 seconds to process a 5-second video on my PC, which is acceptable.

t-sasatani commented 2 months ago

@sneakers-the-rat Do we want to start a PR from this branch already, or do you want to add something first?

t-sasatani commented 2 months ago

Actually, I'll just start one later because we need this update soon even if it's not perfect.