cellarium-ai / CellMincer

CellMincer is a software package for self-supervised denoising of voltage imaging datasets

Dataset handling #14

Open · b-grimaud opened 2 months ago

b-grimaud commented 2 months ago

Hi, I'm in the process of setting up voltage imaging experiments and I came across this project as I was looking for analysis tools.

I just had a few questions regarding how data is handled:

Full-frame 16-bit videos on our setup average around 2 GB per thousand frames, so recording at 0.5-1 kHz will add up quite fast. Otherwise, each file can be treated as its own dataset and we can iterate through a folder with a bash script, but I'm not sure that would be the most efficient way of doing things.
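As a rough Python sketch of what I mean by iterating through a folder (the `recordings/` path, the `.bin` extension, and the `process_dataset` helper are just placeholders, not CellMincer API):

```python
from pathlib import Path

def process_dataset(recording: Path) -> None:
    # Placeholder for whatever per-file pipeline applies to one dataset.
    print(f"processing {recording} ({recording.stat().st_size / 1e9:.1f} GB)")

# Treat each recording file in the folder as its own dataset.
for recording in sorted(Path("recordings/").glob("*.bin")):
    process_dataset(recording)
```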

bricewang commented 1 month ago

Apologies for the slow response. To address your inquiries: CellMincer expects each dataset to be a single file structured in the way you described, though accidental truncation of the dataset during recording can be accounted for by overriding the `n_frames` manifest property.
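For illustration, such an override might look like the following; only `n_frames` is a manifest property actually named in this thread, and the file format and remaining field names here are placeholders rather than the exact schema (see the repository's example configs for that):

```yaml
# Hypothetical dataset manifest; only n_frames is confirmed above.
n_frames: 20000   # override when the recording was truncated mid-acquisition
width: 1024       # illustrative
height: 1024      # illustrative
```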

While we have a provision for temporarily writing large corpora to disk during training when memory is insufficient to hold every dataset, we have not explicitly designed a procedure for lazily loading sections of a single large dataset. From some back-of-the-envelope math, is my estimate accurate that your frame size is around 1000x1000 or something equivalent? The datasets we had in mind when developing CellMincer were on the order of 100x512.
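Spelling out that back-of-the-envelope math, using only the figures quoted in the original post:

```python
# ~2 GB per 1000 frames of 16-bit video implies roughly 1000x1000 frames.
bytes_per_1000_frames = 2 * 1024**3             # ~2 GB per 1000 frames
bytes_per_frame = bytes_per_1000_frames / 1000  # ~2.1 MB per frame
bytes_per_pixel = 2                             # 16-bit pixels
pixels_per_frame = bytes_per_frame / bytes_per_pixel
side = pixels_per_frame ** 0.5
print(f"~{pixels_per_frame:.2e} pixels/frame, roughly {side:.0f}x{side:.0f}")
# -> ~1.07e+06 pixels/frame, roughly 1036x1036
```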

If your aim is to train on a single such dataset and it fits in memory, we would recommend proceeding as-is, as CellMincer's long training time means any workaround incurs a relatively large performance hit.

If it doesn't fit in memory, we would suggest dissecting the dataset into smaller datasets, either by segmenting spatially (preferred) or by segmenting in time. Given the size of the data, we would expect enough signal in each fragment for the model to work with.
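As a rough sketch of the spatial segmentation idea, assuming the recording can be memory-mapped as a (T, H, W) array; the `.npy` format, file names, and tile size here are illustrative and not necessarily CellMincer's expected input format:

```python
import numpy as np

# Split a (T, H, W) recording into spatial tiles, each small enough to fit
# in memory, and save each tile as its own dataset.
movie = np.lib.format.open_memmap("recording.npy", mode="r")  # shape (T, H, W)
tile = 512  # tile edge; tune so one tile's full time series fits in memory

for i in range(0, movie.shape[1], tile):
    for j in range(0, movie.shape[2], tile):
        fragment = np.ascontiguousarray(movie[:, i:i + tile, j:j + tile])
        np.save(f"recording_tile_{i:04d}_{j:04d}.npy", fragment)
```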

We may release a future update that either enables lazy loading of large datasets or adds manifest parameters for sharding and reconstituting a dataset as described above.