AllenInstitute / segmentation-labeling-app

Data pipeline and UI for human labeling of putative ROIs from 2p cell segmentations

Add traces to data extracted from Suite2P #56

Open djkapner opened 4 years ago

djkapner commented 4 years ago

The segmentation labeling app needs traces. Originally, the design was for trace extraction to occur during the pre-labeling transformations, which take as input the postgres rois and segmentation_runs tables (these link to the source movie). That transformation is not implemented, so we decided a faster route is to:

kschelonka commented 4 years ago

What about a separate table for traces rather than storing all these values in an array?

Something like this:

| Column | Type | Description |
| --- | --- | --- |
| id | int | Primary key |
| timestamp | float4 | Timestamp of value in trace array |
| index | int | Value of index in trace array (?) For sorting/downsampling |
| roi_id | int | Foreign key to rois table |
| trace | float4 | Value of trace at point |

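
To make the proposal concrete, here is a rough sketch of what such a table and a sorted/downsampled read could look like. It uses Python's built-in `sqlite3` as a stand-in for Postgres so that it runs anywhere; the table name `traces`, the one-column `rois` placeholder table, and the sample values are assumptions for illustration only, not part of the existing schema.

```python
import sqlite3

# SQLite stands in for Postgres here (assumption, for a runnable sketch).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Placeholder for the real rois table; only the key is modeled.
conn.execute("CREATE TABLE rois (id INTEGER PRIMARY KEY)")

# Hypothetical traces table mirroring the proposed columns.
conn.execute(
    """
    CREATE TABLE traces (
        id INTEGER PRIMARY KEY,
        "timestamp" REAL NOT NULL,
        "index" INTEGER NOT NULL,
        roi_id INTEGER NOT NULL REFERENCES rois (id),
        trace REAL NOT NULL
    )
    """
)

conn.execute("INSERT INTO rois (id) VALUES (1)")

# Made-up sample points: (timestamp, index, roi_id, trace value).
samples = [(0.0, 0, 1, 0.5), (0.1, 1, 1, 0.7), (0.2, 2, 1, 0.6), (0.3, 3, 1, 0.9)]
conn.executemany(
    'INSERT INTO traces ("timestamp", "index", roi_id, trace) VALUES (?, ?, ?, ?)',
    samples,
)

# Keep every second sample of ROI 1, ordered by position in the trace.
rows = conn.execute(
    'SELECT "index", trace FROM traces '
    'WHERE roi_id = ? AND "index" % 2 = 0 ORDER BY "index"',
    (1,),
).fetchall()
print(rows)
```

The `index` column is what makes the ordered, every-n-th-sample query possible without relying on row insertion order.
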
djkapner commented 4 years ago

I don't see a scenario where we'd want to perform SQL-style queries within the trace. I think the trace, simply as a 1D list/array, is sufficient. Your proposal would result in a potentially enormous SQL table and would add 4 bytes per data point (from the timestamp), doubling the size on disk.

I am concerned about size. Let's say a movie has 1e5 frames and the trace values are stored as 4-byte floats:

Per ROI: 400 kB
Right now, we have 1.6M ROIs in the database. Let's just call it 1M.
Total trace storage: 400 GB

I am not entirely sure which partition of aibsdc-dev-db1 the postgres databases are stored in, but only one partition is even close, and the traces would consume most of it.

I think postgres real is the best choice (worried about cornering ourselves with decimal). And, given the space concerns, I think we need to make some "which experiments" choices sooner, rather than just run all of them.
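
The back-of-the-envelope numbers above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the constant names are made up for illustration, and the inputs (1e5 frames, 4-byte floats, roughly 1M ROIs, 4 extra bytes per point for a timestamp) are the round figures from this comment, using decimal units (1 kB = 1e3 bytes).

```python
# Round figures taken from the estimate in the comment.
FRAMES_PER_MOVIE = 100_000      # 1e5 frames per movie
BYTES_PER_VALUE = 4             # 4-byte float per trace value
BYTES_PER_TIMESTAMP = 4         # extra float4 timestamp per point (per-row proposal)
N_ROIS = 1_000_000              # 1.6M in the database, rounded down to 1M

# Trace stored as a plain array: one value per frame.
bytes_per_roi = FRAMES_PER_MOVIE * BYTES_PER_VALUE
total_array_bytes = bytes_per_roi * N_ROIS

# Same data with a timestamp stored next to every value.
total_with_timestamps = FRAMES_PER_MOVIE * (BYTES_PER_VALUE + BYTES_PER_TIMESTAMP) * N_ROIS

print(f"per ROI:          {bytes_per_roi / 1e3:.0f} kB")
print(f"all ROIs (array): {total_array_bytes / 1e9:.0f} GB")
print(f"with timestamps:  {total_with_timestamps / 1e9:.0f} GB")
```
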

kschelonka commented 4 years ago

Point taken for sure about using the dev db and storage limits. Where I would caution (and this is a problem with the trace h5 files too) is in having essentially unitless data. If everything is just dumped to an array, there's no metadata about how the trace actually lines up with the source video.

(Sent by email in reply to Dan Kapner's comment of Fri, Apr 24, 2020, 12:12 PM, which was quoted in full.)

djkapner commented 4 years ago

Agree about unitless. I think this is a pill to swallow while we're between pipelines. We can include a validation check during transform_pipeline downsampling that the source video and trace have the same size along the first dimension. That would make sure things aren't out of sync and would also indicate in the code that they are supposed to line up.
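
A minimal sketch of that check might look like the following. The function name `check_first_dimension` is hypothetical, and how it would be wired into transform_pipeline is left open; the only assumption is that both the video and the trace support `len()`, which returns the size of the first dimension for lists, NumPy arrays, and h5py datasets alike.

```python
def check_first_dimension(video, trace):
    """Raise if the video and the trace differ in length along the first axis.

    Hypothetical helper: `video` is indexed as [frame, ...] and `trace`
    as [frame]. Returns the shared number of frames when they agree.
    """
    n_frames = len(video)
    n_samples = len(trace)
    if n_frames != n_samples:
        raise ValueError(
            f"video has {n_frames} frames but trace has {n_samples} samples; "
            "they are expected to line up one-to-one"
        )
    return n_frames


# Toy example: a 3-frame "video" of 2x2 images and a matching 3-point trace.
video = [[[0, 0], [0, 0]] for _ in range(3)]
trace = [0.1, 0.2, 0.3]
print(check_first_dimension(video, trace))
```

Failing loudly here, before downsampling, keeps a mismatched trace from being silently paired with the wrong frames.
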