AllenInstitute / ophys_etl_pipelines

Pipelines and modules for processing optical physiology data

Determine requirements to hook Suite2p segmentation up to subsequent processing steps #481

Closed aamster closed 2 years ago

aamster commented 2 years ago

Marina wants the ability to run data through suite2p segmentation, and then run the detected ROIs through subsequent processing steps (neuropil subtraction, demixing, decrosstalk, trace extraction) to ultimately obtain DF/F traces.

Marina does not need to use the denoiser or classifier.

At minimum, we need the ability to do this in an ad hoc fashion so that Suite2p segmentation outputs can be fed as input into subsequent steps.

Ultimately, this should be a queue on LIMS that she could run herself.

There does appear to be an OPHYS_SUITE2P_QUEUE that runs the segment_postprocess module from ophys_etl. This module runs Suite2p segmentation and claims to produce LIMS-compatible outputs. Probably the first step in this work is to determine which modules need to consume segmentation outputs and see whether they can be run on the outputs of segment_postprocess.

The purpose of this ticket is to determine whether or not any changes need to be made to ophys_etl.modules.segment_postprocess before it gets incorporated into any LIMS strategies or queues.

Tasks

Note: even though this was inspired by Marina's request on behalf of the learning-mFISH group, this is something we will ultimately have to do as we productionize the 2022 segmentation development work.

danielsf commented 2 years ago

Resources that may be helpful:

Workflow charts showing which LIMS strategies feed into each other http://lims2.corp.alleninstitute.org/workflow_charts

Table mapping LIMS executables to queues http://lims2.corp.alleninstitute.org/executables

Repo containing the ruby code for the LIMS strategies http://stash.corp.alleninstitute.org/projects/TECH/repos/lims/browse/app/strategies

morriscb commented 2 years ago

Looking over the code in the queues, it seems that the output Marina is referring to in her email is produced by the CELL_ROI_CREATION queue. Looking at the ophys code, I would guess that the replacement ophys code will need to replace both this queue and the VISUAL_BEHAVIOR_OPHYS_CELL_SEGMENTATION queue. I can't tell directly from the workflow chart where downstream the outputs of the segmented cells are consumed.

Looking through Marina's doc and the code, the main output product I'm worried about is exclusion_labels. In the legacy code, these appear to be calculated from a sort of metadata, given a specific model for flagging the data. Here's the code. My worry is whether data quality flags derived from the suite2p output can be made to behave the same way as those from the legacy code. This is likely fine, but if users rely on these flags and expect them to behave the same, that may be difficult to achieve. I'll continue looking through the code to find a downstream consumer and see what is missing beyond the known ROI data.

morriscb commented 2 years ago

After looking through LIMS and the input/output of the modules around ROI creation I think I can confirm that our pipeline in ophys_etl is basically ready to go assuming there isn't something strange in the Ruby strategies that I don't understand. Starting from the beginning:

The legacy C++ code that segments the data is run in two queues: VISUAL_BEHAVIOR_OPHYS_CELL_SEGMENTATION and OPHYS_CELL_SEGMENTATION. Both of those queues feed into CELL_ROI_CREATION, which then feeds either directly to storage or into OPHYS_EXTRACT_TRACES. Interestingly, from the input JSONs I've looked at, OPHYS_EXTRACT_TRACES seems to expect the ROI data to be in a format that is much closer to that output by the ophys_etl code than . (I've attached an example.) Looking at the Ruby strategy for OPHYS_EXTRACT_TRACES, there seems to be specific code to translate and rename the ROI "columns" from those output by the ophys_etl code to those needed by extract traces (line 91 in the strategy). One of the main things that deviates from what Marina sent us is that the ROI ids are not a set of "mask:id" but are simply a number.

With this in mind, I was able to successfully run the extract traces pipeline code outside of LIMS using ROIs output from ophys_etl by renaming some of the ROI columns to the expected names. This was done on experiment 790118079 and ran successfully, producing traces for the majority of ROIs.
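For anyone repeating this outside of LIMS, the renaming step can be sketched roughly as below. This is a minimal illustration, not the code that was actually run: the key names in `COLUMN_RENAMES` (`mask_matrix`, `valid_roi`, `mask`, `valid`) and the helper `rename_roi_columns` are assumptions made for the example, so check them against the real segmentation output and the extract traces input JSON before relying on them.

```python
import copy
import json

# Hypothetical mapping: key in the segmentation output -> key expected by
# extract traces. These names are assumptions for illustration only.
COLUMN_RENAMES = {
    "mask_matrix": "mask",
    "valid_roi": "valid",
}


def rename_roi_columns(rois, renames=COLUMN_RENAMES):
    """Return copies of the ROI dicts with keys renamed according to `renames`.

    Keys not listed in `renames` are passed through unchanged. The input
    list and its dicts are not modified.
    """
    renamed = []
    for roi in rois:
        new_roi = {}
        for key, value in roi.items():
            new_key = renames.get(key, key)
            if new_key in new_roi:
                # Refuse to silently overwrite when both old and new names exist.
                raise ValueError(f"ROI {roi.get('id')}: duplicate key {new_key!r}")
            new_roi[new_key] = copy.deepcopy(value)
        renamed.append(new_roi)
    return renamed


# Made-up ROI purely to exercise the function; not real experiment data.
example_rois = [
    {
        "id": 1,
        "x": 10,
        "y": 20,
        "width": 2,
        "height": 2,
        "valid_roi": True,
        "mask_matrix": [[True, False], [True, True]],
    }
]

converted = rename_roi_columns(example_rois)
print(json.dumps(converted, indent=2))
```

Raising on a key collision rather than overwriting is a deliberate choice here, so that an ROI carrying both the old and the new name is caught instead of silently losing one of them.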

There are a few outstanding questions from this that may or may not be important:

Having said all that, it does not seem as if the above is required for running the downstream process. Questions going forward are:

morriscb commented 2 years ago

To answer the last question in the original ticket branch: If I'm reading the code correctly, we shouldn't need to modify much of anything to insert the new segmentation class.

morriscb commented 2 years ago

Here are some of the files I forgot to upload yesterday: LIMS extract trace input, my extract trace input, extract trace output.

Also, I looked over the code for PostProcessROIs again and can see that there is at least an attempt to recreate some of the exclusion criteria in the code, so some of the statements above are much less concerning. I think we may be good to go, modulo some set of unforeseen problems with LIMS.

Also I'll link Marina's notebook showing the schemas that are output from CELL_ROI_CREATION that are not the same as what goes into downstream processing. Here's the notebook.