Determine requirements to hook Suite2p segmentation up to subsequent processing steps

aamster commented 2 years ago

Marina wants the ability to run data through suite2p segmentation, and then run the detected ROIs through subsequent processing steps (neuropil subtraction, demixing, decrosstalk, trace extraction) to ultimately obtain DF/F traces.

Marina does not need to use the denoiser or classifier.

At minimum, we need the ability to do this in an ad hoc fashion so that Suite2p segmentation outputs can be fed as input into subsequent steps.

Ultimately, this should be a queue on LIMS that she could run herself.

There does appear to be an OPHYS_SUITE2P_QUEUE that runs the segment_postprocess module from ophys_etl. This module runs Suite2P segmentation and claims to produce LIMS-compatible outputs. The first step in this work is probably to determine which modules consume segmentation outputs and whether they can be run on the outputs of segment_postprocess.

The purpose of this ticket is to determine whether or not any changes need to be made to ophys_etl.modules.segment_postprocess before it gets incorporated into any LIMS strategies or queues.

Tasks

[x] Determine which LIMS queues consume the outputs of legacy segmentation.
[x] Construct example input.jsons for those queues/modules based on the outputs of Suite2P segmentation. Examples of Suite2P segmentation outputs can be found at /allen/programs/mindscope/workgroups/surround/denoising_labeling_2022/segmentations
[x] Try to run the downstream executables based on those example input.jsons.
[x] Document any failures and recommend courses of action to make it possible to run Suite2P segmentation in production.
[x] Try to determine what the output.json produced by legacy segmentation looks like and whether or not the outputs from ophys_etl.modules.segment_postprocess can be easily altered to fit that schema.
[x] Make a recommendation whether we should proceed by modifying the output from ophys_etl.modules.segment_postprocess or by modifying the inputs of downstream modules.
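
One way to approach the schema comparison called for in the tasks above is to diff the keys of a single ROI record from each source. The sketch below assumes each output.json stores ROIs as JSON objects; the helper `diff_roi_keys` and the field names in the example records are invented for illustration and are not taken from either pipeline.

```python
def diff_roi_keys(produced_roi, expected_roi):
    """Report which keys differ between two ROI records (dicts)."""
    produced = set(produced_roi)
    expected = set(expected_roi)
    return {
        # Keys the downstream consumer expects but the producer lacks.
        "missing": sorted(expected - produced),
        # Keys the producer emits that the consumer does not expect.
        "extra": sorted(produced - expected),
    }


# Hypothetical records; real ones would be loaded from the two output.jsons.
produced = {"id": 1, "x": 0, "y": 0, "mask_matrix": [[True]]}
expected = {"id": 1, "x": 0, "y": 0, "mask": [[True]], "exclude_code": 0}
report = diff_roi_keys(produced, expected)
```

A report with an empty "missing" list would suggest the producer's output could be consumed after renaming alone, though value types and units would still need a separate check.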

Note: even though this was inspired by Marina's request on behalf of the learning-mFISH group, this is something we will ultimately have to do as we productionize the 2022 segmentation development work.

danielsf commented 2 years ago

Resources that may be helpful:

Workflow charts showing which LIMS strategies feed into each other http://lims2.corp.alleninstitute.org/workflow_charts

Table mapping LIMS executables to queues http://lims2.corp.alleninstitute.org/executables

Repo containing the ruby code for the LIMS strategies http://stash.corp.alleninstitute.org/projects/TECH/repos/lims/browse/app/strategies

morriscb commented 2 years ago

Looking over the code in queues, it seems that the output Marina refers to in her email is produced by the CELL_ROI_CREATION queue. Looking at the ophys code, I would guess that the replacement ophys code will need to replace both this queue and the VISUAL_BEHAVIOR_OPHYS_CELL_SEGMENTATION queue. I can't tell directly from the workflow chart where downstream the outputs of the segmented cells are consumed.

Looking through Marina's doc and the code, the main output product I'm worried about is exclusion_labels. In the legacy code these appear to be calculated from a kind of metadata, using a specific model for flagging the data. Here's the code. My worry is that a set of data quality flags derived from the suite2p output may not behave the same way as the legacy flags. This is likely fine, but if users rely on these flags and expect them to behave the same, that may be difficult to achieve. I'll continue looking through the code to find a downstream consumer and see what is missing beyond the known ROI data.

morriscb commented 2 years ago

After looking through LIMS and the input/output of the modules around ROI creation I think I can confirm that our pipeline in ophys_etl is basically ready to go assuming there isn't something strange in the Ruby strategies that I don't understand. Starting from the beginning:

The legacy C++ code that segments the data is run in two queues: VISUAL_BEHAVIOR_OPHYS_CELL_SEGMENTATION and OPHYS_CELL_SEGMENTATION. Both of those queues feed into CELL_ROI_CREATION, which then feeds directly to storage or into OPHYS_EXTRACT_TRACES. Interestingly, from the input jsons I've looked at, OPHYS_EXTRACT_TRACES seems to expect the ROI data in a format that is much closer to that output by the ophys_etl code than to the CELL_ROI_CREATION output that Marina shows. (I've attached an example.) Looking at the Ruby strategy for OPHYS_EXTRACT_TRACES, there seems to be specific code to translate and rename the ROI "columns" from those output by the ophys_etl code to those needed by extract traces (line 91 in the strategy). One of the main things that deviates from what Marina sent us is that the ROI ids are not a set of "mask:id" but are simply a number.

With this in mind, I was able to successfully run the extract traces pipeline code outside of LIMS using ROIs output from ophys by renaming some of the ROI columns to the expected names. This was done on experiment 790118079 and ran successfully, producing the traces for the majority of ROIs.
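
As a rough illustration of the renaming step described above, the sketch below rewrites the keys of each ROI record before it is placed into an extract traces input. The column names in `RENAME_MAP` and the helper `rename_roi_columns` are assumptions made for illustration; this thread does not list the actual names, so they would need to be checked against the Ruby strategy and a real input.json.

```python
import copy

# Hypothetical mapping from segmentation-output keys to the keys the
# downstream module expects. These names are illustrative assumptions.
RENAME_MAP = {
    "mask_matrix": "mask",
    "valid_roi": "valid",
}


def rename_roi_columns(rois, rename_map=RENAME_MAP):
    """Return a copy of `rois` with keys renamed according to `rename_map`.

    Keys not present in `rename_map` are passed through unchanged.
    """
    renamed = []
    for roi in rois:
        new_roi = {}
        for key, value in roi.items():
            new_key = rename_map.get(key, key)
            if new_key in new_roi:
                # Renaming would silently overwrite an existing field.
                raise ValueError(f"duplicate key after renaming: {new_key!r}")
            new_roi[new_key] = copy.deepcopy(value)
        renamed.append(new_roi)
    return renamed


example_rois = [
    {"id": 1, "x": 10, "y": 20, "mask_matrix": [[True, False]], "valid_roi": True},
]
converted = rename_roi_columns(example_rois)
```

Returning a copy rather than editing in place keeps the original segmentation output available for comparison if the downstream module rejects the renamed records.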

There are a few outstanding questions from this that may or may not be important:

I'm not sure where the conversion happens between the output of the ROIs from CELL_ROI_CREATION that Marina shows and the input to OPHYS_EXTRACT_TRACES. I assumed it would be in the Ruby strategies, but if it is, it isn't obvious to me.
The legacy code has an extra variable, exclude_code, which is related to (and I think a bitpacked version of) exclusion_labels. The ophys_etl code does have exclusion_labels, which I think only really gets used for motion borders. The legacy code uses a pickled set of criteria to create these exclusion labels/codes, and that isn't ported to ophys_etl. However, in the Python wrappers for the legacy code there were comments alluding to these labels also being legacy. It's hard to say. This may be a question for the science team of "Hey, these labels are likely going to change. How important is that considering the segmentation has fully changed as well and likely doesn't mean completely the same thing either?"
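
To make the "bitpacked" guess concrete, here is a minimal sketch of how a list of exclusion labels could be packed into a single integer code and unpacked again. The label names and their bit positions are hypothetical; the actual legacy encoding of exclude_code has not been confirmed in this thread.

```python
# Hypothetical label order: the position in this list is the bit index.
LABEL_BITS = ["motion_border", "too_small", "too_large"]


def pack_labels(labels, label_bits=LABEL_BITS):
    """Pack a list of label names into one integer, one bit per label."""
    code = 0
    for label in labels:
        # list.index raises ValueError for a label that has no bit assigned.
        code |= 1 << label_bits.index(label)
    return code


def unpack_code(code, label_bits=LABEL_BITS):
    """Recover the label names whose bits are set in `code`."""
    return [label for bit, label in enumerate(label_bits) if code & (1 << bit)]


code = pack_labels(["motion_border", "too_large"])
labels = unpack_code(code)
```

If the legacy code does work this way, the two fields carry the same information and only one of them would need to be reproduced, provided the bit assignments can be recovered.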

Having said all that, it does not seem as if the above is required for running the downstream process. Questions going forward are:

Are the exclusion_labels/codes used for anything analysis-wise, and if so, do we need to re-create them in the ophys_etl code or do we let them run as is?
Are we happy with looking at the most immediate downstream consumer OPHYS_EXTRACT_TRACES as the barometer of "this should work" or do we need to look down the pipelines even further? Is that even possible for us to do or will we just have to rely on spinning up a test LIMS or pipeline and seeing if we can get everything to run from end to end?

morriscb commented 2 years ago

To answer the last question in the original ticket branch: If I'm reading the code correctly, we shouldn't need to modify much of anything to insert the new segmentation class.

morriscb commented 2 years ago

Here are some of the files I forgot to upload yesterday: LIMS extract trace input, My extract trace input, Extract trace output

Also, I looked over the code for PostProcessROIs again and can see that there is at least an attempt to recreate some of the exclusion criteria in the code, so some of the statements above are much less concerning. I think we may be good to go, modulo some set of unforeseen problems with LIMS.

Also I'll link Marina's notebook showing the schemas that are output from CELL_ROI_CREATION that are not the same as what goes into downstream processing. Here's the notebook.

AllenInstitute / ophys_etl_pipelines

Determine requirements to hook Suite2p segmentation up to subsequent processing steps #481