AllenInstitute / AllenSDK

code for reading and processing Allen Institute for Brain Science data
https://allensdk.readthedocs.io/en/latest/

SDK to return neuropil corrected traces, masks, and associated metrics #2565

Closed · matchings closed this issue 1 year ago

matchings commented 1 year ago

Describe the use case that is addressed by this feature. To validate the data they are using for analysis, filter outliers, and understand the processing steps, users need the neuropil masks, neuropil corrected traces, and associated metrics produced by the neuropil correction algorithm. Currently none of this information is provided, which creates ambiguity and a lack of trust regarding our processing. Without this information to validate against their analysis, users may choose not to use our data, or may end up with artifacts in their analysis caused by issues in the data processing. It's impossible to prevent all processing issues, so the best thing to do is to provide the intermediate steps and artifacts so that users can decide for themselves whether to trust a given ROI and its traces.

Describe the solution you'd like Currently the behavior_ophys_experiment object associated with VisualBehaviorOphysProjectCache only returns demixed traces (as described in #2524). The dataset object should also include neuropil corrected traces as an attribute (as a table similar to dff_traces and corrected_fluorescence), along with the neuropil masks and associated metrics, including the r-value and the RMSE. These values are saved in the 'neuropil_correction.h5' file in the lims directory for each experiment.

Here's a screenshot to illustrate:

(screenshot illustrating the values described above)
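For reference, here is a minimal sketch of how these values might be pulled out of that file today with direct lims filesystem access. The dataset names ('FC', 'r', 'RMSE', 'roi_names') and the path are assumptions based on the pipeline output, not a documented SDK interface:

```python
import h5py
import pandas as pd

def load_neuropil_correction(h5_path):
    """Read neuropil-corrected traces and correction metrics into one table."""
    with h5py.File(h5_path, "r") as f:
        roi_ids = [name.decode() for name in f["roi_names"][:]]  # assumed dataset name
        corrected = list(f["FC"][:])   # assumed: one corrected trace per ROI
        r = f["r"][:]                  # assumed: contamination ratio per ROI
        rmse = f["RMSE"][:]            # assumed: fit error per ROI
    return pd.DataFrame(
        {"neuropil_corrected_trace": corrected, "r": r, "RMSE": rmse},
        index=pd.Index(roi_ids, name="roi_id"),
    )

# Example (path is illustrative only):
# df = load_neuropil_correction(".../processed/neuropil_correction.h5")
```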

Describe alternatives you've considered An alternative could be to provide the code for users to compute the neuropil masks and r-values themselves, but the underlying code may change, and the computation of these values can vary across different runs of the algorithm, so I don't think that actually meets the use case.

Additional context Having this information would help users identify potential artifacts or other issues in ROIs they may want to exclude, and trace the provenance / history of the traces they are using for analysis. For example, if we had these additional values, we could have more easily detected the problems with neuropil correction that @jkim0731 has detected and is working through in this issue: https://github.com/AllenInstitute/LearningmFISHTask1A_data_validation/issues/15 . In particular, the r-value should never be greater than 1, but Jinho found many instances of this by digging through the relevant files in lims. A potential cause could be a bright dendrite or other part of a neuron in the neuropil mask, resulting in real signal being subtracted from the main ROI and throwing off the r-value (and the subsequent dFF calculation). Being able to easily visualize the neuropil masks would allow us (and external users) to diagnose such issues. Right now there is very little insight into these processing steps via the SDK.
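As an illustration of the kind of outlier check this would enable, assuming a table like the one sketched above:

```python
# Hypothetical sanity check on a DataFrame `df` like the sketch above:
# per the discussion, r should never exceed 1, so any r > 1 flags an ROI
# whose neuropil mask may contain real signal (e.g. a bright dendrite).
suspect_rois = df.index[df["r"] > 1.0]
print(f"{len(suspect_rois)} of {len(df)} ROIs have r > 1")
```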

Do you want to work on this issue? @jkim0731 may be able to provide code that he has used to extract and inspect the relevant pieces of information

jkim0731 commented 1 year ago

Thanks @matchings for opening this issue. I have an additional note regarding this issue.

There was no information about the neuropil masks that I could find, so they would have to be recreated. One of the reasons for looking at the neuropil masks is to check whether they excluded all segmented ROIs before the filtering step (not only the valid ROIs).
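A hedged sketch of that check, assuming boolean 2-D masks are available for every segmented ROI (valid and invalid) and for each neuropil annulus; the dictionaries here are hypothetical inputs, not an existing SDK structure:

```python
import numpy as np

def overlapping_neuropil_masks(all_roi_masks, neuropil_masks):
    """Return ids of neuropil masks that overlap any segmented ROI mask."""
    # Union of every segmented ROI, valid and invalid alike.
    roi_union = np.zeros_like(next(iter(all_roi_masks.values())), dtype=bool)
    for mask in all_roi_masks.values():
        roi_union |= mask
    # Any overlap means real signal could leak into the neuropil estimate.
    return [rid for rid, mask in neuropil_masks.items() if np.any(mask & roi_union)]
```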

morriscb commented 1 year ago

Hey, both. We discussed this ticket in our backlog refinement and came to the conclusion that we are going to split this into two parts. We think that adding the neuropil corrected traces and the r and RMSE values is fairly straightforward and can be done separately from the neuropil masks. Adding the masks will be a larger piece of work.

We checked through the pipelines and, as Jinho alluded to, it doesn't look like the neuropil masks are currently saved. The neuropil masks appear to be computed on the fly using this block of code. So we should be able to recreate them with that code and not worry too much about the mask calculation diverging between the pipeline and data packaging. When adding the masks, where should they be attached? For instance, we could provide them with the set of ROIs associated with the experiment, or in some other location. Or it could be a function called by the user on the ophys_experiment object that returns a table of the neuropil masks for each ROI in the experiment. Let us know how you would like to retrieve them.

matchings commented 1 year ago

@morriscb this sounds great, thanks.

It makes sense to me to put the neuropil masks in the cell_specimen_table along with the existing roi_masks, and just make it another column called neuropil_masks. I’m open to other suggestions but that seems most straightforward to me.
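If the masks do land in the cell_specimen_table, usage might look like the sketch below; cell_specimen_table and roi_mask already exist in the SDK, while the neuropil_mask column is the proposed addition and is hypothetical here:

```python
import matplotlib.pyplot as plt

# `experiment` is a BehaviorOphysExperiment from VisualBehaviorOphysProjectCache.
cell_table = experiment.cell_specimen_table
roi = cell_table.iloc[0]

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].imshow(roi["roi_mask"])
axes[0].set_title("roi_mask")
axes[1].imshow(roi["neuropil_mask"])  # proposed column, not yet available
axes[1].set_title("neuropil_mask (proposed)")
plt.show()
```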

morriscb commented 1 year ago

As discussed during refinement: given that the neuropil masks are not currently saved as a dataset and the need to use legacy code to reproduce them, we will shelve storing the neuropil masks for now, until they are produced and saved as part of the processing pipeline.

morriscb commented 1 year ago

Hey @matchings and @DowntonCrabby, I'm looking to add the r and RMSE values into the output data. One question I have is where should these variables be put? Do they just end up as part of the cell_specimens table for a given experiment next to the corrected traces? @aamster Have you got any opinions as well?

matchings commented 1 year ago

@morriscb my preference would be to include them as columns in a table that provides neuropil_corrected_traces, similar to how the events table has columns for events, filtered_events and the robust_signal and robust_noise metric values (or at least I think that is what they are called).

In other words, I think the r and RMSE values should be inherently associated with the neuropil corrected traces, since they are used in the computation of those traces.

But if there are strong opinions in the other direction, I could be ok with them being in the cell_specimen_table (the most important thing is that they are available), that’s just less intuitive to me.
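For concreteness, a sketch of the table layout being suggested here; the attribute and column names are hypothetical, modeled on the existing dff_traces and events tables:

```python
import numpy as np
import pandas as pd

# Placeholder values only; column names are hypothetical and the index
# follows the per-ROI convention of the other trace tables.
neuropil_corrected_traces = pd.DataFrame(
    {
        "neuropil_corrected_trace": [np.zeros(1000), np.zeros(1000)],
        "r": [0.70, 0.85],     # contamination ratio used in the correction
        "RMSE": [0.02, 0.05],  # fit error of the correction for each ROI
    },
    index=pd.Index([1, 2], name="cell_roi_id"),  # placeholder ids
)
```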

mikejhuang commented 1 year ago

@matchings @jkim0731

Hi, I'm currently revisiting this to address the neuropil_masks part. Going forward, all subsequent runs of the workflow will generate data outputs that store the neuropil masks, and further work will be done to include them in the cell_specimen_table alongside roi_mask.

I also looked into regenerating these neuropil_masks for the already processed datasets.

For the released data, the neuropil_masks cannot be regenerated from the ROIs stored in the NWB files, since only the valid ROIs are stored there, whereas neuropil_mask generation requires all segmented ROIs, including the invalid ones, as input.

If an internal user with access to lims wants to recreate these masks, I can write up a notebook to show how to do this. The masks can also be validated by recomputing the trace extraction and comparing these traces with the stored ones.
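For illustration, a minimal sketch of that validation, assuming the motion-corrected movie and the stored neuropil trace have been loaded elsewhere (e.g. from the lims experiment directory); the function name and inputs here are hypothetical:

```python
import numpy as np

def neuropil_mask_is_consistent(movie, neuropil_mask, stored_trace, atol=1e-3):
    """Check a recreated mask by recomputing its trace and comparing to the stored one.

    movie: motion-corrected movie, shape (frames, rows, cols)
    neuropil_mask: boolean array, shape (rows, cols)
    stored_trace: neuropil trace saved by the pipeline, shape (frames,)
    """
    recomputed = movie[:, neuropil_mask].mean(axis=1)
    return np.allclose(recomputed, stored_trace, atol=atol)
```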

Let me know if there is still interest from internal users who have access to lims in accessing the neuropil_masks, and I can write up that notebook.