AllenInstitute / ophys_etl_pipelines

Pipelines and modules for processing optical physiology data
Other
9 stars 5 forks source link

Add neuropil masks output to trace_extraction module #562

Closed mikejhuang closed 1 year ago

mikejhuang commented 1 year ago

https://github.com/AllenInstitute/AllenSDK/issues/2565 Neuropil masks are requested to be stored in the cell_specimens table so the users can access them for provenance/diagnosis of issues. This is the current work to generate the output file which further work will be needed to modify the AllenSDK to read the file and load into the cell_specimens table.

Thinking about it more, it might make more sense to modify the AllenSDK to create the neuropil_mask from the roi_mask and store it in the cell_specimens table. This strategy would allow users to access these masks for all existing datasets. The logic for creating the mask is fixed with no user-modifiable parameters.

https://github.com/AllenInstitute/ophys_etl_pipelines/blob/3fd19523c1c4ca4de0819d86946901a85b3ab511/src/ophys_etl/utils/roi_masks.py#L330

codecov[bot] commented 1 year ago

Codecov Report

Merging #562 (6b9f0c4) into main (d5ec183) will decrease coverage by 1.34%. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #562      +/-   ##
==========================================
- Coverage   97.60%   96.27%   -1.34%     
==========================================
  Files         105      105              
  Lines        8226     8239      +13     
  Branches      769      772       +3     
==========================================
- Hits         8029     7932      -97     
- Misses        119      229     +110     
  Partials       78       78              
Flag Coverage Δ
event_detection_tests ?
general_tests 96.27% <100.00%> (+<0.01%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
tests/modules/trace_extraction/conftest.py 100.00% <100.00%> (ø)
.../modules/trace_extraction/test_trace_extraction.py 100.00% <100.00%> (ø)
...es/trace_extraction/test_trace_extraction_utils.py 100.00% <100.00%> (ø)

... and 4 files with indirect coverage changes

mikejhuang commented 1 year ago

Code looks good, however I want to ask about the output file type. I imagine that you're using HDF5 for speed of loading the data correct? This is a change from the ROI masks that are stored in plain text json. I guess what I feel is missing on this ticket is code for loading the masks from the newly created HDF5 file. I'm for this to be another ticket, but I feel we need to develop a method near the Neuropil mask code that takes in the HDF5 file and returns a list of Neuropil masks.

I wrote this so the mask arrays can be easily be loaded into a dataframe to be appended to the cell_specimen table since that was the intended use case as defined by the ticket by Marina. There wasn't an indication that users would want to open up the file to recreate the mask objects so I think that's outside the scope of this ticket.

morriscb commented 1 year ago

Out of scope is why I said I'm happy for it to be another ticket. One thing I'd like to confirm with the data you are storing is that you can create a NeuropilMask object from it. If you can't, then I'm not sure you're storing enough information in the h5 file to be useful.

mikejhuang commented 1 year ago

Out of scope is why I said I'm happy for it to be another ticket. One thing I'd like to confirm with the data you are storing is that you can create a NeuropilMask object from it. If you can't, then I'm not sure you're storing enough information in the h5 file to be useful.

You don't think I fulfilled the requirements outlined in the ticket opened by Marina? The requirement there is to load the mask onto the cell_specimens_table alongside the roi_mask. They want to allow users to visualize the mask. I interpret this as plotting the numpy arrays stored in the dataframe. The H5 file has the information needed to load the neuropil_mask and to join it to the cell_specimens_table based on the roi_name.

In the data releases, the motion_corrected_movie isn't included so the trace would not be able to be regenerated from the neuropil mask. Thus would there even be a use case to load the data into the neuropil mask objects?

morriscb commented 1 year ago

Right, but I want to be able to confirm that all the information you need to do that in a useful way is stored. Since we already have a NeuropilMask object that is used in the code, my feeling is that at minimum you should be able to instantiate that object using the information from the neuropil data we wrote out. Treat it as a schema of sorts. Also, importantly, just storing the mask array for the neuropil isn't sufficient to place that masks location within the image/movie unless the mask shape is the same shape as the movie which I'm not sure it is.

mikejhuang commented 1 year ago

Right, but I want to be able to confirm that all the information you need to do that in a useful way is stored. Since we already have a NeuropilMask object that is used in the code, my feeling is that at minimum you should be able to instantiate that object using the information from the neuropil data we wrote out. Treat it as a schema of sorts. Also, importantly, just storing the mask array for the neuropil isn't sufficient to place that masks location within the image/movie unless the mask shape is the same shape as the movie which I'm not sure it is.

I considered if the required information to place the masks location within the movie should be included in the output file, but I decided it wasn't needed since that information is included in the cell_specimens table and it's included in the ROI file. Storing multiple copies of the same information across multiple files often raises concerns about accidental discrepancies. If the neuropil_mask_objects need to be recreated, one solution is to take both the ROI file and the neuropil_masks file as inputs.

morriscb commented 1 year ago

Right, but I want to be able to confirm that all the information you need to do that in a useful way is stored. Since we already have a NeuropilMask object that is used in the code, my feeling is that at minimum you should be able to instantiate that object using the information from the neuropil data we wrote out. Treat it as a schema of sorts. Also, importantly, just storing the mask array for the neuropil isn't sufficient to place that masks location within the image/movie unless the mask shape is the same shape as the movie which I'm not sure it is.

I considered if the required information to place the masks location within the movie should be included in the output file, but I decided it wasn't needed since that information is included in the cell_specimens table and it's included in the ROI file. Storing multiple copies of the same information across multiple files often raises concerns about accidental discrepancies. If the neuropil_mask_objects need to be recreated, one solution is to take both the ROI file and the neuropil_masks file as inputs.

Have you confirmed this? As in, is every neuropil mask the same shape as the ROI mask? I seem to recall there potentially being a step where the ROI mask was trimmed down. Also, I disagree that we don't need to store the extra information. The masks should be useful on their own, that say we wanted to investigate how the neuropil masks are preforming. We should be able to just load the file and plot them on their own without relying on other information from the ROIs. This would be useful if we were to have debug the pipeline in the future for instance.

mikejhuang commented 1 year ago

Right, but I want to be able to confirm that all the information you need to do that in a useful way is stored. Since we already have a NeuropilMask object that is used in the code, my feeling is that at minimum you should be able to instantiate that object using the information from the neuropil data we wrote out. Treat it as a schema of sorts. Also, importantly, just storing the mask array for the neuropil isn't sufficient to place that masks location within the image/movie unless the mask shape is the same shape as the movie which I'm not sure it is.

I considered if the required information to place the masks location within the movie should be included in the output file, but I decided it wasn't needed since that information is included in the cell_specimens table and it's included in the ROI file. Storing multiple copies of the same information across multiple files often raises concerns about accidental discrepancies. If the neuropil_mask_objects need to be recreated, one solution is to take both the ROI file and the neuropil_masks file as inputs.

Have you confirmed this? As in, is every neuropil mask the same shape as the ROI mask? I seem to recall there potentially being a step where the ROI mask was trimmed down. Also, I disagree that we don't need to store the extra information. The masks should be useful on their own, that say we wanted to investigate how the neuropil masks are preforming. We should be able to just load the file and plot them on their own without relying on other information from the ROIs. This would be useful if we were to have debug the pipeline in the future for instance.

Good call. I modified the output to store the information used to recreate the mask since it's different than the ROI info.