AllenInstitute / ophys_etl_pipelines

Pipelines and modules for processing optical physiology data
Other
9 stars 5 forks source link

Create 2p event detection module #75

Closed wbwakeman closed 3 years ago

wbwakeman commented 4 years ago

Tasks:

djkapner commented 3 years ago

Peter has 1537 experiments here: /allen/programs/braintv/workgroups/nc-ophys/visual_behavior/event_detection on which he has previously run event detection offline. There are 4 genotypes represented, and these have already specific decay times specified in the code:

Slc17a7-IRES2-Cre/wt;Camk2a-tTA/wt;Ai93(TITL-GCaMP6f)/wt 0.34692165705293476
Slc17a7-IRES2-Cre/wt;Camk2a-tTA/wt;Ai94(TITL-GCaMP6s)/wt 0.7375508677946965
Sst-IRES-Cre/wt;Ai148(TIT2L-GC6f-ICL-tTA2)/wt 0.46368295442535684
Vip-IRES-Cre/wt;Ai148(TIT2L-GC6f-ICL-tTA2)/wt 0.5204879281347294

These are the same genotypes for the upcoming VB release. In addition to genotype, we want to make sure that we are reproducing the past results across frame rates.

Here are 4 experiments for spot-checking results against Peter's runs. Due probably to environment differences, the results are slightly different. The "new" results are ones that I have run with a new containerized version of the code that is not dependent on VBA or AllenSDK. It is a stepping-stone to making a production version. The 3 plots are ones that Peter and I made to decide these were close enough results. Left panel: pearsons correlation of event traces (Peter's run vs. new run) Middle panel: integrated event magnitude per cell (Peter vs. new) Right panel: minimum non-zero event magnitude (Peter vs. new)

792815735_comparison.png 799368904_comparison.png 881003484_comparison.png 955276580_comparison.png

The conclusion here is that for different genotypes and different frame rates relevant to release, we can reproduce Peter's results. @ledochowitsch @matchings

djkapner commented 3 years ago

@wbwakeman this seems to be what the event detection will need to run for one experiment:

{
   'experiment_id': 1053774037,
   'movie_frame_rate_hz': 31.0,
   'full_genotype': 'Vip-IRES-Cre/wt;Ai148(TIT2L-GC6f-ICL-tTA2)/wt',
   'ophysdfftracefile': '/allen/programs/braintv/production/visualbehavior/prod4/specimen_1022959551/ophys_session_1053696498/ophys_experiment_1053774037/1053774037_dff.h5',
   'output_file': <desired path of output>,
   'valid_roi_ids': [1053801242, 1053801240,1053801245, ...]
}

which I am getting (with a couple of extras) by:

SELECT oe.id as experiment_id,
       oe.movie_frame_rate_hz,
       eq.name as rig,
           d.full_genotype as full_genotype,
           oe.storage_directory || wkf.filename as OphysDffTraceFile,
           array(SELECT cell_rois.id
                         FROM cell_rois
                         WHERE cell_rois.valid_roi=True
                         AND cell_rois.ophys_experiment_id=oe.id) as valid_roi_ids
FROM ophys_experiments as oe
JOIN ophys_sessions as os ON os.id=oe.ophys_session_id
JOIN specimens as sp on sp.id=os.specimen_id
JOIN donors d on d.id = sp.donor_id
JOIN equipment as eq on eq.id = os.equipment_id
JOIN well_known_files as wkf on wkf.attachable_id=oe.id
JOIN well_known_file_types as wkft on wkf.well_known_file_type_id=wkft.id
WHERE wkft.name='OphysDffTraceFile'
AND oe.id IN ({})

Wayne notes that there are valid ROIs associated with an experiment that can have come from previous (non current) segmentation runs. The strategy should provide valid ROI IDs that are present in the OphysDffTraceFile or, in the event that the ruby strategy can't look into the hdf5 file, the strategy can provide all valid ROI IDs and the module itself will determine which ones in the DFF file are valid.

djkapner commented 3 years ago

Before incorporating into ophys_etl, I made some modifications to simplfiy and remove some dependencies, stored here temporarily: https://github.com/AllenInstitute/djkapner_sandbox/tree/master/l0_event_detection

this includes a dockerfile. This dockerfile and l0_analysys_mesoscopy.py will be the starting point to absorb this code into ophys_etl.

djkapner commented 3 years ago

@wbwakeman cc: @matchings There is not yet a specification for what the output of event detection is. Here is my proposal:

Currently, the df/F step creates a well_known_file: OphysDffTraceFile. This file contains the keys: data which is an nROI x nframe array. and also roi_names which is an length = nROI list of ids, lined up with data

I propose that the new event detection module creates a new well_known_file: OphysEventTraceFile This will be exactly the same format as the dff output format (hdf5), but instead of df/F in the data field, it will save the events output of event detection. events output is also nROI x nframe. It is mostly zeros. Where it is non-zero, it indicates an event magnitude. For non-cells (valid_roi=False) it will be all zeros.

@ledochowitsch will that events array capture all needed information about event detection? I think it will. @njmei I imagine that the data access apis will have a function like get_dff_traces called get_event_traces. For example

In this case, the output json of this module will be:

{
  'OphysEventTraceFile`: <path to written file>
}
djkapner commented 3 years ago

Established a version of the code in a new ophys_etl branch and was was able to reproduce above plots using the ophys_etl CI/CD dockerfile.

Job looks like:

#!/bin/bash
#PBS -q braintv
#PBS -l mem=120g
#PBS -l walltime=12:00:00
#PBS -l nodes=1:ppn=32
#PBS -N event_dev
#PBS -j oe
#PBS -o /allen/aibs/informatics/danielk/event_detection/logs
#PBS -t 1-3

inputs=(\
/allen/aibs/informatics/danielk/event_detection/output/792815735_input.json
/allen/aibs/informatics/danielk/event_detection/output/799368904_input.json
/allen/aibs/informatics/danielk/event_detection/output/881003484_input.json
/allen/aibs/informatics/danielk/event_detection/output/955276580_input.json)

image=docker://alleninstitutepika/ophys_etl_pipelines:develop
export SINGULARITY_DOCKER_USERNAME=<>
export SINGULARITY_DOCKER_PASSWORD=<>
TMPDIR=/scratch/capacity/${PBS_JOBID}

SINGULARITY_TMPDIR=${TMPDIR} singularity run \
        --bind /allen:/allen,/scratch/capacity/${PBS_JOBID}:/tmp \
        ${image} \
        /envs/event_detection/bin/python -m ophys_etl.transforms.event_detection \
            --input_json ${inputs[${PBS_ARRAYID}]}

and one of those input jsons looks like:

{
  "movie_frame_rate_hz": 31.0,
  "full_genotype": "Vip-IRES-Cre/wt;A",
  "ophysdfftracefile": "/allen/programs/braintv/production/visualbehavior/prod0/specimen_744911458/ophys_session_792619807/ophys_experiment_792815735/792815735_dff.h5",
  "valid_roi_ids": [858459595, 858459398, 858459609, ... ],
  "output_event_file": "/allen/aibs/informatics/danielk/event_detection/output/792815735.npz"
}
matchings commented 3 years ago

@djkapner regarding event detection output and what we want to save - the events .npz files that Peter has created off pipeline includes additional pieces of information beyond the events trace that may be relevant, including a few metrics (noise_stds and lambda value). Another important thing to consider is whether we want to save both the upsampled 30Hz events trace and the upsampled then downsampled 11Hz trace for mesoscope experiments. Generally, i am in favor of saving anything we think we might want to use in the future, so would advocate to keep the traces at both frame rates, and the metrics. We would only provide one trace to users, but we may want to do QC using both traces to validate the resampling, or use the upsampled versions for some analysis requiring higher temporal resolution.

Here is what is in the events .npz file:

events file is an .npz with the following files within it: dff: array of n_cells x n_timepoints with recalculated dF/F values(at original frame rate) ts: timestamps corresponding to timepoints in dff (at original frame rate) events: array of n_cells x n_timepoints with event magnitudes(at original frame rate) noise stds: array of length(n_cells) giving the value for standard deviation of the noise for all ROIs lambdas: array of length(n_cells) giving the lambda value for all ROIs upsampling_factor: factor used to resample mesoscope data into 30Hz time frame event_dict: event_dict contains one item per cell_roi_id, where each item is a dictionary with 4 keys: mag: event magnitude, sampled at 30Hz idx: (i think) indices into original timestamps before resampling ts: timestamps of events, sampled at 30Hz event_trace: trace of event magnitudes, at original frame rate (11Hz for mesoscope)

djkapner commented 3 years ago

@matchings It is possible to include all of that, but it will take more work. The reason I was leaning towards not doing it is that building that .npz file requires getting timestamps:

ts = dataset.timestamps.ophys_frames.values

which relies on VBA and VBA processing of sync files. We don't want the production pipeline dependent on VBA, so I would have to absorb that processing. (production code is not dependent on VBA nor on AllenSDK, which is one time-consuming aspect of absorbing code like this.) I'm wondering if the processing event detection module should be responsible for producing events and VBA (and/or AllenSDK perhaps) is responsible for interpreting those events in a larger context. There are tickets in our queue that follow this one that will update AllenSDK APIs to handle the output of this processing module.

That said, I could easily incorporate everything else not related to timestamps. In fact, it is already done (though needs a little cleanup).

I would also advocate for an hdf5 file format, rather than numpy-specific. The AllenSDK API should hide this from you anyway.

I can just update the AllenSDK API ticket with your exposition of the npz file. That ticket needs more specification anyway,

matchings commented 3 years ago

I wasnt suggesting building & providing the exact .npz file, i also prefer hdf5 in the same format as the dff_traces hdf5 file in lims. Also I dont think including the timestamps is necessary.

Is it possible to include events traces at both temporal resolutions for mesoscope without timestamps?

djkapner commented 3 years ago

Is it possible to include events traces at both temporal resolutions for mesoscope without timestamps?

yes, it is already there. I'll just shape it up into an hdf5 file.

ledochowitsch commented 3 years ago

I would suggest to include the events traces at the lower resolution, and arrays of event time stamps and matching magnitudes (resolution independent).

-Peter

Get Outlook for iOShttps://aka.ms/o0ukef


From: Dan Kapner notifications@github.com Sent: Thursday, January 7, 2021 11:09:02 AM To: AllenInstitute/ophys_etl_pipelines ophys_etl_pipelines@noreply.github.com Cc: Peter Ledochowitsch peterl@alleninstitute.org; Mention mention@noreply.github.com Subject: Re: [AllenInstitute/ophys_etl_pipelines] Create 2p event detection module (#75)

Is it possible to include events traces at both temporal resolutions for mesoscope without timestamps?

yes, it is already there. I'll just shape it up into an hdf5 file.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FAllenInstitute%2Fophys_etl_pipelines%2Fissues%2F75%23issuecomment-756319226&data=04%7C01%7C%7Ca08e1e396ddd4235caa508d8b33fb2b4%7C32669cd6737f4b398bddd6951120d3fc%7C0%7C0%7C637456433454579610%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=Bb9gG5zdHkX7cY%2BMMnFn4sTRyAShdSou5bHqkMdEvQ4%3D&reserved=0, or unsubscribehttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FABVLHVTY2ESYUKFWBSQYGELSYYBE5ANCNFSM4SAXC5SA&data=04%7C01%7C%7Ca08e1e396ddd4235caa508d8b33fb2b4%7C32669cd6737f4b398bddd6951120d3fc%7C0%7C0%7C637456433454589603%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=hWPfsIdvkBEbSWX9IC9E3VrIYL3xs3k5azMUH%2Fz9EJE%3D&reserved=0.

djkapner commented 3 years ago

@ledochowitsch I had made a simple test case so the code can run quickly through some tests. This was meant to be something really easy for the algorithm (high SNR, no drift, etc). image.png I was discussing with @njmei and it looks like the event index is always 1 less than the fake event time that I specified, and, it looks like this was intentional: out[ev['spikes']-1] = ev['pos_spike_mag']

I just want to confirm you want that off-by-one index in there.

ledochowitsch commented 3 years ago

Ah! Great catch, Dan. This part of the code was written by Michael O. I believe that the 1-offset was intentional to fix an indexing issue in Sean’s code. 7 months ago, Sean appears to have fixed that issue himself in fast.py, so now it’s being ‘fixed’ twice, which is once too many:

https://github.com/jewellsean/FastLZeroSpikeInference/commit/cdfaade68ceb6aa15ec5003c460de4e0575f1d5f

Best,

-Peter

From: Dan Kapner notifications@github.com Date: Friday, January 8, 2021 at 2:44 PM To: AllenInstitute/ophys_etl_pipelines ophys_etl_pipelines@noreply.github.com Cc: Peter Ledochowitsch peterl@alleninstitute.org, Mention mention@noreply.github.com Subject: Re: [AllenInstitute/ophys_etl_pipelines] Create 2p event detection module (#75)

@ledochowitschhttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fledochowitsch&data=04%7C01%7C%7C0ce75f59d26a491cd57c08d8b426f150%7C32669cd6737f4b398bddd6951120d3fc%7C0%7C0%7C637457426656612864%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=iqWeUqA68KGFaJN5HN6UFe7OK4gOydM4KgghD%2BP8J0I%3D&reserved=0 I had made a simple test case so the code can run quickly through some tests. This was meant to be something really easy for the algorithm (high SNR, no drift, etc). [https://camo.githubusercontent.com/ce2c374e8594e84aa2cbb2cd2944240f0c1e3ef06fb56347b9c6cc32551905c0/68747470733a2f2f696d616765732e7a656e68756275736572636f6e74656e742e636f6d2f3564646463633432303637333561303030316461383462622f36653261663638642d393961622d343564312d616165382d626638656139343164393032]https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcamo.githubusercontent.com%2Fce2c374e8594e84aa2cbb2cd2944240f0c1e3ef06fb56347b9c6cc32551905c0%2F68747470733a2f2f696d616765732e7a656e68756275736572636f6e74656e742e636f6d2f3564646463633432303637333561303030316461383462622f36653261663638642d393961622d343564312d616165382d626638656139343164393032&data=04%7C01%7C%7C0ce75f59d26a491cd57c08d8b426f150%7C32669cd6737f4b398bddd6951120d3fc%7C0%7C0%7C637457426656622859%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=nvxxFzxpP0aOZac%2BTW2EG8rKQpB1MpUGY43OwSMXmI4%3D&reserved=0 I was discussing with @njmeihttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnjmei&data=04%7C01%7C%7C0ce75f59d26a491cd57c08d8b426f150%7C32669cd6737f4b398bddd6951120d3fc%7C0%7C0%7C637457426656622859%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=%2FXCJWlmPOebV8XL7iXF0OhSJ74c9hCPT9L0HsPNS5RA%3D&reserved=0 and it looks like the event index is always 1 less than the fake event time that I specified, and, it looks like this was intentional: out[ev['spikes']-1] = ev['pos_spike_mag']

I just want to confirm you want that off-by-one index in there.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FAllenInstitute%2Fophys_etl_pipelines%2Fissues%2F75%23issuecomment-757034941&data=04%7C01%7C%7C0ce75f59d26a491cd57c08d8b426f150%7C32669cd6737f4b398bddd6951120d3fc%7C0%7C0%7C637457426656632852%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=3A9dCR7dxeGgEBHbk%2FqYKMXvwfpAv3MKtoWAGgW6Us4%3D&reserved=0, or unsubscribehttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FABVLHVSKWADTRRKPSCKYKADSY6DELANCNFSM4SAXC5SA&data=04%7C01%7C%7C0ce75f59d26a491cd57c08d8b426f150%7C32669cd6737f4b398bddd6951120d3fc%7C0%7C0%7C637457426656632852%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=Q3mpryBmDkEatVGxY6gIXAs2o4yZYdCSYZjzxwX1rok%3D&reserved=0.

djkapner commented 3 years ago

I was thinking "close enough" but, it wasn't passing @njmei 's sniff test. Thanks Nick!

djkapner commented 3 years ago

There's a fair amount of hand-wringing in the code about whether trace values have NaN's. I checked the 1537 experiments in Peter's output directory. For every valid ROI in 1536 / 1537 experiments, there are no NaNs. For one experiment: 1038171965, the whole file is NaNs - clearly some problem with this one.

I plan to trim back checking for and returning NaNs, as, this appears not to be present in production data coming out of the dff computation queue.

djkapner commented 3 years ago

PR is in review with informatics team. I need further discussion with @ledochowitsch about the impacts of upsampling mesoscope data.

djkapner commented 3 years ago

For a broader data review and comparison to past results, I chose 54 experiments across cre line and rig. These are the 2 oldest and 2 newest experiments from Peter's previous results. I had to throw out some older experiments that have run through an older version of the DFF queue. Those can be recovered, if desired, by rerunning the DFF queue, but, the DFF traces themselves are slightly different. For comparing event detection results to previous results, I thought that DFF change muddied the water and removed them.

full_genotype                                             rig    
Slc17a7-IRES2-Cre/wt;Camk2a-tTA/wt;Ai93(TITL-GCaMP6f)/wt  CAM2P.3    [1042639858, 1042895986, 942043482, 862848066]
                                                          CAM2P.4    [1051011761, 1051216600, 796105304, 783927872]
                                                          CAM2P.5                            [965218395, 964460159]
                                                          MESO.1     [1040982830, 1041171734, 965980160, 886585130]
Slc17a7-IRES2-Cre/wt;Camk2a-tTA/wt;Ai94(TITL-GCaMP6s)/wt  CAM2P.3                            [958681110, 950852107]
                                                          CAM2P.4                            [889777243, 896160394]
                                                          CAM2P.5      [977937585, 974945686, 903403819, 904155140]
Sst-IRES-Cre/wt;Ai148(TIT2L-GC6f-ICL-tTA2)/wt             CAM2P.3    [1052391266, 1051217227, 990381322, 979668410]
                                                          CAM2P.4    [1044446322, 1044678595, 993862620, 994790561]
                                                          CAM2P.5      [967005387, 960351917, 965930965, 965228792]
                                                          MESO.1       [951980477, 959388788, 866518326, 871196365]
Vip-IRES-Cre/wt;Ai148(TIT2L-GC6f-ICL-tTA2)/wt             CAM2P.3    [1043833869, 1047030135, 965928394, 960953590]
                                                          CAM2P.4    [1053605418, 1053774037, 830700800, 830093338]
                                                          CAM2P.5      [809497730, 803736273, 805784331, 808621958]
                                                          MESO.1     [1052877371, 1052877376, 936500606, 935514362]

The new module runs successfully through all of these. I am proposing a modified strategy for mesoscope/11Hz data. I'll use this selection of data to present that proposal and its impacts at the QC-a-thon on Thursday. I am noticing now that there are no examples of Slc17a7-IRES2-Cre/wt;Camk2a-tTA/wt;Ai94(TITL-GCaMP6s)/wt + MESO.1, but I don't see that combo in the release dataset either.

djkapner commented 3 years ago

This is a notebook for exploring event detection comparisons from the experiment to the trace to the event level: /allen/aibs/informatics/danielk/event_detection/event_detection_comparison.ipynb