AllenInstitute / AllenSDK

code for reading and processing Allen Institute for Brain Science data
https://allensdk.readthedocs.io/en/latest/

Create python tool to generate VBN NWB input JSONs #2400

Closed danielsf closed 2 years ago

danielsf commented 2 years ago

Currently, we depend on one of the LIMS Ruby strategies to generate the input JSONs for VBN NWB generation. The purpose of this ticket is to write a Python tool that can live in the SDK, query LIMS directly, and generate those files.

Example input_jsons to compare against can be found in

/allen/aibs/programs/mindscope/workgroups/np-behavior/vbn_data_release/input_jsons/

These were generated by the Ruby strategy. Whatever tool we produce should match these exactly (unless there are still fields, like running_speed_path, that we no longer actually use).
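A minimal sketch of how that comparison might be done, assuming both files are plain JSON and that running_speed_path is the only field to skip. The function names and the IGNORED_KEYS set are hypothetical, not part of the SDK:

```python
import json

# Hypothetical set of keys to skip; running_speed_path is the only field
# this ticket names as possibly obsolete.
IGNORED_KEYS = {"running_speed_path"}


def find_differences(expected, actual, ignored=IGNORED_KEYS, path=""):
    """Return a list of readable differences between two JSON-like objects."""
    diffs = []
    if isinstance(expected, dict) and isinstance(actual, dict):
        for key in sorted(set(expected) | set(actual)):
            if key in ignored:
                continue
            here = f"{path}/{key}"
            if key not in actual:
                diffs.append(f"{here}: missing from generated JSON")
            elif key not in expected:
                diffs.append(f"{here}: not present in reference JSON")
            else:
                diffs.extend(
                    find_differences(expected[key], actual[key], ignored, here))
    elif isinstance(expected, list) and isinstance(actual, list):
        if len(expected) != len(actual):
            diffs.append(
                f"{path}: list length {len(expected)} != {len(actual)}")
        else:
            # Lists are compared element by element, in order.
            for idx, (exp, act) in enumerate(zip(expected, actual)):
                diffs.extend(
                    find_differences(exp, act, ignored, f"{path}[{idx}]"))
    elif expected != actual:
        diffs.append(f"{path}: {expected!r} != {actual!r}")
    return diffs


def compare_json_files(reference_path, generated_path):
    """Load a reference and a generated input JSON and diff them."""
    with open(reference_path) as ref, open(generated_path) as gen:
        return find_differences(json.load(ref), json.load(gen))
```

Note that this sketch is order-sensitive for lists, so probes, channels, or units emitted in a different order than the Ruby strategy would show up as differences; whether that should count as a mismatch is a decision the sketch does not make.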

danielsf commented 2 years ago

One question I had about this work: how do we find the behavior, replay, and mapping stimulus pickle files? Here are all of the pkl well known files associated with a session, along with their corresponding well_known_file_types.name entries:

1043752325_506940_20200817.mapping.pkl -- MappingPickle
1043752325_506940_20200817.replay.pkl -- EcephysReplayStimulus
1043752325_506940_20200817.opto.pkl -- OptoPickle
1043752325_506940_20200817.behavior.pkl -- StimulusPickle

So: it looks like we need to find the StimulusPickle, MappingPickle and EcephysReplayStimulus well known files associated with each session.
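A rough sketch of that selection step, assuming the well known files for one session have already been fetched as (path, type name) pairs. The function name is hypothetical, and the input JSON keys are the ones that appear in the schema breakdown further down this thread:

```python
# Maps well_known_file_types.name to the input JSON key it should populate.
# OptoPickle is deliberately absent: it is not one of the three files needed.
PICKLE_TYPE_TO_JSON_KEY = {
    "StimulusPickle": "behavior_stimulus_file",
    "MappingPickle": "mapping_stimulus_file",
    "EcephysReplayStimulus": "replay_stimulus_file",
}


def select_stimulus_pickles(well_known_files):
    """Pick the behavior, mapping, and replay pickles for one session.

    well_known_files is an iterable of (path, type_name) pairs, where
    type_name is the well_known_file_types.name of the file.
    """
    selected = {}
    for path, type_name in well_known_files:
        key = PICKLE_TYPE_TO_JSON_KEY.get(type_name)
        if key is None:
            continue  # not one of the three pickles we care about
        if key in selected:
            raise ValueError(f"more than one {type_name} file found")
        selected[key] = path
    missing = set(PICKLE_TYPE_TO_JSON_KEY.values()) - set(selected)
    if missing:
        raise ValueError(f"missing stimulus files: {sorted(missing)}")
    return selected
```

Raising on duplicates or missing files is a design choice for the sketch; a real tool might prefer to log the problem and skip the session.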

danielsf commented 2 years ago

I suspect a lot of what we need can be provided by the lims_queries.py module in #2407. After that gets merged, we should investigate.

danielsf commented 2 years ago

I looked at an example input JSON generated by Nathan's ruby code. Below is my breakdown of the schema along with my best guesses where the data came from. In some cases, it is obvious (the field exists with the same name in an obvious LIMS database table). In some cases, Nathan's code remapped a LIMS column to a slightly related name (these are identified with "actually..."). In some cases, I am still unsure, but I have guesses (marked with "???").

A lot of the fields that are just columns in LIMS database tables are already returned by the metadata writing code in #2407. We should be able to use the lims_queries.py module in that PR to quickly get most of the data for the units and channels that need to be specified in the input JSON. For columns that are not returned as part of the metadata, it will be easy to expand the lims queries in the metadata writer and then have the metadata writing class pare down the output dataframe so that it excludes the undesired columns.

In cases of paths to well known files, I listed my best guess for the well_known_file_types.name associated with that file. I have not investigated the well_known_files.attachable_id and well_known_files.attachable_type to which those files are attached.

This is the schema of 'session data':

age <class 'str'> -- usually derived from donors.date_of_birth and ecephys_sessions.date_of_acquisition
behavior_session_id <class 'int'> -- column in behavior_sessions
behavior_stimulus_file <class 'str'> -- wkft.name='StimulusPickle'
date_of_acquisition <class 'str'> -- probably needs to be read from the stimulus pickle
date_of_birth <class 'str'> -- column in donors, which gets linked to ecephys_sessions via specimens table
driver_line <class 'list'> -- ?????
ecephys_session_id <class 'int'> -- column in ecephys_sessions
external_specimen_name <class 'int'> -- column in specimens
eye_dlc_file <class 'str'> -- wkft.name='EyeDlcOutputFile'
eye_tracking_filepath <class 'str'> -- possible wkft.name values: 'EyeTracking Pupil', 'EyeTracking Ellipses', 'RawEyeTrackingVideo'
eye_tracking_rig_geometry <class 'dict'> -- ?????
face_dlc_file <class 'str'> -- wkft.name='FaceDlcOutputFile'
foraging_id <class 'str'> -- column in behavior_sessions
full_genotype <class 'str'> -- column in donors
mapping_stimulus_file <class 'str'> -- wkft.name='MappingPickle'
monitor_delay <class 'float'> -- this is programmatically set by the SDK, I think
optotagging_table_path <class 'str'> -- wkft.name= 'EcephysOptotaggingTable'
probes <class 'list'> -- see below
raw_eye_tracking_video_meta_data <class 'str'> -- wkft.name = 'RawEyeTrackingVideoMetadata'
replay_stimulus_file <class 'str'> -- wkft.name = 'EcephysReplayStimulus'
reporter_line <class 'list'> -- Unclear... the SDK usually munges full_genotype into reporter and driver lines
rig_name <class 'str'> -- link behavior_sessions -> equipment and get equipment.name
sex <class 'str'> -- column in donors
side_dlc_file <class 'str'> -- wkft.name = 'SideDlcOutputFile'
stim_table_file <class 'str'> -- wkft.name = 'EcephysStimulusTable'
sync_file <class 'str'> -- wkft.name = 'EcephysRigSync'...???
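One way to check a generated file against this breakdown is a simple type check on the top-level fields. The sketch below copies the field names and types from the list above; the function name is hypothetical, and the exact-type comparison simply mirrors the type() output used in the breakdown, which may be stricter than the real NWB writer requires:

```python
# Expected top-level types, copied from the 'session data' breakdown.
EXPECTED_SESSION_TYPES = {
    "age": str,
    "behavior_session_id": int,
    "behavior_stimulus_file": str,
    "date_of_acquisition": str,
    "date_of_birth": str,
    "driver_line": list,
    "ecephys_session_id": int,
    "external_specimen_name": int,
    "eye_dlc_file": str,
    "eye_tracking_filepath": str,
    "eye_tracking_rig_geometry": dict,
    "face_dlc_file": str,
    "foraging_id": str,
    "full_genotype": str,
    "mapping_stimulus_file": str,
    "monitor_delay": float,
    "optotagging_table_path": str,
    "probes": list,
    "raw_eye_tracking_video_meta_data": str,
    "replay_stimulus_file": str,
    "reporter_line": list,
    "rig_name": str,
    "sex": str,
    "side_dlc_file": str,
    "stim_table_file": str,
    "sync_file": str,
}


def check_session_types(session):
    """Return a list of problems found in a 'session data' dict."""
    problems = []
    for key, expected in EXPECTED_SESSION_TYPES.items():
        if key not in session:
            problems.append(f"{key}: missing")
        elif type(session[key]) is not expected:
            problems.append(
                f"{key}: expected {expected.__name__}, "
                f"got {type(session[key]).__name__}")
    for key in session:
        if key not in EXPECTED_SESSION_TYPES:
            problems.append(f"{key}: unexpected field")
    return problems
```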

'probes' is a list of dicts, each representing a probe in the ecephys session. The breakdown of each probe's schema is:

================
channels <class 'list'> -- see below
csd_path <class 'NoneType'> -- This might be a placeholder; it was always None in the example I looked at
id <class 'int'> -- column in ecephys_probes
inverse_whitening_matrix_path <class 'str'> -- wkft.name='EcephysSortedWhiteningMatInv'
lfp <class 'NoneType'> -- wkft.name = 'EcephysLfpNwb'; this might also be a placeholder, given that we aren't generating this data yet
lfp_sampling_rate <class 'float'> -- probably ecephys_probes.global_probe_lfp_sampling_rate
mean_waveforms_path <class 'str'> -- wkft.name = 'EcephysSortedMeanWaveforms'
name <class 'str'> -- column in ecephys_probes
sampling_rate <class 'float'> -- probably ecephys_probes.global_probe_sampling_rate
spike_amplitudes_path <class 'str'> -- wkft.name = 'EcephysSortedAmplitudes'
spike_clusters_file <class 'str'> -- wkft.name = 'EcephysSortedSpikeClusters' or 'EcephysSortedClusterGroup'; unsure
spike_templates_path <class 'str'> -- wkft.name = 'EcephysSortedSpikeTemplates' (probably)
spike_times_path <class 'str'> -- wkft.name='EcephysSortedSpikeTimes'
templates_path <class 'str'> -- wkft.name = 'EcephysSortedTemplates' (probably)
temporal_subsampling_factor <class 'float'> -- column in ecephys_probes
units <class 'list'> -- see below

'channels' is a list of dicts representing the channels associated with each probe. The breakdown of each channel's schema is:

================
anterior_posterior_ccf_coordinate <class 'float'> -- column in ecephys_channels
dorsal_ventral_ccf_coordinate <class 'float'> -- column in ecephys_channels
id <class 'int'> -- column in ecephys_channels
left_right_ccf_coordinate <class 'float'> -- column in ecephys_channels
local_index <class 'int'> -- column in ecephys_channels
manual_structure_acronym <class 'str'> -- link ecephys_channels.manual_structure_id to structures.acronym
manual_structure_id <class 'int'> -- column in ecephys_channels
probe_horizontal_position <class 'float'> -- column in ecephys_channels
probe_id <class 'int'> -- link to the probe this channel is associated with
probe_vertical_position <class 'float'> -- column in ecephys_channels
valid_data <class 'bool'> -- column in ecephys_channels
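Since a LIMS query would most likely return channels as flat rows, they need to be grouped under their probe to match the nesting described above. A minimal sketch, assuming probes and channels are already available as lists of dicts keyed as in the breakdown; the function name is hypothetical, and sorting by local_index is a guess at a sensible order rather than something confirmed against the Ruby output:

```python
from collections import defaultdict


def attach_channels(probes, channels):
    """Return copies of the probe dicts with a 'channels' list attached."""
    by_probe = defaultdict(list)
    for channel in channels:
        by_probe[channel["probe_id"]].append(channel)
    return [
        {
            **probe,
            # Assumed ordering: ascending local_index within each probe.
            "channels": sorted(
                by_probe[probe["id"]], key=lambda c: c["local_index"]),
        }
        for probe in probes
    ]
```

A probe with no matching channels ends up with an empty 'channels' list rather than raising an error.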

'units' is a list of dicts representing the units (candidate cells) associated with each (probe, channel) pair. The breakdown of each unit's schema is:

===================
PT_ratio <class 'float'> -- column in ecephys_units
amplitude <class 'float'> -- column in ecephys_units
amplitude_cutoff <class 'float'> -- column in ecephys_units
cluster_id <class 'int'> -- actually ecephys_units.cluster_ids (note the singular versus plural)
cumulative_drift <class 'float'> -- column in ecephys_units
d_prime <class 'float'> -- column in ecephys_units
firing_rate <class 'float'> -- column in ecephys_units
id <class 'int'> -- column in ecephys_units
isi_violations <class 'float'> -- column in ecephys_units
isolation_distance <class 'float'> -- column in ecephys_units
l_ratio <class 'float'> -- column in ecephys_units
local_index <class 'int'> -- column in ecephys_units
max_drift <class 'float'> -- column in ecephys_units
nn_hit_rate <class 'float'> -- column in ecephys_units
nn_miss_rate <class 'float'> -- column in ecephys_units
peak_channel_id <class 'int'> -- ??? unclear; may just be the ID of the channel associated with this unit
presence_ratio <class 'float'> -- column in ecephys_units
quality <class 'str'> -- column in ecephys_units
recovery_slope <class 'float'> -- column in ecephys_units
repolarization_slope <class 'float'> -- column in ecephys_units
silhouette_score <class 'float'> -- column in ecephys_units
snr <class 'float'> -- column in ecephys_units
spread <class 'float'> -- column in ecephys_units
velocity_above <class 'float'> -- column in ecephys_units
velocity_below <class 'NoneType'> -- column in ecephys_units
waveform_duration <class 'float'> -- actually ecephys_units.duration
waveform_halfwidth <class 'float'> -- actually ecephys_units.halfwidth
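The three "actually..." remappings above can be expressed as a small rename table applied to each row returned from ecephys_units. This is only a sketch: the function name is hypothetical, and it assumes each row arrives as a dict keyed by LIMS column name.

```python
# LIMS column name in ecephys_units -> field name used in the input JSON.
UNIT_COLUMN_TO_JSON_KEY = {
    "cluster_ids": "cluster_id",
    "duration": "waveform_duration",
    "halfwidth": "waveform_halfwidth",
}


def lims_unit_row_to_json(row):
    """Rename remapped columns; all other columns keep their LIMS names."""
    return {
        UNIT_COLUMN_TO_JSON_KEY.get(column, column): value
        for column, value in row.items()
    }
```

For example, a row `{"id": 7, "cluster_ids": 3, "duration": 0.5}` would come out as `{"id": 7, "cluster_id": 3, "waveform_duration": 0.5}`.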

danielsf commented 2 years ago

Closed by #2438