catalystneuro / datta-lab-to-nwb

MIT License

Initial Inspection #1

Closed pauladkisson closed 2 months ago

pauladkisson commented 1 year ago

Overview

This issue serves as a public notebook for this conversion project

Miscellaneous Notes

Folder/File Descriptions

Files are described in ascending name order, as they appear in the unzipped Zenodo dataset

dlight_intermediate_results

3min_max_dlight_0-0.3s_vs_usage_shuffled_agg.parquet

dlight-chrimson_snippets_offline_features.parquet

lagged_analysis_session_bins.toml

performance-prediction-error-distances-and-dopamine.parquet

syllable_stats_photometry_offline.toml

syllable-classifier-from-dlight-amplitude-submission

dlight

shuffle

syllable-classifier-submission

dlight_raw_data

3s-pulsed-stim-dataframe.parquet

dlight_photometry_processed_full_transfer.parquet

dlight_photometry_processed_full.toml

hek_raw_data

keypoints_raw_data

miscellaneous_intermediate_results

autoencoder_characterization.parquet

misc_raw_data

autoencoder_test_data.h5

f1_scores_estimates_actual_calls.parquet

latencies_stim_arduino_test.dat

latencies_stim.parquet

optoda_intermediate_results

behavioral_classes.toml

behavioral-distance.parquet

closed_loop_learners.toml

da-vs-learning-per-syllable.parquet

joint_syllable_map.toml

syllable_stats_offline.toml AND syllable_stats_offline.toml

optoda_raw_data

closed_loop_behavior_transfer.parquet

closed_loop_behavior_velocity_conditioned.parquet

closed_loop_behavior_with_simulated_triggers_transfer.parquet

learning_aggregate.parquet

learning_timecourse_binsize-30.parquet

learning_timecourse_processed_summary.parquet

learning_timecourse_processed.parquet

realtime_package

rl_intermediate_results

rl_model_heldout_results_best_lag_rands.parquet

rl_model_heldout_results_lags.parquet

rl_model_parameters.toml AND rl_model_stats.toml

rl_raw_data

rl_modeling_dlight_data_offline.parquet

rl_modeling_dlight_data_online.parquet

TODOs

Based on initial inspection, I need to figure out the following

CodyCBakerPhD commented 1 year ago

Focusing on dlight_photometry_processed_full_transfer.parquet today

@pauladkisson Can you post a full list of the 87 column names so we can start putting together a list of exclusions (for determined duplicates and other things we don't want to make it to NWB) and finalize how to map the remaining important columns to NWB neurodata types?

CodyCBakerPhD commented 1 year ago

Summary of 3 identified experiments thus far

Hek

Initial imaging of the area of interest, just a few microscopy images

Photometry

All data stored in dlight_photometry_processed_full_transfer.parquet

Many sessions and many subjects, but thankfully a UUID mapping provided in the corresponding TOML

Optogenetic

All data in 3s-pulsed-stim-dataframe.parquet, will focus on this one after the photometry

pauladkisson commented 1 year ago

Digging into the dlight_raw_data/dlight_photometry_processed_full.parquet file:

87 Total Columns split into semantic groups

pauladkisson commented 1 year ago

Syllable-related Columns

Conclusion: Just use "predicted_syllable (offline)" since it's the one shown in figure 1d.

pauladkisson commented 1 year ago

PCs

Conclusion: Since it's a derived data stream we should omit from NWB file

pauladkisson commented 1 year ago

Session-related Columns

pauladkisson commented 1 year ago

Subject-related Columns

Conclusion: all this info is available in the mouse_id column --> that's what I'll use

pauladkisson commented 1 year ago

Trigger-related Columns

Conclusion: optogenetic triggering info seems to be an artifact of the aggregation process of the 8 opto-da mice with the 6 pure FP mice --> ignore for now until we get to the opto-da experiment data.

pauladkisson commented 1 year ago

Kinematic Data Columns

Conclusion: just keep centroid_x_mm, centroid_y_mm, height_ave_mm, and angle_unwrapped since the other variables can be reconstructed from those 4.

pauladkisson commented 1 year ago

Photometry Columns

Conclusion: Need to keep signal_dff, reference_dff, uv_reference_fit, reference_dff_fit, and summary/metadata fields: filter_params, signal_max, reference_max, signal_reference_corr, snr. All other columns can easily be derived from these ones.

pauladkisson commented 1 year ago

Miscellaneous

Conclusion: Pack up useful metadata like fs and ignore all opto-da stuff for now

CodyCBakerPhD commented 1 year ago

@pauladkisson This is great; I'll go over this in detail later today and provide next steps for assembling the conversion

The basic idea will be to make one data interface for each data stream, and put them together in a NWBConverter

This project reminds me of how the IBL structured the access to their data, so the IBL conversion is a good reference for how the final product should look

Notably, in each data interface in the NWBConverter I attached a certain object as an attribute during __init__ - that object was the thing analogous to a single large table, so something you will need here is to attach similar information about which columns of the table need to be loaded and which rows also correspond to one of the mouse IDs (we usually do one NWB file per subject)
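A minimal sketch of that column/row selection, assuming the big table has already been loaded as a list of row dictionaries. The helper name `rows_for_subject`, the column `pc_example`, and all values are hypothetical; a real interface would more likely filter a pandas DataFrame read from the parquet file.

```python
def rows_for_subject(table, mouse_id, columns):
    """Keep only `columns` from the rows that belong to one mouse."""
    return [
        {name: row[name] for name in columns}
        for row in table
        if row["mouse_id"] == mouse_id
    ]


# Toy stand-in for the aggregated table; every value here is made up.
table = [
    {"mouse_id": "dls-dlight-1", "uuid": "a", "signal_dff": 0.1, "pc_example": 1.0},
    {"mouse_id": "dms-dlight-1", "uuid": "b", "signal_dff": 0.2, "pc_example": 2.0},
    {"mouse_id": "dls-dlight-1", "uuid": "a", "signal_dff": 0.3, "pc_example": 3.0},
]

# Each data interface would hold on to a subset like this one.
subset = rows_for_subject(table, "dls-dlight-1", ["uuid", "signal_dff"])
```

Passing the column list explicitly keeps excluded columns (duplicates, derived streams) out of memory for each interface.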

CodyCBakerPhD commented 1 year ago

Syllable

Conclusion: Just use "predicted_syllable (offline)" since it's the one shown in figure 1d.

Sounds good

PCA

Conclusion: Since it's a derived data stream we should omit from NWB file

Yep

Session-level

date = full datetime (YYYY-MM-DD HH:MM:SS) of the start of each experimental session

This shall serve as the session_start_time for each NWB file then

timestamp = time series for each 30min session in seconds (0-1800s)

Can you elaborate on this? Is it an array from 0 to 1800? Is it equivalent to np.arange(0, 1800)?

uuid = unique identifier for each session (apparent from Figure 1d notebook and .toml file)

Cool, similar to IBL this will serve as our session_id

unique_idx = integer index for each uuid

Better to use the UUID if this is a one-to-one mapping

session_name/SessionName = description of type of session ex. "Recording Session 1", "Stim session 2", etc. session descriptions are not unique and have typo duplicates such as "Habituation 1", "habituation 1", "habituation 1 " etc.

If you can make a condensed mapping of the roughly 'unique' descriptors from the overlap of the two fields, we can include it in the session_description to provide more context
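One possible way to build that condensed mapping, sketched under the assumption that the duplicates differ only in case and whitespace (as in the "Habituation 1" examples). The function name `normalize_session_name` is hypothetical.

```python
def normalize_session_name(name):
    """Collapse whitespace and lowercase so typo duplicates compare equal."""
    return " ".join(name.split()).lower()


raw_names = ["Habituation 1", "habituation 1", "habituation 1 ", "Recording Session 1"]

# Map each normalized descriptor to the raw spellings that produced it.
condensed = {}
for raw in raw_names:
    condensed.setdefault(normalize_session_name(raw), []).append(raw)
```

Keeping the raw spellings alongside each normalized key makes it easy to review the merge before writing anything to session_description.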

session_number = number to describe the order of the session = -21, -20, ..., 0 (not sure why it counts from negative)

Does the order map perfectly to what you would get if you ordered each by the session start time?

session_repeat = 0 for the first session of the day, 1 for the second session

Hmm... seen this before, so there's some precedent for indicating the number on the day. OK with including this in the session description along with the other annotation

session_total_number = ???

Probably for getting context of run # per day? like, number '2 out of 3' on the day

stim_session = a number for each day of recording = -27, -27, -26, -26, ... (not sure why it counts from negative)

Sounds like a good one to ask about, might be relevant to something (or might not be)

Subject-level

mouse_id/subject_name/SubjectName

Yep, as agreed from the meeting just use mouse_id

area = brain region = 'dls' or 'dms'

Fine to use w/e convention they recorded, but curious if these are actual Allen Brain Atlas references or if they use their own convention or another atlas; see what you can find, otherwise we'll just ask

opsin = 'chrimson' if mouse implanted with optogenetic stimulator, otherwise 'n/a'

Hm.. this one is interesting... normally we'd use the indicator field of an ImagingPlane, but that's for optical imaging

Since this is ogen/photometry I guess it would have to be some metadata annotated at the subject level, which is why they have it here instead

genotype = area + opsin = {None, 'dls-chrimson-dlight', 'dls-dlight', 'dms-dlight'}

In general you can check the NWB schema for Subjects, though actually the docstring for the PyNWB type might be more useful

Looks like this ought to map directly onto the NWB genotype field, and would include the opsin info above so then the opsin could be ignored in the NWB mapping

CodyCBakerPhD commented 1 year ago

Trigger-related

Conclusion: optogenetic triggering info seems to be an artifact of the aggregation process of the 8 opto-da mice with the 6 pure FP mice --> ignore for now until we get to the opto-da experiment data.

Sounds fair given what you've summarized

Mostly looks like metadata about ogen itself and annotation around it, would need the raw ogen data plus Q/A with authors to understand more

Photometry

signal_dff = dlight emission signal (delta F / F0 from blue light component), green signal in figure 1d

reference_dff = dlight reference signal (from isosbestic UV component), grey control in figure 1d

uv_reference_fit = smoothed UV reference signal (see Methods: Photometry active referencing)

dlight_reref = dlight emission signal normalized by uv_reference_fit: dlight_reref = signal_dff - uv_reference_fit

So interesting to see how different labs use photometry...

This is much more like an ophys treatment (well, segmented ROIs anyway)

We normally see these referred to as baseline (the 'reference_dff'/'reference_fit' here) vs. detrended (the 'reref' here; the 'signal_dff' would be recoverable given both of those, so it is redundant to include all 3). We often see it applied to fluorescence directly rather than the delta'd fluorescence, but it ought to be a distributive property, so order of operations shouldn't matter
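A toy check of that redundancy, assuming dlight_reref = signal_dff - uv_reference_fit as stated in the column notes. The numbers are made up and chosen to be exactly representable as floats.

```python
signal_dff = [0.5, 0.75, 1.0]
uv_reference_fit = [0.25, 0.25, 0.5]

# Detrended trace as defined in the column notes.
dlight_reref = [s - r for s, r in zip(signal_dff, uv_reference_fit)]

# Any one of the three traces can be rebuilt from the other two.
recovered_signal = [d + r for d, r in zip(dlight_reref, uv_reference_fit)]
```

With real data the recovery would only hold to floating-point precision, so a tolerance-based comparison would be the safer check.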

Anyway these look to fit the parameters of the photometry extension pretty well (different response series for each)

Conclusion: Need to keep signal_dff, reference_dff, uv_reference_fit, reference_dff_fit, and summary/metadata fields: filter_params, signal_max, reference_max, signal_reference_corr, snr. All other columns can easily be derived from these ones.

Yep, sounds good enough to prototype

Kinematic

Which set of experiments did this data correspond to? (or was it both?)

camera_timestamp = time stamp recorded by depth camera --> redundant with timestamp

Can you explain this structure a bit more? Is it a vector of timing information, roughly ~30 FPS? Is it irregular? Do you have it for each session?

centroid_x_mm = centroid x-location of the mouse estimated by OpenCV findContours

Are you able to reproduce it exactly with the OpenCV function? If so can you note what version of OpenCV just for super clear provenance of reproducibility?

velocity_2d_mm = velocity in 2d (x and y); velocity is measured in mm/frame so it needs to be corrected by the sampling rate (30) = sqrt( centroid_x_mm.diff()**2 + centroid_y_mm.diff()**2 )

velocity_3d_mm = same as above except velocity measures changes in height as well as x-y position = sqrt( centroid_x_mm.diff()**2 + centroid_y_mm.diff()**2 + height_ave_mm.diff()**2 )

acceleration_2d_mm = acceleration in 2d (x and y), also needs to be corrected by sampling rate

acceleration_3d_mm = acceleration in 3d (x-y + height)

jerk_2d_mm and jerk_3d_mm = same as above for jerk
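For reference, a pure-Python sketch of the 2d velocity formula with the sampling-rate correction applied, assuming 30 frames per second as noted above. The helpers `diff` and `velocity_2d_mm_per_s` are hypothetical, and unlike pandas `.diff()` this `diff` drops the first element instead of returning NaN there.

```python
import math

FS = 30.0  # frames per second, taken from the notes above


def diff(values):
    """First difference; one element shorter than the input."""
    return [b - a for a, b in zip(values, values[1:])]


def velocity_2d_mm_per_s(x_mm, y_mm, fs=FS):
    """Per-frame displacement magnitude scaled from mm/frame to mm/s."""
    return [math.hypot(dx, dy) * fs for dx, dy in zip(diff(x_mm), diff(y_mm))]


# Made-up centroid positions: a 3-4-5 step followed by no movement.
x = [0.0, 3.0, 3.0]
y = [0.0, 4.0, 4.0]
v = velocity_2d_mm_per_s(x, y)
```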

height_ave_mm = height above the floor in mm

Height will be included as an additional axis on the x/y SpatialSeries (which should be in a Position container)

They didn't estimate the third axis in the velocity and acceleration values? Not that the mouse would be jumping around or anything lol, just curious

I guess that would explain why it's a separate column altogether...

angle and angle unwrapped = should be the orientation of the mouse:

angle_unwrapped = angle in radians but drifts throughout the experiment

angle = angle in radians but has weird discontinuities (not restricted to the range 0-2pi)

using angle_unwrapped seems to be a better option (and tracks better with velocity_angle)

The official Best Practice for data wrapping (OK, OK, it technically only applies to SpatialSeries but this is a sister data type of that)

Note however that Best Practices are not super hard-and-fast rules for the most part; that one especially comes down to whichever form you think is more useful in the context of understanding the paper or the dataset

So I think it would be fine to use the unwrapped version as you indicate
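If the wrapped form is ever wanted, it can be recovered from the unwrapped one. A generic sketch of both directions follows, assuming a 2pi period; it illustrates the general idea and is not a reconstruction of how the authors computed either column.

```python
import math


def unwrap(angles, period=2 * math.pi):
    """Remove jumps of a full period between consecutive samples."""
    if not angles:
        return []
    out = [angles[0]]
    for prev, cur in zip(angles, angles[1:]):
        # Shift each step into [-period/2, period/2) before accumulating.
        step = (cur - prev + period / 2) % period - period / 2
        out.append(out[-1] + step)
    return out


def wrap(angles, period=2 * math.pi):
    """Map angles back into [0, period)."""
    return [a % period for a in angles]


wrapped = [6.0, 0.2, 0.5]       # made-up heading that crosses 2*pi
unwrapped = unwrap(wrapped)      # continues past 2*pi instead of jumping
rewrapped = wrap(unwrapped)      # approximately the original values
```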

velocity_angle = angular velocity in rad/s (with correction needed) matches figure 1d BUT velocity_angle != height_ave_mm.diff() * 30; instead it is slightly different and flipped by a factor of (-1)...

velocity_height = height velocity in mm/s (with correction needed) BUT not consistent with Figure 1d: height velocity from figure 1d = height_ave_mm.diff(2) * 30 / 2 (average 2-frame height velocity), but it's not equal to velocity_height in the dataframe...
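For clarity, the 2-frame height velocity quoted from the figure 1d notebook, height_ave_mm.diff(2) * 30 / 2, can be written out in plain Python. The helper name and sample heights are hypothetical, and the result is two elements shorter than the input rather than NaN-padded as pandas would give.

```python
def two_frame_velocity(values, fs=30.0):
    """Difference over 2 frames, scaled to per-second units."""
    return [(values[i] - values[i - 2]) * fs / 2 for i in range(2, len(values))]


height_ave_mm = [10.0, 11.0, 12.0, 14.0]  # made-up heights in mm
velocity_height = two_frame_velocity(height_ave_mm)
```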

Anything that doesn't match the paper should be confirmed and brought to the authors' attention; were you able to run the notebook to generate the figure in this way? Or otherwise see anything in the notebook that might explain the discrepancy?

width_mm and length_mm = ??? maybe the width of the bounding box for the mice, but they fluctuate a lot: length = (21, 75), width = (18, 33) for just one session

Might be good to confirm with the author; bounding information could be useful metadata to annotate

movement_initations = moments when mice transitioned from stillness to motion, represented by an incrementing variable (see Methods: Movement initiation analyses)

This could be useful to keep as events; see the ndx-events extension
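A sketch of turning that column into event times, under the assumption (not confirmed by the authors) that the counter goes up by one on the frame where each initiation occurs. The function name and sample data are hypothetical, and NaN handling is left out.

```python
def initiation_times(counter, timestamps):
    """Return the timestamp of every frame where the counter increases."""
    return [
        t
        for prev, cur, t in zip(counter, counter[1:], timestamps[1:])
        if cur > prev
    ]


counter = [0, 0, 1, 1, 1, 2]                  # made-up incrementing variable
timestamps = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]   # made-up times in seconds
events = initiation_times(counter, timestamps)
```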

Conclusion: just keep centroid_x_mm, centroid_y_mm, height_ave_mm, and angle_unwrapped since the other variables can be reconstructed from those 4.

Interesting, they didn't track any key points on the subject other than the centroid? Normally pose estimation involves more detailed skeletons with several nodes and optional edges between them

While the velocity and acceleration can technically be calculated from the position we do often just include it in the file as the TimeSeries Ben mentioned for visualization purposes (sorry if that's confusing against our rule of no 'derived' data; it's more like, 'some' derived data; more precisely it comes down to how many and if any hyper-parameters went into calculating the derived data)

CodyCBakerPhD commented 1 year ago

Misc

Remind me though, these are all columns of the dlight_raw_photometry file right? Trying to understand why there's ogen related stuff in here (or did they mix ogen and photom for some sessions?)

fs = sampling_frequency = 30Hz

sampling frequency of what though? That's ecephys electrode level resolution

stim_frequency = optogenetic stimulation frequency = 25Hz

stim_duration = optogenetic stimulation duration = {0.25, 2.0, 3.0} seconds

pulse_width = 0.005 seconds = 5ms

Sounds like useful ogen metadata

frame_index = index of timestamp

Man, this is where I get really confused again with the timestamp field, which I thought now was the timestamps of the behavior camera?

feedback_status = {-1, 0, 1} = some kind of metadata, not sure what tho

Sounds like should ask author on this one

CodyCBakerPhD commented 1 year ago

Looks like plenty to get started then; I would suggest a separate, modular PR adding a separate DataInterface for each column that uniquely maps onto one or more neurodata types; the velocity, position, acceleration can all be one data interface for example; the most detailed interface will probably be the photometry values, which will add those various ROI response series (DFF traces) and their constituent links

Also feel free to get started on a basic NWBConverter that fills in all the NWB file and Subject metadata for a single session

As always let me know if (or when) you have any questions~

pauladkisson commented 1 year ago

Inconsistencies with Kinematic Data

I was trying to replicate Figure 1d from the base kinematic variables (x, y, height, angle) and I discovered some inconsistencies between the figure panels, the data, and how they are described in the methods.

I haven't fully explored all the downstream analysis, so I'm not sure how they use these kinematic variables, but based on the notebook that plots Figure 1d, I would guess there will be divergences like these in how they deal with units mm/(n*frame^x) --> mm/s^x.
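To make the unit bookkeeping concrete, here is a sketch of the mm/(n*frame^x) --> mm/s^x conversion. It assumes a 30 Hz frame rate and that each of the x differences spans n frames; the function name and that reading of the notation are my assumptions, not something stated in the paper.

```python
def per_frame_to_per_second(value, order, fs=30.0, frames_per_step=1):
    """Convert a quantity in mm/frame**order to mm/s**order.

    order: 1 for velocity, 2 for acceleration, 3 for jerk.
    frames_per_step: number of frames spanned by each difference (n above).
    """
    return value * (fs / frames_per_step) ** order


velocity = per_frame_to_per_second(2.0, order=1)                      # mm/s
two_frame = per_frame_to_per_second(4.0, order=1, frames_per_step=2)  # mm/s
acceleration = per_frame_to_per_second(1.0, order=2)                  # mm/s^2
```

Multiplying the result by 1e-3 would give the same quantity in metres.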

@CodyCBakerPhD, how do you think we should deal with this in NWB file? The ideas I have are to

  1. Blindly store all the kinematic variables in the dataframe and make a note of the units (mm/frame, mm/2-frame^2, etc.)
  2. Correct the units of the kinematic variables to SI (m/s, m/s^2, etc.) and potentially alter downstream analysis (or correct downstream analysis?)
  3. Insist on data that is more raw ex. depth video and just store that

CodyCBakerPhD commented 1 year ago

1/2. m/s seems like the best units to write to NWB. It's a Best Practice to use SI units and 'frame' is kind of like 'pixel'; it's acceptable if there is no known way to convert it to scientific units, but it seems in this case there is.

3. We should also have the raw video for the position tracking, yes. That would enable finer grain pose tracking in the future if someone wanted to

pauladkisson commented 1 year ago

timestamp = time series for each 30min session in seconds (0-1800s)

Can you elaborate on this? Is it an array from 0 to 1800? Is it equivalent to np.arange(0, 1800)?

Almost but not quite. For example, the example session (used in Figure 1d) starts with a timestamp = 3.333375s, has some NaNs, as well as some small deviations from the stated sampling frequency, and a few entries that skip by 2dt.
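A rough sketch of how those irregularities could be tallied, assuming dt = 1/30 s. The function name, the 25% tolerance, and the sample values (apart from the 3.333375 s start) are made up for illustration.

```python
import math


def summarize_timestamps(timestamps, fs=30.0, tol=0.25):
    """Count NaNs and classify steps between adjacent valid timestamps."""
    dt = 1.0 / fs
    summary = {"n_nan": 0, "regular": 0, "skipped": 0, "other": 0}
    summary["n_nan"] = sum(1 for t in timestamps if math.isnan(t))
    for prev, cur in zip(timestamps, timestamps[1:]):
        if math.isnan(prev) or math.isnan(cur):
            continue  # steps touching a NaN are not classified
        step = cur - prev
        if abs(step - dt) <= tol * dt:
            summary["regular"] += 1
        elif abs(step - 2 * dt) <= tol * dt:
            summary["skipped"] += 1  # one frame appears to be missing
        else:
            summary["other"] += 1
    return summary


dt = 1.0 / 30.0
t0 = 3.333375
timestamps = [t0, t0 + dt, float("nan"), t0 + 3 * dt, t0 + 5 * dt]
summary = summarize_timestamps(timestamps)
```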

pauladkisson commented 1 year ago

area = brain region = 'dls' or 'dms'

Fine to use w/e convention they recorded, but curious if these are actual Allen Brain Atlas references or if they use their own convention or another atlas; see what you can find, otherwise we'll just ask

They have exact stereotactic coordinates (AP, ML, DV) for the DLS implantation in the methods (Stereotaxic surgery for open field photometric recordings).

pauladkisson commented 1 year ago

session_number = number to describe the order of the session = -21, -20, ..., 0 (not sure why it counts from negative)

Does the order map perfectly to what you would get if you ordered each by the session start time?

It does for the non-stimulated mice (dls-dlight-1, dls-dlight-2, ..., dms-dlight-1, ...). But it does not for the optogenetically stimulated mice (dlight-chrimson-1, ...). For those mice the session numbers repeat every 4-6 sessions ex. (-1, 0, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, ...)
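The check itself is simple; a sketch with made-up start dates is below. The function name `numbers_follow_start_times` and the helper `make` are hypothetical, and only the session-number patterns come from the data described above.

```python
from datetime import datetime


def numbers_follow_start_times(sessions):
    """True if session numbers strictly increase when sorted by start time.

    sessions: list of (session_number, session_start_time) pairs.
    """
    ordered = sorted(sessions, key=lambda s: s[1])
    numbers = [number for number, _ in ordered]
    return all(a < b for a, b in zip(numbers, numbers[1:]))


def make(numbers):
    """Attach made-up consecutive dates to a list of session numbers."""
    return [(n, datetime(2021, 1, 1 + i)) for i, n in enumerate(numbers)]


photometry_only = make([-2, -1, 0])
stimulated = make([-1, 0, 1, 2, 3, 4, 1, 2, 3, 4])

photometry_ok = numbers_follow_start_times(photometry_only)
stimulated_ok = numbers_follow_start_times(stimulated)
```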

pauladkisson commented 1 year ago

centroid_x_mm = centroid x-location of the mouse estimated by OpenCV findContours

Are you able to reproduce it exactly with the OpenCV function? If so can you note what version of OpenCV just for super clear provenance of reproducibility?

I don't have access to the raw data to be able to check this. The note about OpenCV findcontours is using a quote from the methods of the paper: Methods: Pre-Processing "Next, the location of the mouse was identified by finding the centroid of the contour with the largest area using the OpenCV findcontours function."

pauladkisson commented 1 year ago

Anything that doesn't match the paper should be confirmed and brought to the authors' attention; were you able to run the notebook to generate the figure in this way? Or otherwise see anything in the notebook that might explain the discrepancy?

Yes, I have been using one of the notebook files in their repo: notebook_panels/fig_1d-dLight example.ipynb. I can show you tomorrow in our meeting in more detail.

pauladkisson commented 1 year ago

Interesting, they didn't track any key points on the subject other than the centroid? Normally pose estimation involves more detailed skeletons with several nodes and optional edges between them

They definitely did track keypoints (See extended data Fig. 3). And there's even a keypoints_raw_data folder in the data repo. But everything inside is Matlab .p files, which are generally like executables (I think?), so not really accessible. Probably something I need to look into tho.

CodyCBakerPhD commented 1 year ago

They have exact stereotactic coordinates (AP, ML, DV) for the DLS implantation in the methods (Stereotaxic surgery for open field photometric recordings).

Oh, I meant the acronym (the abbreviation for the brain area), to which my question still holds (are they Allen Mouse conventions)?

For exact coordinates, are they CCFv3?

Almost but not quite. For example, the example session (used in Figure 1d) starts with a timestamp = 3.333375s, has some NaNs, as well as some small deviations from the stated sampling frequency, and a few entries that skip by 2dt.

Which dt is that? That is, what data stream are these timestamps for then? Since they are irregular we will definitely want to use them

It does for the non-stimulated mice (dls-dlight-1, dls-dlight-2, ..., dms-dlight-1, ...). But it does not for the optogenetically stimulated mice (dlight-chrimson-1, ...). For those mice the session numbers repeat every 4-6 sessions ex. (-1, 0, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, ...)

Might be something to ask about then

pauladkisson commented 1 year ago

This issue is getting crowded, so moving future experiment/data notes to separate issues (see #11)