pauladkisson commented 1 year ago

Overview

This issue serves as a public notebook for this conversion project

Miscellaneous Notes

_reformat_zenodo_downloads.ipynb requires 160GB of RAM (Large instance of Dandihub only has 64GB), so running the notebooks provided by the Datta Lab is going to be tricky
- seems like compression has made some of the data unreadable --> need to figure out decompression
matlab p code files are skipped until I figure out how to deal with them

Folder/File Descriptions

Files are described in ascending Name order as they appear in the unzipped zenodo dataset

dlight_intermediate_results

3min_max_dlight_0-0.3s_vs_usage_shuffled_agg.parquet

size = 3.2MB
shape = (281, 1000)
Index = lag = [-7.00, -6.95, ..., 7.00] = lag in ms?
Columns = [0, 1, ..., 999] = syllables?
Values = float64 ranging from -0.014 to 0.016 = dLight?

dlight-chrimson_snippets_offline_features.parquet

size = 611MB
shape = (11066130 rows × 52 columns)
Index = snippet = [0, 0, 0, ..., 0, 1, ..., 1, ..., 1844354] = snippet in time?
Columns = [signal_<xyz> = NaN, window]
signal columns are all NaNs --> compressed?
window = pandas.Interval that ranges from -0.2 to 1.0 (seconds?)

lagged_analysis_session_bins.toml

session_bins = [ 0, 360, 660, 900, 1680, 1800,] = session starts in seconds?

performance-prediction-error-distances-and-dopamine.parquet

size = 38MB
shape = 558431 rows × 14 columns
Index = 0, ..., 2345
syllable = 4-95 = syllable expressed by mouse
uuid = unique id for that syllable
mouse = mouse from which syllable comes (ex. dls-dlight-1)
da = Dopamine as measured by dLight?
velocity_2d_mm, height_ave_mm, etc. = data for each syllable

syllable_stats_photometry_offline.toml

syllable_to_sorted_idx = some mapping from each syllable ID to some kind of sorting index
sorted_idx_to_syllable reverse mapping
usages = some kind of percentage usage for each syllable
duration = mean, median, and std of duration for each syllable?

syllable-classifier-from-dlight-amplitude-submission

dlight

rf-classifier-sample-X.parquet where X ranges from 0000 to 0499
Each file contains a df with 44 entries
columns are accuracy, n_classes threshold and type=dLight

shuffle

same as dlight, but type=shuffle

syllable-classifier-submission

same as syllable-classifier-from-dlight-amplitude-submission but the RL model was trained differently?

dlight_raw_data

3s-pulsed-stim-dataframe.parquet

size = 183MB
2575275 rows × 27 columns
Index = 161661, ..., 9406091
predicted_syllable = 4-95
target_syllable = 17-76
stim_session = 1
mouse_id and uuid are unique ids for each mouse
other miscellaneous data fields and metadata fields

dlight_photometry_processed_full_transfer.parquet

size = 8.7GB
shape = (48333962, 87)
predicted syllable
labels = true syllable
pc00-pc09 = principal components
target syllable
trigger syllable
trigger information (time , duration, etc.)
dlight information = some NaNs
position/velocity/acceleration
mouse_id/uuid

dlight_photometry_processed_full.toml

maps from ID + session --> signal_max, reference_corr, etc.
looks like redundant information from dlight_photometry_processed_full_transfer.parquet

hek_raw_data

several .tif images with experiment #, date, and time

keypoints_raw_data

Bunch of .p files

miscellaneous_intermediate_results

some .p files and

autoencoder_characterization.parquet

size = 36KB
1338 rows × 7 columns
label match = score 0-1
frame mse = MSE in time?
pc mse = MSE for principal components?
jitter value = -40.0-40 = ???
type = "raw" | "ae"
jitter type = "scale" | "jitter"
jitter value 2 = NaN for jitter_type="scale" | -8.0-8.0 for jitter_type="jitter"

misc_raw_data

autoencoder_test_data.h5

size = 59MB
frames of data
timestamps of data
neural network autoencoder with weights and metadata
various latents of differing sizes

f1_scores_estimates_actual_calls.parquet

size = 25MB
shape = (60359509, 11)
true_labels = syllable
area = brain region ex. vta (axon)
sex
opsin
feedback call = bool = ??
is_target = bool
syllable_num = 1, ..., 3926 = number of syllable in order???
target_syllable = syllable targeted?

latencies_stim_arduino_test.dat

some raw data file --> need to figure out how to read

latencies_stim.parquet

size = 60KB
target = 17-76 = targeted syllable?
latency = -1346-531 = latency in ms?
latency_type = "onset" | "offset"

optoda_intermediate_results

behavioral_classes.toml

maps from syllable name --> syllable number ex. dive = [64, 65, 75, 95, 19, 71, 8]

behavioral-distance.parquet

size = 83KB
67 rows × 67 columns
some kind of pairwise distance metric (between syllables?)

closed_loop_learners.toml

a list of mouse_ids corresponding to "learners"

da-vs-learning-per-syllable.parquet

size = 77KB
shape = 48 rows × 65 columns
for each mouse_id, information about the syllables used by that mouse (usage, stimulation info, etc.)

joint_syllable_map.toml

map of equivalent syllables

syllable_stats_offline.toml AND syllable_stats_offline.toml

summary of info in da-vs-learning-per-syllable.parquet

optoda_raw_data

closed_loop_behavior_transfer.parquet

size = 48GB
shape = (120917144, 152)
primary raw data for all experiments
SessionName = "Recording Session X"
x and y position (mm)
date
predicted syllable numbers
velocity/acceleration/etc.

closed_loop_behavior_velocity_conditioned.parquet

size = 380MB
shape = (8566896, 29)
behavioral data (syllable, position, velocity, etc.)

closed_loop_behavior_with_simulated_triggers_transfer.parquet

size = 2GB
shape = (62409927, 43)
data with triggers

learning_aggregate.parquet

size = 8MB
shape = (221600, 29)
syllable info = count, usage, fold_change, etc.
different types of experiment ex. excitation, pulsed_photometry, etc.

learning_timecourse_binsize-30.parquet

size = 170MB
shape = (10001520, 32)

learning_timecourse_processed_summary.parquet

size = 686KB
shape = (2874, 56)
overall summary of learning experiments

learning_timecourse_processed.parquet

size = 29MB
shape = (172440, 56)
same as learning_timecourse_processed_summary.parquet except with more detailed data

realtime_package

metadata .yaml file with various hyperparameters
autoencoder.h5 = all the weights for the autoencoder
flip_detector.h5 = all the weights for the "flip detector" = ???
pc_components.h5 = principal components and derived metrics (ex. explained variance)
some .p files

rl_intermediate_results

some rl_model .p code

rl_model_heldout_results_best_lag_rands.parquet

size = 163KB
shape = (10000, 7)
loss = prediction loss
func = type of function = "simulate"
batch = 0-99
repeat = 0-49
r_ tm = ???

rl_model_heldout_results_lags.parquet

same as above except with extra field model = "dynamic" | "static"

rl_model_parameters.toml AND rl_model_stats.toml

hyperparameters for RL models

rl_raw_data

rl_modeling_dlight_data_offline.parquet

size = 230MB
shape = (2770162, 47)
lots of variables about the signal
syllable, target_syllable, etc.
some NaNs --> compression

rl_modeling_dlight_data_online.parquet

300MB
shape = (3561427, 47)
same as above except computed online?

TODOs

Based on initial inspection, I need to figure out the following

[x] Decompressing large files without needing crazy amounts of RAM
- Decompressed with pyarrow's ParquetWriter machinery, but was unable to partition by columns
[ ] Understand Matlab .p code and determine whether we need it or not
[ ] Understand .dat files
[x] offline vs online?

CodyCBakerPhD commented 1 year ago

Focusing on dlight_photometry_processed_full_transfer.parquet today

@pauladkisson Can you post a full list of the 87 column names and we can start putting together a list of exclusions (for determined duplicates and other things we don't want to make it to NWB) as well as finalize how to map the remaining important columns to NWB neurodata types

CodyCBakerPhD commented 1 year ago

Summary of 3 identified experiments thus far

Hek

Initial imaging of the area of interest, just a few microscopy images

Photometry

All data stored in dlight_photometry_processed_full_transfer.parquet

Many sessions and many subjects, but thankfully a UUID mapping provided in the corresponding TOML

Optogenetic

All data in 3s-pulsed-stim-dataframe.parquet, will focus on this one after the photometry

pauladkisson commented 1 year ago

Digging into the dlight_raw_data/dlight_photometry_processed_full.parquet file:

Overall this file seems to relate to the experiment where they correlated DA in the DLS (measured by FP) to spontaneous behavior (measured by depth camera --> split into syllables by MoSeq) WITHOUT applying any optogenetic manipulations.
However, upon closer inspection of the methods, it looks like they aggregated the n=6 mice from open-field FP experiment with the n=8 mice from the opto-DA experiments (no-stim sessions only) for a total of n=14 mice
Also, they measured DA in the DMS as a comparison in n=8 mice
As a result of this aggregation, the dlight_raw_data/dlight_photometry_processed_full.parquet file has fields relating to optogenetic stimulation as well as just the fiber photometry

87 Total Columns split into semantic groups

pauladkisson commented 1 year ago

Syllable-related Columns

"predicted_syllable (offline)" = syllable identified by offline variant of MoSeq
- 57 positive syllable integers + {-5} --> catch-all 'other' category?
- but the paper only includes the 37 most common syllables (>1% frequency)
- A substantial # of NaNs = 50,418/48,333,962 = 0.1% of rows
- This is the field used for syllable in Figure 1d
predicted_syllable = syllable identified by online variant of MoSeq
- same 57 positive syllable integers + {-1} AND the {-1} values do NOT align perfectly with the {-5} values from predicted_syllable (offline)
- no NaNs
labels = predicted syllable identity from RF classifier?
- 57 unique syllable integers but the paper only includes the 37 most common syllables (>1% frequency)
- A substantial # of NaNs = 365,433/48,333,962 = 0.76% of rows
target_syllable = 6 syllable integers targeted by optogenetic stim the previous day + {-5} --> no targeted syllable?
- {-5} is the most common target label (37% of total), indicating that this is probably the value for no target (ex. for the 6 truly no-stim mice)
syllable_group = 7 different integer syllable groups (0-7) as determined by hierarchical clustering on MoSeq distance

Conclusion: Just use "predicted_syllable (offline)" since it's the one shown in figure 1d.
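
A minimal sketch of how the per-column checks above (unique values, NaN fraction, agreement between two syllable columns) can be computed with pandas. The helper names and the toy values are made up for illustration; only the column names come from the dataframe.

```python
import numpy as np
import pandas as pd


def summarize_labels(series):
    """Return the NaN fraction and the sorted unique non-NaN values of a label column."""
    values = series.dropna()
    return {
        "nan_fraction": float(series.isna().mean()),
        "unique_values": sorted(values.unique().tolist()),
    }


def disagreement_fraction(a, b):
    """Fraction of rows that differ, among rows where both columns are non-NaN."""
    both = a.notna() & b.notna()
    return float((a[both] != b[both]).mean())


# Toy data (not real values from the dataset).
df = pd.DataFrame(
    {
        "predicted_syllable (offline)": [4, 4, -5, np.nan, 17],
        "predicted_syllable": [4, 5, -1, 17, 17],
    }
)
offline_summary = summarize_labels(df["predicted_syllable (offline)"])
mismatch = disagreement_fraction(
    df["predicted_syllable (offline)"], df["predicted_syllable"]
)
```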

pauladkisson commented 1 year ago

PCs

10 principal components of depth video used to train MoSeq AR-HMM
stored as pc0-pc9

Conclusion: Since it's a derived data stream we should omit from NWB file

pauladkisson commented 1 year ago

Session-related Columns

date = full datetime (YYYY-MM-DD HH:MM:SS) of the start of each experimental session
start_time and StartTime are identical to date but less complete (some NaNs) --> stick with date
timestamp = time series for each 30min session in seconds (0-1800s)
uuid = unique identifier for each session (apparent from Figure 1d notebook and .toml file)
unique_idx = integer index for each uuid
session_name = description of type of session ex. "Recording Session 1", "Stim session 2", etc.
- session descriptions are not unique and have typo duplicates such as "Habituation 1", "habituation 1", "habituation 1 " etc.
SessionName = just like session_name but with slightly different names
- session_name and SessionName do not overlap i.e. all entries with a valid session_name have None for SessionName and vice versa
session_number = number to describe the order of the session = -21, -20, ..., 0 (not sure why it counts from negative)
session_repeat = 0 for the first session of the day, 1 for the second session
session_total_number = ???
stim_session = a number for each day of recording = -27, -27, -26, -26, ... (not sure why it counts from negative)
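
A minimal sketch for collapsing the typo duplicates in session_name / SessionName, assuming the duplicates differ only by case and whitespace (as in the "Habituation 1" examples) and that missing entries are None. The function names are hypothetical.

```python
import re


def normalize_session_name(name):
    """Lowercase, trim, and collapse repeated whitespace in a session description."""
    if name is None:
        return None
    return re.sub(r"\s+", " ", name).strip().lower()


def merge_session_names(session_name, SessionName):
    """The two columns never overlap, so take whichever one is present."""
    chosen = session_name if session_name is not None else SessionName
    return normalize_session_name(chosen)
```

With this, "Habituation 1", "habituation 1", and "habituation 1 " all map to "habituation 1". Typos that change the letters themselves would still need a manual mapping.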

pauladkisson commented 1 year ago

Subject-related Columns

mouse_id = unique ID for each mouse in the experiment with some info about that mouse ex. dlight-chrimson-1, dls-dlight-1, etc.
subject_name = same as mouse_id except with typo duplicates (ex. dlight-crimson-1, dlight-chrimson-1) and some None's --> stick with mouse_id
SubjectName = same as mouse_id with some None's --> stick with mouse_id
area = brain region = 'dls' or 'dms'
opsin = 'chrimson' if mouse implanted with optogenetic stimulator, otherwise 'n/a'
genotype = area + opsin = {None, 'dls-chrimson-dlight', 'dls-dlight', 'dms-dlight'}

Conclusion: all this info is available in the mouse_id column --> that's what I'll use

pauladkisson commented 1 year ago

Trigger-related Columns

all trigger-related columns are NaN for the same set of entries
trigger_syllable = {95} or NaN = ???
- NaN most of the time: 31734505/48333962 = 66%
- 95 isn't even one of the target syllables...
trigger_time = 0.25 (seconds?)
trigger_threshold = {0.1, 1.0} = ???
trigger_syllable_scalar_comparison = 'gt' = ground truth?
trigger_syllable_scalar_threshold = 'mean'
trigger_syllable_scalar = velocity_2d_mm --> this triggering appears to be related to the velocity opto-da experiments?
trigger_syllable_scalar_baseline_duration = 5.0 = duration of stim? (doesn't seem to be consistent with the paper)

Conclusion: optogenetic triggering info seems to be an artifact of the aggregation process of the 8 opto-da mice with the 6 pure FP mice --> ignore for now until we get to the opto-da experiment data.

pauladkisson commented 1 year ago

Kinematic Data Columns

camera_timestamp = time stamp recorded by depth camera --> redundant with timestamp
centroid_x_mm = centroid x-location of the mouse estimated by OpenCV findContours
centroid_y_mm = same as above for y-location
velocity_2d_mm = velocity in 2d (x and y)
- velocity is measured in mm/frame so it needs to be corrected by the sampling rate (30)
- = sqrt( centroid_x_mm.diff()**2 + centroid_y_mm.diff()**2 )
velocity_3d_mm = same as above except velocity measures changes in height as well as x-y position
- = sqrt( centroid_x_mm.diff()**2 + centroid_y_mm.diff()**2 + height_ave_mm.diff()**2 )
acceleration_2d_mm = acceleration in 2d (x and y), also needs to be corrected by sampling rate
- = velocity_2d_mm.diff(2)
acceleration_3d_mm = acceleration in 3d (x-y + height)
jerk_2d_mm and jerk_3d_mm = same as above for jerk
height_ave_mm = height above the floor in mm
angle and angle_unwrapped = should be the orientation of the mouse:
- angle_unwrapped = angle in radians but drifts throughout the experiment
- angle = angle in radians but has weird discontinuities (not restricted to the range 0-2pi)
- using angle_unwrapped seems to be a better option (and tracks better with velocity_angle)
velocity_angle = angular velocity in rad/s (with correction needed)
- matches figure 1d BUT velocity_angle != angle_unwrapped.diff(2)
- instead velocity_angle = angle_unwrapped.diff(2) * -1
velocity_height = height velocity in mm/s (with correction needed) BUT not consistent with Figure 1d
- height velocity from figure 1d = height_ave_mm.diff(2) * 30 / 2 (average 2-frame height velocity) but it's not equal to velocity_height in the dataframe...
width_mm and length_mm = ???
- maybe the width of the bounding box for the mice, but they fluctuate a lot: length = (21, 75), width = (18, 33) for just one session
movement_initations = moments when mice transitioned from stillness to motion represented by an incrementing variable
- see Methods: Movement initiation analyses

Conclusion: just keep centroid_x_mm, centroid_y_mm, height_ave_mm, and angle_unwrapped since the other variables can be reconstructed from those 4.
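
A minimal sketch of reconstructing the velocity columns from the kept position columns, following the formulas listed above (1-frame differences, in mm/frame) and then converting to m/s with fs = 30 Hz. The function names and toy values are hypothetical.

```python
import numpy as np
import pandas as pd

FS = 30.0  # Hz, from the fs column


def velocity_2d_mm_per_frame(df):
    """sqrt(dx^2 + dy^2) between consecutive frames, in mm/frame."""
    dx = df["centroid_x_mm"].diff()
    dy = df["centroid_y_mm"].diff()
    return np.sqrt(dx**2 + dy**2)


def velocity_3d_mm_per_frame(df):
    """Same as the 2d version, but also counting changes in height."""
    dx = df["centroid_x_mm"].diff()
    dy = df["centroid_y_mm"].diff()
    dz = df["height_ave_mm"].diff()
    return np.sqrt(dx**2 + dy**2 + dz**2)


def mm_per_frame_to_m_per_s(values, fs=FS):
    """mm/frame -> mm/s (multiply by fs) -> m/s (divide by 1000)."""
    return values * fs / 1000.0


# Toy positions (not real data).
toy = pd.DataFrame(
    {
        "centroid_x_mm": [0.0, 3.0, 6.0],
        "centroid_y_mm": [0.0, 4.0, 8.0],
        "height_ave_mm": [10.0, 10.0, 22.0],
    }
)
v2d = velocity_2d_mm_per_frame(toy)
v3d = velocity_3d_mm_per_frame(toy)
v2d_si = mm_per_frame_to_m_per_s(v2d)
```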

pauladkisson commented 1 year ago

Photometry Columns

signal_dff = dlight emission signal (delta F / F0 from blue light component)
- green signal in figure 1d
reference_dff = dlight reference signal (from isosbestic UV component)
- grey control in figure 1d
uv_reference_fit = smoothed UV reference signal (see Methods: Photometry active referencing)
dlight_reref = dlight emission signal normalized by uv_reference_fit
- dlight_reref = signal_dff - uv_reference_fit
dlight_reref_zscore = zscore(dlight_reref)
dlight_reref_zscore_filter = filtered version of dlight_reref_zscore using the parameters specified by filter_params
reference_dff_fit = smoothed UV reference dff (see Methods: Photometry active referencing)
signal_reref_dff = dlight emission signal normalized by reference_dff_fit
X_z = zscore(X); where X can be various signals ex. signal_dff_z, reference_dff_z, etc.
X_deriv = derivative of X; where X can be various signals ex. signal_dff_z, reference_dff_z, etc.
signal_max, reference_max, signal_reference_corr, snr = summary stats for each session

Conclusion: Need to keep signal_dff, reference_dff, uv_reference_fit, reference_dff_fit, and summary/metadata fields: filter_params, signal_max, reference_max, signal_reference_corr, snr. All other columns can easily be derived from these ones.
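
A minimal sketch of deriving the re-referenced and z-scored traces from the kept columns. The subtraction follows the relation noted above (dlight_reref = signal_dff - uv_reference_fit). The z-score here uses the population standard deviation over whatever array is passed in and ignores NaNs; both choices are assumptions, since the exact normalization window used by the authors has not been confirmed.

```python
import numpy as np


def reref(signal_dff, uv_reference_fit):
    """dlight_reref = signal_dff - uv_reference_fit, element-wise."""
    return np.asarray(signal_dff, dtype=float) - np.asarray(uv_reference_fit, dtype=float)


def nan_zscore(x):
    """Z-score that ignores NaNs (assumed ddof=0, computed over the whole input)."""
    x = np.asarray(x, dtype=float)
    return (x - np.nanmean(x)) / np.nanstd(x)


# Toy traces (not real data).
signal_dff = [1.0, 2.0, 3.0, np.nan]
uv_reference_fit = [0.5, 0.5, 0.5, 0.5]
dlight_reref = reref(signal_dff, uv_reference_fit)
dlight_reref_zscore = nan_zscore(dlight_reref)
```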

pauladkisson commented 1 year ago

Miscellaneous

fs = sampling_frequency = 30Hz
stim_frequency = optogenetic stimulation frequency = 25Hz
stim_duration = optogenetic stimulation duration = {0.25, 2.0, 3.0} seconds
pulse_width = 0.005 seconds = 5ms
realtime-package = realtime-package-v6 = some kind of metadata?
frame_index = index of timestamp
feedback_status = {-1, 0, 1} = some kind of metadata, not sure what tho
feedback_misc00 = { 25.0} = looks redundant with stim_frequency
ir_indices = NaN for all non-optoDA mice --> ignore for now

Conclusion: Pack up useful metadata like fs and ignore all opto-da stuff for now

CodyCBakerPhD commented 1 year ago

@pauladkisson This is great; I'll go over this in detail later today and provide next steps for assembling the conversion

The basic idea will be to make one data interface for each data stream, and put them together in a NWBConverter

This project reminds me of how the IBL structured the access to their data; that conversion is a good reference for how the final product should look

Notably, in each data interface in the NWBConverter I attached a certain object as an attribute during __init__ - that object was the thing analogous to a single large table, so something you will need here is to attach similar information about which columns of the table need to be loaded and which rows also correspond to one of the mouse IDs (we usually do one NWB file per subject)

CodyCBakerPhD commented 1 year ago

Syllable

Conclusion: Just use "predicted_syllable (offline)" since it's the one shown in figure 1d.

Sounds good

PCA

Conclusion: Since it's a derived data stream we should omit from NWB file

Yep

Session-level

date = full datetime (YYYY-MM-DD HH:MM:SS) of the start of each experimental session

This shall serve as the session_start_time for each NWB file then

timestamp = time series for each 30min session in seconds (0-1800s)

Can you elaborate on this? Is it an array from 0 to 1800? Is it equivalent to np.arange(0, 1800)?

uuid = unique identifier for each session (apparent from Figure 1d notebook and .toml file)

Cool, similar to IBL this will serve as our session_id

unique_idx = integer index for each uuid

Better to use the UUID if this is a one-to-one mapping

session_name/SessionName = description of type of session ex. "Recording Session 1", "Stim session 2", etc.
- session descriptions are not unique and have typo duplicates such as "Habituation 1", "habituation 1", "habituation 1 " etc.

If you can make a condensed mapping of the roughly 'unique' descriptors from the overlap of the two fields, we can include it in the session_description to provide more context

session_number = number to describe the order of the session = -21, -20, ..., 0 (not sure why it counts from negative)

Does the order map perfectly to what you would get if you ordered each by the session start time?

session_repeat = 0 for the first session of the day, 1 for the second session

Hmm... seen this before, so some precedent for indicating the number on the day. OK with including this in the session description along with the other annotation

session_total_number = ???

Probably for getting context of run # per day? like, number '2 out of 3' on the day

stim_session = a number for each day of recording = -27, -27, -26, -26, ... (not sure why it counts from negative)

Sounds like a good one to ask about, might be relevant to something (or might not be)

Subject-level

mouse_id/subject_name/SubjectName

Yep, as agreed from the meeting just use mouse_id

area = brain region = 'dls' or 'dms'

Fine to use w/e convention they recorded, but curious if these are actual Allen Brain Atlas references or if they use their own convention or another atlas; see what you can find, otherwise we'll just ask

opsin = 'chrimson' if mouse implanted with optogenetic stimulator, otherwise 'n/a'

Hm.. this one is interesting... normally we'd use the indicator field of an ImagingPlane, but that's for optical imaging

Since this is ogen/photometry I guess it would have to be some metadata annotated at the subject level, which is why they have it here instead

genotype = area + opsin = {None, 'dls-chrimson-dlight', 'dls-dlight', 'dms-dlight'}

In general you can check the NWB schema for Subjects, though actually the docstring for the PyNWB type might be more useful

Looks like this ought to map directly onto the NWB genotype field, and would include the opsin info above so then the opsin could be ignored in the NWB mapping

CodyCBakerPhD commented 1 year ago

Trigger-related

Conclusion: optogenetic triggering info seems to be an artifact of the aggregation process of the 8 opto-da mice with the 6 pure FP mice --> ignore for now until we get to the opto-da experiment data.

Sounds fair given what you've summarized

Mostly looks like metadata about ogen itself and annotation around it, would need the raw ogen data plus Q/A with authors to understand more

Photometry

signal_dff = dlight emission signal (delta F / F0 from blue light component)
- green signal in figure 1d
reference_dff = dlight reference signal (from isosbestic UV component)
- grey control in figure 1d
uv_reference_fit = smoothed UV reference signal (see Methods: Photometry active referencing)
dlight_reref = dlight emission signal normalized by uv_reference_fit
- dlight_reref = signal_dff - uv_reference_fit

So interesting to see how different labs use photometry...

This is much more like an ophys treatment (well, segmented ROIs anyway)

We normally see these referred to as baseline (the 'reference_dff'/'reference_fit' here) vs. detrended (the 'reref' here; the 'signal_dff' would be recoverable given both of those so deemed redundant to include all 3), though we often see it applied to fluorescence directly rather than the delta'd fluorescence, but it ought to be a distributive property so order of operations shouldn't matter

Anyway these look to fit the parameters of the photometry extension pretty well (different response series for each)

Conclusion: Need to keep signal_dff, reference_dff, uv_reference_fit, reference_dff_fit, and summary/metadata fields: filter_params, signal_max, reference_max, signal_reference_corr, snr. All other columns can easily be derived from these ones.

Yep, sounds good enough to prototype

Kinematic

Which set of experiments did this data correspond to? (or was it both?)

camera_timestamp = time stamp recorded by depth camera --> redundant with timestamp

Can you explain this structure a bit more? Is it a vector of timing information, roughly ~30 FPS? Is it irregular? Do you have it for each session?

centroid_x_mm = centroid x-location of the mouse estimated by OpenCV findContours

Are you able to reproduce it exactly with the OpenCV function? If so can you note what version of OpenCV just for super clear provenance of reproducibility?

velocity_2d_mm = velocity in 2d (x and y)
- velocity is measured in mm/frame so it needs to be corrected by the sampling rate (30)
- = sqrt( centroid_x_mm.diff()**2 + centroid_y_mm.diff()**2 )
velocity_3d_mm = same as above except velocity measures changes in height as well as x-y position
- = sqrt( centroid_x_mm.diff()**2 + centroid_y_mm.diff()**2 + height_ave_mm.diff()**2 )
acceleration_2d_mm = acceleration in 2d (x and y), also needs to be corrected by sampling rate
acceleration_3d_mm = acceleration in 3d (x-y + height)
jerk_2d_mm and jerk_3d_mm = same as above for jerk

height_ave_mm = height above the floor in mm

Height will be included as an additional axis on the x/y SpatialSeries (which should be in a Position container)

They didn't estimate the third axis in the velocity and acceleration values? Not that the mouse would be jumping around or anything lol, just curious

I guess that would explain why it's a separate column altogether...

angle and angle_unwrapped = should be the orientation of the mouse:
- angle_unwrapped = angle in radians but drifts throughout the experiment
- angle = angle in radians but has weird discontinuities (not restricted to the range 0-2pi)
- using angle_unwrapped seems to be a better option (and tracks better with velocity_angle)

The official Best Practice for data wrapping (OK, OK, it technically only applies to SpatialSeries but this is a sister data type of that)

Note however that Best Practices are not super hard-and-fast rules for the most part; that one especially comes down to whichever form you think is more useful in the context of understanding the paper or the dataset

So I think it would be fine to use the unwrapped version as you indicate

velocity_angle = angular velocity in rad/s (with correction needed)
- matches figure 1d BUT velocity_angle != height_ave_mm.diff() * 30
- instead it is slightly different and flipped by a factor of (-1)...
velocity_height = height velocity in mm/s (with correction needed) BUT not consistent with Figure 1d
- height velocity from figure 1d = height_ave_mm.diff(2) * 30 / 2 (average 2-frame height velocity) but it's not equal to velocity_height in the dataframe...

Anything that doesn't match the paper should be confirmed and brought to the authors' attention; were you able to run the notebook to generate the figure in this way? Or otherwise see anything in the notebook that might explain the discrepancy?

width_mm and length_mm = ???
- maybe the width of the bounding box for the mice, but they fluctuate a lot: length = (21, 75), width = (18, 33) for just one session

Might be good to confirm with the author; bounding information could be useful metadata to annotate

movement_initations = moments when mice transitioned from stillness to motion represented by an incrementing variable
- see Methods: Movement initiation analyses

This could be useful to keep as events; see the ndx-events extension
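
A minimal sketch of turning the incrementing variable into event times. It assumes the column behaves like a step counter that goes up by one at the frame of each movement initiation and stays constant in between; that encoding has not been confirmed with the authors, and the function name and toy values are hypothetical.

```python
import numpy as np


def initiation_times(counter, timestamps):
    """Return the timestamps of the frames where the counter increases."""
    counter = np.asarray(counter, dtype=float)
    timestamps = np.asarray(timestamps, dtype=float)
    # np.diff compares each sample to the previous one; +1 points at the later sample.
    increment_frames = np.flatnonzero(np.diff(counter) > 0) + 1
    return timestamps[increment_frames]


# Toy data (not real values): two initiations, at frames 2 and 5.
toy_counter = [0, 0, 1, 1, 1, 2, 2]
toy_timestamps = np.arange(7) / 30.0
event_times = initiation_times(toy_counter, toy_timestamps)
```

The resulting array of times is the kind of data an events container would store.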

Conclusion: just keep centroid_x_mm, centroid_y_mm, height_ave_mm, and angle_unwrapped since the other variables can be reconstructed from those 4.

Interesting, they didn't track any key points on the subject other than the centroid? Normally pose estimation involves more detailed skeletons with several nodes and optional edges between them

While the velocity and acceleration can technically be calculated from the position we do often just include it in the file as the TimeSeries Ben mentioned for visualization purposes (sorry if that's confusing against our rule of no 'derived' data; it's more like, 'some' derived data; more precisely it comes down to how many and if any hyper-parameters went into calculating the derived data)

CodyCBakerPhD commented 1 year ago

Misc

Remind me though, these are all columns of the dlight_raw_photometry file right? Trying to understand why there's ogen related stuff in here (or did they mix ogen and photom for some sessions?)

fs = sampling_frequency = 30Hz

sampling frequency of what though? That's ecephys electrode level resolution

stim_frequency = optogenetic stimulation frequency = 25Hz
stim_duration = optogenetic stimulation duration = {0.25, 2.0, 3.0} seconds
pulse_width = 0.005 seconds = 5ms

Sounds like useful ogen metadata

frame_index = index of timestamp

Man, this is where I get really confused again with the timestamp field, which I thought now was the timestamps of the behavior camera?

feedback_status = {-1, 0, 1} = some kind of metadata, not sure what tho

Sounds like we should ask the authors about this one

CodyCBakerPhD commented 1 year ago

Looks like plenty to get started then; I would suggest a separate, modular PR adding a separate DataInterface for each column that uniquely maps onto one or more neurodata types; the velocity, position, acceleration can all be one data interface for example; the most detailed interface will probably be the photometry values, which will add those various ROI response series (DFF traces) and their constituent links

Also feel free to get started on a basic NWBConverter that fills in all the NWB file and Subject metadata for a single session

As always let me know if (or when) you have any questions~

pauladkisson commented 1 year ago

Inconsistencies with Kinematic Data

I was trying to replicate Figure 1d from the base kinematic variables (x, y, height, angle) and I discovered some inconsistencies between the figure panels, the data, and how they are described in the methods.

In the methods (Computing 2D and 3D velocity), they state "Then, the velocity was computed from the difference in position between every 2 frames and divided by 2 (to provide a smoother estimate of velocity)." But the velocity_2d_mm field in the dataframe is actually just the 1-diff of x and y i.e.
- velocity_2d_mm = sqrt( centroid_x_mm.diff(1)**2 + centroid_y_mm.diff(1)**2 )
acceleration_2d_mm is the 2-diff of velocity_2d_mm (i.e. acceleration_2d_mm = velocity_2d_mm.diff(2)), making its units mm/(2 * frame^2), but in Figure 1d, the acceleration shown is acceleration_2d_mm * fs instead of acceleration_2d_mm * fs**2 / 2 like it should be to properly convert to mm/s^2.
Similarly, velocity_angle is the 2-diff of angle_unwrapped (and flipped by -1 for some reason), but that isn't taken into account in Figure 1d.

I haven't fully explored all the downstream analysis, so I'm not sure how they use these kinematic variables, but based on the notebook that plots Figure 1d, I would guess there will be divergences like these in how they deal with units mm/(n*frame^x) --> mm/s^x.

@CodyCBakerPhD, how do you think we should deal with this in NWB file? The ideas I have are to

1. Blindly store all the kinematic variables in the dataframe and make a note of the units (mm/frame, mm/2-frame^2, etc.)
2. Correct the units of the kinematic variables to SI (m/s, m/s^2, etc.) and potentially alter downstream analysis (or correct downstream analysis?)
3. Insist on data that is more raw ex. depth video and just store that
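
For the unit-correction option, a minimal sketch of the acceleration conversion described above, checked on synthetic data with a known answer. It rebuilds the velocity as a 1-frame difference and the acceleration as a 2-frame difference of that velocity, then applies fs**2 / 2 and converts mm to m. The function name and the toy trajectory are hypothetical.

```python
import numpy as np
import pandas as pd

FS = 30.0  # sampling rate in Hz, from the fs column
MM_PER_M = 1000.0


def acceleration_2d_si(centroid_x_mm, centroid_y_mm, fs=FS):
    """Rebuild acceleration_2d_mm as described above and convert it to m/s^2."""
    # 1-frame difference of position -> mm/frame
    velocity = np.sqrt(centroid_x_mm.diff(1) ** 2 + centroid_y_mm.diff(1) ** 2)
    # 2-frame difference of velocity -> (mm/frame) per 2 frames
    acceleration = velocity.diff(2)
    # each frame lasts 1/fs seconds and the difference spans 2 frames
    return acceleration * fs**2 / 2 / MM_PER_M


# Toy check: constant acceleration of 1 m/s^2 along x, sampled at 30 Hz.
t = np.arange(10) / FS
x_mm = pd.Series(0.5 * 1000.0 * t**2)
y_mm = pd.Series(np.zeros_like(t))
acc = acceleration_2d_si(x_mm, y_mm)
```

The first three samples are NaN because of the stacked differences; the remaining samples recover the 1 m/s^2 that went in.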

CodyCBakerPhD commented 1 year ago

1/2. m/s seems like the best units to write to NWB. It's a Best Practice to use SI units and 'frame' is kind of like 'pixel'; it's acceptable if there is no known way to convert it to scientific units, but it seems in this case there is.

We should also have the raw video for the position tracking, yes. That would enable finer grain pose tracking in the future if someone wanted to

pauladkisson commented 1 year ago

timestamp = time series for each 30min session in seconds (0-1800s)

Can you elaborate on this? Is it an array from 0 to 1800? Is it equivalent to np.arange(0, 1800)?

Almost but not quite. For example, the example session (used in Figure 1d) starts with a timestamp = 3.333375s, has some NaNs, as well as some small deviations from the stated sampling frequency, and a few entries that skip by 2dt.
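
A minimal sketch of how these irregularities can be quantified for one session, assuming a nominal rate of 30 Hz. Steps are measured between consecutive non-NaN timestamps, so a step that spans a NaN entry is counted as a multi-frame step too. The function name and toy values are hypothetical (only the 3.333375 s start comes from the example session).

```python
import numpy as np


def describe_timestamps(timestamps, fs=30.0):
    """Summarize NaNs, multi-frame steps, and jitter in a timestamp vector."""
    t = np.asarray(timestamps, dtype=float)
    valid = t[~np.isnan(t)]
    nominal_dt = 1.0 / fs
    dt = np.diff(valid)
    # Nearest whole number of frames covered by each step.
    n_frames = np.rint(dt / nominal_dt)
    return {
        "first_timestamp": float(valid[0]),
        "n_nan": int(np.isnan(t).sum()),
        "n_multi_frame_steps": int((n_frames >= 2).sum()),
        "max_jitter": float(np.max(np.abs(dt - n_frames * nominal_dt))),
    }


# Toy timestamps: one NaN, one skipped frame, and 1 ms of jitter.
t0, dt0 = 3.333375, 1.0 / 30.0
toy = [t0, t0 + dt0, np.nan, t0 + 3 * dt0, t0 + 4 * dt0 + 0.001, t0 + 6 * dt0]
summary = describe_timestamps(toy)
```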

pauladkisson commented 1 year ago

area = brain region = 'dls' or 'dms'

Fine to use w/e convention they recorded, but curious if these are actual Allen Brain Atlas references or if they use their own convention or another atlas; see what you can find, otherwise we'll just ask

They have exact stereotactic coordinates (AP, ML, DV) for the DLS implantation in the methods (Stereotaxic surgery for open field photometric recordings).

pauladkisson commented 1 year ago

session_number = number to describe the order of the session = -21, -20, ..., 0 (not sure why it counts from negative)

Does the order map perfectly to what you would get if you ordered each by the session start time?

It does for the non-stimulated mice (dls-dlight-1, dls-dlight-2, ..., dms-dlight-1, ...). But it does not for the optogenetically stimulated mice (dlight-chrimson-1, ...). For those mice, the session numbers repeat every 4-6 sessions, e.g. (-1, 0, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, ...)
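
For reference, a check along these lines could be sketched as below. The column names (`mouse_id`, `session_number`, `session_start`) and the toy dataframe are assumptions for illustration, not necessarily the names used in the parquet files.

```python
import pandas as pd


def session_numbers_follow_start_times(
    df, mouse_col="mouse_id", number_col="session_number", start_col="session_start"
):
    """Per mouse, check that sorting sessions by start time gives strictly increasing session numbers."""
    # One row per session, even if the dataframe has many rows per session
    sessions = df[[mouse_col, number_col, start_col]].drop_duplicates()
    result = {}
    for mouse, group in sessions.groupby(mouse_col):
        numbers = group.sort_values(start_col)[number_col]
        result[mouse] = bool(numbers.is_monotonic_increasing and numbers.is_unique)
    return pd.Series(result)


# Toy data: one mouse with ordered numbers, one with repeating numbers
toy = pd.DataFrame(
    {
        "mouse_id": ["dls-dlight-1"] * 3 + ["dlight-chrimson-1"] * 6,
        "session_number": [-2, -1, 0, -1, 0, 1, 2, 1, 2],
        "session_start": pd.to_datetime(
            ["2021-01-01", "2021-01-02", "2021-01-03"]
            + ["2021-02-01", "2021-02-02", "2021-02-03", "2021-02-04", "2021-02-05", "2021-02-06"]
        ),
    }
)
ordered = session_numbers_follow_start_times(toy)
```

Requiring uniqueness as well as monotonicity is what flags the repeating pattern, since repeated numbers can never be strictly increasing in time.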

pauladkisson commented 1 year ago

centroid_x_mm = centroid x-location of the mouse estimated by OpenCV findContours

Are you able to reproduce it exactly with the OpenCV function? If so can you note what version of OpenCV just for super clear provenance of reproducibility?

I don't have access to the raw data to be able to check this. The note about OpenCV findcontours is using a quote from the methods of the paper: Methods: Pre-Processing "Next, the location of the mouse was identified by finding the centroid of the contour with the largest area using the OpenCV findcontours function."

pauladkisson commented 1 year ago

Anything that doesn't match the paper should be confirmed and brought to the authors' attention; were you able to run the notebook to generate the figure in this way? Or otherwise see anything in the notebook that might explain the discrepancy?

Yes, I have been using one of the notebook files in their repo: notebook_panels/fig_1d-dLight example.ipynb. I can show you tomorrow in our meeting in more detail.

pauladkisson commented 1 year ago

Interesting, they didn't track any key points on the subject other than the centroid? Normally pose estimation involves more detailed skeletons with several nodes and optional edges between them

They definitely did track keypoints (see Extended Data Fig. 3). And there's even a keypoints_raw_data folder in the data repo. But everything inside is MATLAB .p files, which are generally like executables (I think?), so not really accessible. Probably something I need to look into, though.

CodyCBakerPhD commented 1 year ago

They have exact stereotactic coordinates (AP, ML, DV) for the DLS implantation in the methods (Stereotaxic surgery for open field photometric recordings).

Oh, I meant the acronym (the abbreviation for the brain area), to which my question still holds (are they Allen Mouse conventions)?

For exact coordinates, are they CCFv3?

Almost but not quite. For example, the example session (used in Figure 1d) starts with a timestamp = 3.333375s, has some NaNs, as well as some small deviations from the stated sampling frequency, and a few entries that skip by 2dt.

Which dt is that? That is, what data stream are these timestamps for then? Since they are irregular we will definitely want to use them

It does for the non-stimulated mice (dls-dlight-1, dls-dlight-2, ..., dms-dlight-1, ...). But it does not for the optogenetically stimulated mice (dlight-chrimson-1, ...). For those mice, the session numbers repeat every 4-6 sessions, e.g. (-1, 0, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, ...)

Might be something to ask about then

pauladkisson commented 1 year ago

This issue is getting crowded, so I'm moving future experiment/data notes to separate issues (see #11)

catalystneuro / datta-lab-to-nwb

Initial Inspection #1