Focusing on dlight_photometry_processed_full_transfer.parquet
@pauladkisson Can you post a full list of the 87 column names? Then we can start putting together a list of exclusions (for confirmed duplicates and other things we don't want making it into NWB) as well as finalize how to map the remaining important columns to NWB neurodata types.
Summary of 3 identified experiments thus far:
1. Initial imaging of the area of interest: just a few microscopy images.
2. Photometry: all data stored in dlight_photometry_processed_full_transfer.parquet. Many sessions and many subjects, but thankfully a UUID mapping is provided in the corresponding TOML.
3. Pulsed stimulation: all data in 3s-pulsed-stim-dataframe.parquet; will focus on this one after the photometry.
Digging into the dlight_raw_data/dlight_photometry_processed_full.parquet file:
Conclusion: Just use "predicted_syllable (offline)" since it's the one shown in figure 1d.
Conclusion: Since it's a derived data stream, we should omit it from the NWB file.
start_time and StartTime are identical to date but less complete (some NaNs) --> stick with date
session_name and SessionName do not overlap, i.e. all entries with a valid session_name have None for SessionName and vice versa.
mouse_id/subject_name/SubjectName. Conclusion: all this info is available in the mouse_id column --> that's what I'll use
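For reference, a quick pandas sketch of the overlap checks described above (column names are from these notes; the exact comparisons are illustrative):

```python
import pandas as pd

df = pd.read_parquet("dlight_raw_data/dlight_photometry_processed_full.parquet")

# start_time/StartTime match date wherever they are present, but contain NaNs
mask = df["start_time"].notna()
assert (df.loc[mask, "start_time"] == df.loc[mask, "date"]).all()

# session_name and SessionName are never both populated on the same row
assert not (df["session_name"].notna() & df["SessionName"].notna()).any()
```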
Conclusion: optogenetic triggering info seems to be an artifact of the aggregation process of the 8 opto-da mice with the 6 pure FP mice --> ignore for now until we get to the opto-da experiment data.
timestamp = time series for each 30min session in seconds (0-1800s)
velocity_2d_mm = sqrt( centroid_x_mm.diff()**2 + centroid_y_mm.diff()**2 )
velocity_3d_mm = sqrt( centroid_x_mm.diff()**2 + centroid_y_mm.diff()**2 + height_ave_mm.diff()**2 )
acceleration_2d_mm = velocity_2d_mm.diff(2)
velocity_angle != angle_unwrapped.diff(2); instead velocity_angle = angle_unwrapped.diff(2) * -1
height velocity from Figure 1d = height_ave_mm.diff(2) * 30 / 2 (average 2-frame height velocity), but it's not equal to velocity_height in the dataframe...
Conclusion: just keep centroid_x_mm, centroid_y_mm, height_ave_mm, and angle_unwrapped since the other variables can be reconstructed from those 4.
dlight_reref = signal_dff - uv_reference_fit
Conclusion: Need to keep signal_dff, reference_dff, uv_reference_fit, reference_dff_fit, and the summary/metadata fields filter_params, signal_max, reference_max, signal_reference_corr, and snr. All other columns can easily be derived from these.
Conclusion: Pack up useful metadata like fs and ignore all opto-da stuff for now
@pauladkisson This is great; I'll go over this in detail later today and provide next steps for assembling the conversion
The basic idea will be to make one data interface for each data stream, and put them together in an NWBConverter.
This project reminds me of how the IBL structured access to their data; that's a good reference for how the final product should look.
Notably, in each data interface in the NWBConverter I attached a certain object as an attribute during __init__ - that object was the analogue of a single large table. So something you will need here is to attach similar information about which columns of the table need to be loaded and which rows correspond to a given mouse ID (we usually do one NWB file per subject).
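A minimal sketch of that pattern (the class and argument names here are hypothetical; assumes neuroconv's BaseDataInterface):

```python
import pandas as pd
from neuroconv.basedatainterface import BaseDataInterface


class DLightPhotometryInterface(BaseDataInterface):
    """Hypothetical interface for one data stream in the shared parquet table."""

    def __init__(self, file_path: str, columns: list, mouse_id: str):
        super().__init__(file_path=file_path)
        # Load only the columns this interface maps to NWB, plus the subject key,
        # then keep only the rows for this subject (one NWB file per subject)
        df = pd.read_parquet(file_path, columns=columns + ["mouse_id"])
        self.df = df[df["mouse_id"] == mouse_id]
```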
Conclusion: Just use "predicted_syllable (offline)" since it's the one shown in figure 1d.
Sounds good
Conclusion: Since it's a derived data stream we should omit from NWB file
Yep
date = full datetime (YYYY-MM-DD HH:MM:SS) of the start of each experimental session
This shall serve as the session_start_time for each NWB file then.
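A hedged sketch of that mapping (the example values and the timezone are assumptions to be confirmed):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

from pynwb import NWBFile

# 'date' would be parsed from the dataframe; the timezone is an assumption
session_start_time = datetime(2020, 1, 1, 12, 0, 0, tzinfo=ZoneInfo("US/Eastern"))
nwbfile = NWBFile(
    session_description="placeholder, see the session_name discussion below",
    identifier="placeholder-uuid",  # the 'uuid' column, discussed below
    session_start_time=session_start_time,
)
```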
timestamp = time series for each 30min session in seconds (0-1800s)
Can you elaborate on this? Is it an array from 0 to 1800? Is it equivalent to np.arange(0, 1800)?
uuid = unique identifier for each session (apparent from Figure 1d notebook and .toml file)
Cool, similar to IBL this will serve as our session_id
unique_idx = integer index for each uuid
Better to use the UUID if this is a one-to-one mapping
session_name/SessionName = description of the type of session, e.g. "Recording Session 1", "Stim session 2", etc. Session descriptions are not unique and have typo duplicates such as "Habituation 1", "habituation 1", "habituation 1 ", etc.
If you can make a condensed mapping of the roughly 'unique' descriptors from the overlap of the two fields, we can include it in the session_description to provide more context.
session_number = number to describe the order of the session = -21, -20, ..., 0 (not sure why it counts from negative)
Does the order map perfectly to what you would get if you ordered each by the session start time?
session_repeat = 0 for the first session of the day, 1 for the second session
Hmm... seen this before, so there's some precedent for indicating the number on the day. OK with including this in the session description along with the other annotation.
session_total_number = ???
Probably for getting context of run # per day? like, number '2 out of 3' on the day
stim_session = a number for each day of recording = -27, -27, -26, -26, ... (not sure why it counts from negative)
Sounds like a good one to ask about, might be relevant to something (or might not be)
mouse_id/subject_name/SubjectName
Yep, as agreed from the meeting just use mouse_id
area = brain region = 'dls' or 'dms'
Fine to use whatever convention they recorded, but curious if these are actual Allen Brain Atlas references or if they use their own convention or another atlas; see what you can find, otherwise we'll just ask
opsin = 'chrimson' if mouse implanted with optogenetic stimulator, otherwise 'n/a'
Hm... this one is interesting... normally we'd use the indicator field of an ImagingPlane, but that's for optical imaging.
Since this is ogen/photometry, I guess it would have to be some metadata annotated at the subject level, which is why they have it here instead.
genotype = area + opsin = {None, 'dls-chrimson-dlight', 'dls-dlight', 'dms-dlight'}
In general you can check the NWB schema for Subjects, though actually the docstring for the PyNWB type might be more useful
Looks like this ought to map directly onto the NWB genotype field, and would include the opsin info above, so the opsin could be ignored in the NWB mapping.
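i.e., something like this sketch (the field values are illustrative, taken from the notes above):

```python
from pynwb.file import Subject

subject = Subject(
    subject_id="dls-dlight-1",       # from the mouse_id column
    species="Mus musculus",
    genotype="dls-chrimson-dlight",  # area + opsin, as stored in the dataframe
)
```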
Conclusion: optogenetic triggering info seems to be an artifact of the aggregation process of the 8 opto-da mice with the 6 pure FP mice --> ignore for now until we get to the opto-da experiment data.
Sounds fair given what you've summarized
Mostly looks like metadata about ogen itself and annotation around it, would need the raw ogen data plus Q/A with authors to understand more
signal_dff = dlight emission signal (delta F / F0 from blue light component); green signal in Figure 1d
reference_dff = dlight reference signal (from isosbestic UV component); grey control in Figure 1d
uv_reference_fit = smoothed UV reference signal (see Methods: Photometry active referencing)
dlight_reref = dlight emission signal normalized by the UV reference fit: dlight_reref = signal_dff - uv_reference_fit
So interesting to see how different labs use photometry...
This is much more like an ophys treatment (well, segmented ROIs anyway)
We normally see these referred to as baseline (the 'reference_dff'/'reference_fit' here) vs. detrended (the 'reref' here; the 'signal_dff' would be recoverable given both of those, so it was deemed redundant to include all 3). We often see this applied to the fluorescence directly rather than the delta'd fluorescence, but it ought to be a distributive property, so the order of operations shouldn't matter.
Anyway these look to fit the parameters of the photometry extension pretty well (different response series for each)
Conclusion: Need to keep signal_dff, reference_dff, uv_reference_fit, reference_dff_fit, and summary/metadata fields: filter_params, signal_max, reference_max, signal_reference_corr, snr. All other columns can easily be derived from these ones.
Yep, sounds good enough to prototype
Which set of experiments did this data correspond to? (or was it both?)
camera_timestamp = time stamp recorded by depth camera --> redundant with timestamp
Can you explain this structure a bit more? Is it a vector of timing information, roughly ~30 FPS? Is it irregular? Do you have it for each session?
centroid_x_mm = centroid x-location of the mouse estimated by OpenCV findcontours
Are you able to reproduce it exactly with the OpenCV function? If so can you note what version of OpenCV just for super clear provenance of reproducibility?
velocity_2d_mm = velocity in 2d (x and y); velocity is measured in mm/frame so it needs to be corrected by the sampling rate (30) = sqrt( centroid_x_mm.diff()**2 + centroid_y_mm.diff()**2 )
velocity_3d_mm = same as above except velocity measures changes in height as well as x-y position = sqrt( centroid_x_mm.diff()**2 + centroid_y_mm.diff()**2 + height_ave_mm.diff()**2 )
acceleration_2d_mm = acceleration in 2d (x and y), also needs to be corrected by the sampling rate
acceleration_3d_mm = acceleration in 3d (x-y + height)
jerk_2d_mm and jerk_3d_mm = same as above for jerk
height_ave_mm = height above the floor in mm
Height will be included as an additional axis on the x/y SpatialSeries (which should be in a Position container); see the sketch below.
They didn't estimate the third axis in the velocity and acceleration values? Not that the mouse would be jumping around or anything lol, just curious
I guess that would explain why it's a separate column altogether...
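Here is a minimal sketch of that layout (placeholder arrays stand in for the dataframe columns; the reference frame and the regular 30 fps rate are assumptions):

```python
import numpy as np
from pynwb.behavior import Position, SpatialSeries

# Placeholder arrays standing in for the dataframe columns
centroid_x_mm = np.random.rand(100)
centroid_y_mm = np.random.rand(100)
height_ave_mm = np.random.rand(100)

spatial_series = SpatialSeries(
    name="SpatialSeries",
    description="(x, y, height) of the mouse centroid",
    data=np.column_stack([centroid_x_mm, centroid_y_mm, height_ave_mm]) / 1000.0,  # mm -> meters
    reference_frame="(0, 0, 0) is an assumed corner of the open field arena",
    rate=30.0,  # nominal 30 fps; use the irregular timestamps instead if needed
    starting_time=0.0,
)
position = Position(spatial_series=spatial_series)
```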
angle and angle_unwrapped = should be the orientation of the mouse:
angle_unwrapped = angle in radians, but drifts throughout the experiment
angle = angle in radians, but has weird discontinuities (not restricted to the range 0-2pi)
Using angle_unwrapped seems to be the better option (and it tracks better with velocity_angle).
The official Best Practice on data wrapping applies here (OK, OK, it technically only applies to SpatialSeries, but this is a sister data type of that).
Note however that Best Practices are not super hard-and-fast rules for the most part; that one especially comes down to whichever form you think is more useful in the context of understanding the paper or the dataset
So I think it would be fine to use the unwrapped version as you indicate
velocity_angle = angular velocity in rad/s (with correction needed); matches Figure 1d, BUT velocity_angle != angle_unwrapped.diff() * 30; instead it is slightly different and flipped by a factor of (-1)...
velocity_height = height velocity in mm/s (with correction needed), BUT not consistent with Figure 1d; height velocity from Figure 1d = height_ave_mm.diff(2) * 30 / 2 (average 2-frame height velocity), but it's not equal to velocity_height in the dataframe...
Anything that doesn't match the paper should be confirmed and brought to the authors' attention; were you able to run the notebook to generate the figure in this way? Or otherwise see anything in the notebook that might explain the discrepancy?
width_mm and length_mm = ??? maybe the width and length of the bounding box for the mice, but they fluctuate a lot: length = (21, 75), width = (18, 33) for just one session
Might be good to confirm with the author; bounding information could be useful metadata to annotate
movement_initations = moments when mice transitioned from stillness to motion, represented by an incrementing variable; see Methods: Movement initiation analyses
This could be useful to keep as events; see the ndx-events extension
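e.g., a sketch with ndx-events (the event times here are placeholders; they would be derived from the incrementing column, and the extension API may differ by version):

```python
import numpy as np
from ndx_events import Events

movement_events = Events(
    name="MovementInitiations",
    description="Times at which the mouse transitioned from stillness to motion.",
    timestamps=np.array([12.3, 45.6, 78.9]),  # placeholder times in seconds
)
```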
Conclusion: just keep centroid_x_mm, centroid_y_mm, height_ave_mm, and angle_unwrapped since the other variables can be reconstructed from those 4.
Interesting, they didn't track any key points on the subject other than the centroid? Normally pose estimation involves more detailed skeletons with several nodes and optional edges between them
While the velocity and acceleration can technically be calculated from the position, we do often just include them in the file as TimeSeries, as Ben mentioned, for visualization purposes (sorry if that's confusing against our rule of no 'derived' data; it's more like 'some' derived data; more precisely, it comes down to how many, if any, hyper-parameters went into calculating the derived data). See the sketch after this reply.
Remind me though, these are all columns of the dlight_raw_photometry file, right? Trying to understand why there's ogen-related stuff in here (or did they mix ogen and photometry for some sessions?)
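Sketch of the velocity-as-TimeSeries idea (placeholder arrays; the mm/frame correction follows the notes above):

```python
import numpy as np
from pynwb import TimeSeries

fs = 30.0  # frames per second

# Placeholders for the dataframe columns
centroid_x_mm = np.random.rand(100)
centroid_y_mm = np.random.rand(100)

# 1-frame diff in mm/frame, corrected by fs and converted to m/s
velocity_2d = np.sqrt(np.diff(centroid_x_mm) ** 2 + np.diff(centroid_y_mm) ** 2) * fs / 1000.0
velocity_series = TimeSeries(
    name="Velocity2D",
    description="2-D speed of the mouse centroid, derived from position.",
    data=velocity_2d,
    unit="m/s",
    rate=fs,
    starting_time=0.0,
)
```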
fs = sampling_frequency = 30Hz
sampling frequency of what though? That's ecephys electrode level resolution
stim_frequency = optogenetic stimulation frequency = 25Hz
stim_duration = optogenetic stimulation duration = {0.25, 2.0, 3.0} seconds
pulse_width = 0.005 seconds = 5ms
Sounds like useful ogen metadata
frame_index = index of timestamp
Man, this is where I get really confused again with the timestamp field, which I thought was the timestamps of the behavior camera?
feedback_status = {-1, 0, 1} = some kind of metadata, not sure what tho
Sounds like should ask author on this one
Looks like plenty to get started then; I would suggest a separate, modular PR adding a separate DataInterface for each column that uniquely maps onto one or more neurodata types; the velocity, position, and acceleration can all be one data interface, for example; the most detailed interface will probably be the photometry values, which will add those various ROI response series (DFF traces) and their constituent links.
Also feel free to get started on a basic NWBConverter that fills in all the NWB file and Subject metadata for a single session (a sketch follows below).
As always let me know if (or when) you have any questions~
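A minimal sketch of that converter (interface names are hypothetical; DLightPhotometryInterface refers to the earlier sketch):

```python
from neuroconv import NWBConverter


class DLightNWBConverter(NWBConverter):
    # One entry per data stream; each value is a DataInterface class, like the
    # hypothetical DLightPhotometryInterface sketched earlier
    data_interface_classes = dict(
        Photometry=DLightPhotometryInterface,
    )
```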
I was trying to replicate Figure 1d from the base kinematic variables (x, y, height, angle) and I discovered some inconsistencies between the figure panels, the data, and how they are described in the methods.
The velocity_2d_mm field in the dataframe is actually just the 1-diff of x and y, i.e.
velocity_2d_mm = sqrt( centroid_x_mm.diff(1)**2 + centroid_y_mm.diff(1)**2 )
acceleration_2d_mm = velocity_2d_mm.diff(2), making its units mm/(2 * frame^2), but in Figure 1d the acceleration shown is acceleration_2d_mm * fs instead of acceleration_2d_mm * fs**2 / 2 like it should be to properly convert to mm/s^2.
velocity_angle is the 2-diff of angle_unwrapped (and flipped by -1 for some reason), but that isn't taken into account in Figure 1d.
I haven't fully explored all the downstream analysis, so I'm not sure how they use these kinematic variables, but based on the notebook that plots Figure 1d, I would guess there will be divergences like these in how they deal with units mm/(n*frame^x) --> mm/s^x.
@CodyCBakerPhD, how do you think we should deal with this in the NWB file? The ideas I have are to
1/2. m/s seems like the best units to write to NWB. It's a Best Practice to use SI units, and 'frame' is kind of like 'pixel'; it's acceptable if there is no known way to convert it to scientific units, but it seems in this case there is.
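In code, the corrections discussed above would look something like this (placeholder values; fs = 30 frames/s):

```python
import numpy as np

fs = 30.0  # frames per second
velocity_2d_mm = np.asarray([1.0, 1.2, 0.9])        # placeholder column values
acceleration_2d_mm = np.asarray([0.1, -0.2, 0.05])  # placeholder column values

velocity_si = velocity_2d_mm * fs / 1000.0                 # mm/frame -> m/s
acceleration_si = acceleration_2d_mm * fs**2 / 2 / 1000.0  # mm/(2*frame^2) -> m/s^2
```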
timestamp = time series for each 30min session in seconds (0-1800s)
Can you elaborate on this? Is it an array from 0 to 1800? Is it equivalent to np.arange(0, 1800)?
Almost but not quite. For example, the example session (used in Figure 1d) starts with a timestamp = 3.333375s, has some NaNs, as well as some small deviations from the stated sampling frequency, and a few entries that skip by 2dt.
area = brain region = 'dls' or 'dms'
Fine to use whatever convention they recorded, but curious if these are actual Allen Brain Atlas references or if they use their own convention or another atlas; see what you can find, otherwise we'll just ask
They have exact stereotactic coordinates (AP, ML, DV) for the DLS implantation in the methods (Stereotaxic surgery for open field photometric recordings).
session_number = number to describe the order of the session = -21, -20, ..., 0 (not sure why it counts from negative)
Does the order map perfectly to what you would get if you ordered each by the session start time?
It does for the non-stimulated mice (dls-dlight-1, dls-dlight-2, ..., dms-dlight-1, ...). But it does not for the optogenetically stimulated mice (dlight-chrimson-1, ...). For those mice the session numbers repeat every 4-6 sessions ex. (-1, 0, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, ...)
centroid_x_mm = centroid x-location of the mouse estimated by OpenCV findcontours
Are you able to reproduce it exactly with the OpenCV function? If so can you note what version of OpenCV just for super clear provenance of reproducibility?
I don't have access to the raw data to be able to check this. The note about OpenCV findcontours is using a quote from the methods of the paper: Methods: Pre-Processing "Next, the location of the mouse was identified by finding the centroid of the contour with the largest area using the OpenCV findcontours function."
Anything that doesn't match the paper should be confirmed and brought to the authors' attention; were you able to run the notebook to generate the figure in this way? Or otherwise see anything in the notebook that might explain the discrepancy?
Yes, I have been using one of the notebook files in their repo: notebook_panels/fig_1d-dLight example.ipynb. I can show you tomorrow in our meeting in more detail.
Interesting, they didn't track any key points on the subject other than the centroid? Normally pose estimation involves more detailed skeletons with several nodes and optional edges between them
They definitely did track keypoints (see Extended Data Fig. 3). And there's even a keypoints_raw_data folder in the data repo. But everything inside is Matlab .p files, which are generally like executables (I think?), so not really accessible. Probably something I need to look into tho.
They have exact stereotactic coordinates (AP, ML, DV) for the DLS implantation in the methods (Stereotaxic surgery for open field photometric recordings).
Oh, I meant the acronym (the abbreviation for the brain area), to which my question still holds (are they Allen Mouse conventions)?
For exact coordinates, are they CCFv3?
Almost but not quite. For example, the example session (used in Figure 1d) starts with a timestamp = 3.333375s, has some NaNs, as well as some small deviations from the stated sampling frequency, and a few entries that skip by 2dt.
Which dt is that? That is, what data stream are these timestamps for then? Since they are irregular, we will definitely want to use them.
It does for the non-stimulated mice (dls-dlight-1, dls-dlight-2, ..., dms-dlight-1, ...). But it does not for the optogenetically stimulated mice (dlight-chrimson-1, ...). For those mice the session numbers repeat every 4-6 sessions ex. (-1, 0, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, ...)
Might be something to ask about then
This issue is getting crowded, so moving future experiment/data notes to separate issues (see #11).
Overview
This issue serves as a public notebook for this conversion project
Miscellaneous Notes
Folder/File Descriptions
Files are described in ascending Name order as they appear in the unzipped zenodo dataset.

dlight_intermediate_results
3min_max_dlight_0-0.3s_vs_usage_shuffled_agg.parquet
  shape = (281, 1000)
  rows [-7.00, -6.95, ..., 7.00] = lag in ms?
  columns [0, 1, ..., 999] = syllables?
dlight-chrimson_snippets_offline_features.parquet
  index [0, 0, 0, ..., 0, 1, ..., 1, ..., 1844354] = snippet in time?
  [signal_<xyz> = NaN, window]
lagged_analysis_session_bins.toml
  session_bins = [ 0, 360, 660, 900, 1680, 1800,] = session starts in seconds?
performance-prediction-error-distances-and-dopamine.parquet
syllable_stats_photometry_offline.toml
syllable-classifier-from-dlight-amplitude-submission
  dlight: accuracy, n_classes, threshold, and type = {dLight, shuffle}
syllable-classifier-submission
dlight_raw_data
3s-pulsed-stim-dataframe.parquet
dlight_photometry_processed_full_transfer.parquet
dlight_photometry_processed_full.toml
hek_raw_data
keypoints_raw_data
miscellaneous_intermediate_results
autoencoder_characterization.parquet
misc_raw_data
autoencoder_test_data.h5
f1_scores_estimates_actual_calls.parquet
latencies_stim_arduino_test.dat
latencies_stim.parquet
optoda_intermediate_results
behavioral_classes.toml
dive = [64, 65, 75, 95, 19, 71, 8]
behavioral-distance.parquet
closed_loop_learners.toml
da-vs-learning-per-syllable.parquet
joint_syllable_map.toml
syllable_stats_offline.toml AND syllable_stats_offline.toml
optoda_raw_data
closed_loop_behavior_transfer.parquet
closed_loop_behavior_velocity_conditioned.parquet
closed_loop_behavior_with_simulated_triggers_transfer.parquet
learning_aggregate.parquet
learning_timecourse_binsize-30.parquet
learning_timecourse_processed_summary.parquet
learning_timecourse_processed.parquet
realtime_package
rl_intermediate_results
rl_model_heldout_results_best_lag_rands.parquet
rl_model_heldout_results_lags.parquet
rl_model_parameters.toml AND rl_model_stats.toml
rl_raw_data
rl_modeling_dlight_data_offline.parquet
rl_modeling_dlight_data_online.parquet
TODOs
Based on initial inspection, I need to figure out the following