AllenInstitute / AllenSDK

code for reading and processing Allen Institute for Brain Science data
https://allensdk.readthedocs.io/en/latest/

VBN session 1140102579 partially missing running data #2425

Closed egmcbride closed 2 years ago

egmcbride commented 2 years ago

Describe the bug
VBN ecephys session 1140102579 appears to be missing some running data.

To Reproduce

from pynwb import NWBHDF5IO
from allensdk.brain_observatory.ecephys.behavior_ecephys_session import (
    BehaviorEcephysSession)
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# raw string so the backslashes in the UNC path are not treated as escape sequences
nwb_path = r'\\allen\programs\mindscope\workgroups\np-behavior\vbn_data_release\nwbs_220429\ecephys_session_1140102579.nwb'

with NWBHDF5IO(nwb_path, 'r', load_namespaces=True) as nwb_io:
    session = BehaviorEcephysSession.from_nwb(nwbfile=nwb_io.read())

running_speed = session.running_speed

fig, ax = plt.subplots(1, 1)
ax.plot(running_speed['timestamps'], running_speed['speed'])

Expected behavior
Running data plots for the entire session.

Actual Behavior
A large chunk of running data is missing, starting at the end of the active session around 3640 seconds.

(figure: running speed vs. session time, showing the missing chunk after ~3640 seconds)
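
The gap can also be located programmatically from the running_speed table returned above. A minimal sketch, assuming the session object from the snippet above is already loaded and that a one-second cutoff is an illustrative (not official) definition of a gap:

import numpy as np

timestamps = running_speed['timestamps'].to_numpy()

# flag any inter-sample interval much longer than the ~60 Hz running cadence
dt = np.diff(timestamps)
gap_threshold = 1.0  # seconds; illustrative value, not from the original report
for idx in np.where(dt > gap_threshold)[0]:
    print(f"gap from {timestamps[idx]:.1f} s to {timestamps[idx + 1]:.1f} s "
          f"({dt[idx]:.1f} s missing)")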


danielsf commented 2 years ago

@egmcbride I think the problem may be in the sync file.

Here is a plot of all of the rising edges on the vsync_stim line recorded in that sync file. The horizontal axis is just a running count of the number of rising edges; the vertical axis is the time of the edge in seconds. The dashed red lines show the breaks (in n_frames) between the behavior, mapping, and replay stimulus blocks.

(figure: vsync_stim_rising_edges)

The jump in timestamps that you observed in the running speed data is clearly visible there. I'll look more closely at the sync file to see what could be wrong with it; I don't have a lot of experience with sync files being the root of a problem like this.

Here is the Python code that generated the plot above:

import matplotlib.figure as mfig
import pandas as pd
from allensdk.brain_observatory.sync_dataset import Dataset as SyncDataset

# stimulus pickle files for the three blocks (behavior, mapping, replay)
behavior_pkl_path = "/allen/programs/braintv/production/visualbehavior/prod0/specimen_1115936371/ecephys_session_1140102579/1140256849/1140102579_585329_20211111.behavior.pkl"
mapping_pkl_path = "/allen/programs/braintv/production/visualbehavior/prod0/specimen_1115936371/ecephys_session_1140102579/1140256849/1140102579_585329_20211111.mapping.pkl"
replay_pkl_path = "/allen/programs/braintv/production/visualbehavior/prod0/specimen_1115936371/ecephys_session_1140102579/1140256849/1140102579_585329_20211111.replay.pkl"

sync_file_path = "/allen/programs/braintv/production/visualbehavior/prod0/specimen_1115936371/ecephys_session_1140102579/1140256849/1140102579_585329_20211111.sync"

# times (in seconds) of every rising edge on the vsync_stim line
with SyncDataset(sync_file_path) as sync_dataset:
    timestamps = sync_dataset.get_rising_edges('vsync_stim', units='seconds')

fig = mfig.Figure()
axis = fig.add_subplot(1, 1, 1)

axis.plot(range(len(timestamps)), timestamps)
axis.set_xlabel('n')
axis.set_ylabel('t (seconds)')

# wrap the long sync file path across several lines of the plot title
title = 'vsync_stim rising edges from\n'
dl = 40
for ii in range(0, len(sync_file_path), dl):
    title += f'{sync_file_path[ii:ii+dl]}\n'
axis.set_title(title, verticalalignment='top', horizontalalignment='left',
               position=(0, 100))

# mark the expected frame-count boundaries between the three stimulus blocks
n0 = 0
for pkl in (behavior_pkl_path,
            mapping_pkl_path,
            replay_pkl_path):

    data = pd.read_pickle(pkl)
    if pkl == behavior_pkl_path:
        n_frames = len(data['items']['behavior']['intervalsms'])
    else:
        n_frames = len(data['intervalsms'])
    n_frames += 1
    n0 += n_frames
    axis.axvline(n0, color='r', linestyle='--', linewidth=2)

fig.tight_layout()
fig.savefig('vsync_stim_rising_edges.png')

danielsf commented 2 years ago

I've looked a little more closely at the sync file. It looks like the vsync_stim line is only "hot" 91 times between sync samples 4345152 and 7869867.

i.e.

>>> import h5py
>>> oddity0 = 4345152
>>> oddity1 = 7869867
>>> sync_file_path = "/allen/programs/braintv/production/visualbehavior/prod0/specimen_1115936371/ecephys_session_1140102579/1140256849/1140102579_585329_20211111.sync"
>>> with h5py.File(sync_file_path, 'r') as in_file:
...     data = in_file['data'][()]
... 
>>> # number of samples with the vsync_stim bit (mask 4) set inside the gap
>>> ((data[oddity0:oddity1, 1]&4)>0).sum()
91
>>> # same-size windows immediately after and before the gap, for comparison
>>> ((data[oddity1:oddity1+oddity1-oddity0, 1]&4)>0).sum()
3156899
>>> ((data[oddity0-oddity1+oddity0:oddity0, 1]&4)>0).sum()
3214510
>>> 

Those last two sums are meant to show that in the ~3.5 million sync samples preceding and following this odd gap, the vsync_stim line is on much more frequently.
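
The same check can also be made without hand-picking indices by scanning the whole line for long quiet stretches between rising edges. A rough sketch, reusing the bit-2 (mask 4) convention from the snippet above; the 100,000-sample threshold is just an illustrative choice:

import h5py
import numpy as np

sync_file_path = ("/allen/programs/braintv/production/visualbehavior/prod0/"
                  "specimen_1115936371/ecephys_session_1140102579/1140256849/"
                  "1140102579_585329_20211111.sync")

with h5py.File(sync_file_path, 'r') as in_file:
    data = in_file['data'][()]

# 0/1 trace of the vsync_stim line (bit 2 of the second column, as above)
vsync = ((data[:, 1] & 4) > 0).astype(np.int8)

# sample indices where the line goes low -> high (rising edges)
rising = np.where(np.diff(vsync) == 1)[0] + 1

# report unusually long stretches between consecutive rising edges
gap_threshold = 100_000  # samples; illustrative, not from the original analysis
for idx in np.where(np.diff(rising) > gap_threshold)[0]:
    print(f"quiet stretch from sample {rising[idx]} to {rising[idx + 1]}")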

danielsf commented 2 years ago

Here is a plot of the vsync_stim line as a function of "time" (really just the row index in the sync file's data array) for four random sync files and our problem sync file. There is clearly a gap.

(figure: compare_many_stims)

danielsf commented 2 years ago

For what it's worth: here are the start_times and stop_times for the sessions plotted above, as recorded in the sync file metadata. The problem session is recorded as being about an hour longer than the other four.

session 1140102579: 2021-11-11 14:08:48.989912 -- 2021-11-11 17:43:50.997690
session 1095340643: 2021-04-08 14:00:43.749578 -- 2021-04-08 16:39:10.013137
session 1115086689: 2021-07-13 16:11:28.214809 -- 2021-07-13 18:50:45.838626
session 1113751921: 2021-07-07 13:44:33.700674 -- 2021-07-07 16:23:07.031915
session 1052530003: 2020-09-24 14:34:05.395634 -- 2020-09-24 17:18:52.758395
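
For completeness, those start/stop times can be read straight from the sync files. A minimal sketch, assuming (as appears to be the case for these files) that the 'meta' dataset holds a JSON blob with 'start_time' and 'stop_time' fields:

import json
import h5py

sync_file_path = ("/allen/programs/braintv/production/visualbehavior/prod0/"
                  "specimen_1115936371/ecephys_session_1140102579/1140256849/"
                  "1140102579_585329_20211111.sync")

with h5py.File(sync_file_path, 'r') as in_file:
    # assumes 'meta' is a JSON-encoded string dataset in these sync files
    meta = json.loads(in_file['meta'][()])

# 'start_time' / 'stop_time' key names are an assumption based on the values above
print(meta['start_time'], '--', meta['stop_time'])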
corbennett commented 2 years ago

Looking into this session a bit more, I discovered that the mapping stimulus didn't start as expected after the behavior stimulus. There was a long delay between the two, which explains the gap in vsyncs/running data. So this isn't a problem with the sync file or the nwb file, but an experiment error.

In general, the running data for these gaps (even the typical short duration ones between stimuli) should be nans, since we don't actually measure the speed between the stimuli.

danielsf commented 2 years ago

@corbennett Currently, the code just has no timestamps or running speeds at all for times when running speed is not recorded, i.e. this session would have something like

timestamps = [..., 3999.0, 4000.0, 6000.0, 6001.0, ...]
velocity = [..., 0.11, 0.22, 0.33, 0.44, ...]

Do you want us to change it to

timestamps = [..., 3999.0, 4000.0, 4001.0, 4002.0, ..., 5999.0, 6000.0, 6001.0, ...]
velocity = [..., 0.11, 0.22, np.NaN, np.NaN, ..., np.NaN, 0.33, 0.44, ...]

?

Do we need to sample time at (approximately) the same cadence as the actual data, or is it enough to have a single np.NaN at the start and end of every gap?
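
If the NaN-padded option were adopted, the filling step might look roughly like the sketch below; the one-second gap threshold and the ~60 Hz fill cadence are illustrative assumptions, not existing AllenSDK behavior:

import numpy as np

def pad_gaps_with_nans(timestamps, velocity, gap_threshold=1.0, cadence=1.0 / 60.0):
    # walk consecutive sample pairs; whenever the interval between them exceeds
    # gap_threshold seconds, insert NaN speed samples at roughly the fill cadence
    new_t = [timestamps[0]]
    new_v = [velocity[0]]
    for t0, t1, v1 in zip(timestamps[:-1], timestamps[1:], velocity[1:]):
        if t1 - t0 > gap_threshold:
            filler = np.arange(t0 + cadence, t1, cadence)
            new_t.extend(filler)
            new_v.extend([np.nan] * len(filler))
        new_t.append(t1)
        new_v.append(v1)
    return np.array(new_t), np.array(new_v)

# example: the gap between 4000 s and 6000 s gets filled with NaN samples
t = np.array([3999.0, 4000.0, 6000.0, 6001.0])
v = np.array([0.11, 0.22, 0.33, 0.44])
t_filled, v_filled = pad_gaps_with_nans(t, v)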

danielsf commented 2 years ago

Confirmed with Corbett that the running speed data object is fine as is. Marking "won't fix."