NeurodataWithoutBorders / pynwb

A Python API for working with Neurodata stored in the NWB Format
https://pynwb.readthedocs.io
Other
175 stars 85 forks source link

Verification: ImageSeries with list of external files & its timestamps #1318

Closed CodyCBakerPhD closed 2 years ago

CodyCBakerPhD commented 3 years ago

Hey all,

I'm trying to convert a list of .mkv files into ImageSeries without actually encoding the data, that is, via the format=external option. My concern, however, is dealing with the timestamps for each of these .mkv separately.

Attempting to do the intuitive thing, and pass timestamps as a list of arrays, each element of the list corresponding to the .mkvfile, causes an error. Obviously, I can simply concatenate all the timestamps, but I'm just trying to seek verification that that is the correct approach to take, especially in the event that there is perhaps a mismatch between the number of timestamps and the number of frames for a given file (flattening the timestamps would make it much harder to debug such as issue).

Here is some simple code to replicate the issue, granted you'll have to find some example .mkv/.csv files to test it yourself.

from pynwb import NWBFile
import uuid
from pathlib import Path
from datetime import datetime
import pandas as pd
import numpy as np
from pynwb.image import ImageSeries

nwbfile = NWBFile(
    session_description="Replicating ImageSeries issue.",
    identifier=str(uuid.uuid4()),
    session_start_time=datetime.now()
)

video_folder = Path("D:/ExampleImageSeriesIssue")
video_file_path_list = [str(x.absolute()) for x in video_folder.iterdir() if x.suffix == ".mkv"]

video_timestamps = list()
for video_file_path in video_file_path_list:
    video_time_file = pd.read_csv(video_file_path.replace(".mkv", "_timestamps.csv"), header=0)
    video_timestamps.append(np.array([int(x.split(";")[1]) for x in video_time_file['frame; timestamp']]))

# Encounters an error
try:
    videos1 = ImageSeries(
        name='Videos',
        description="Videos recorded by TIS camera.",
        format="external",
        external_file=video_file_path_list,
        timestamps=video_timestamps
    )
    nwbfile.add_acquisition(videos1)
except ValueError:
    print("For Testing: attempting to make the ImageSeries returned "
          "ValueError: ImageSeries.__init__: incorrect shape for 'timestamps' (got '(2, 57409)', expected '(None,)')")

# Doesn't fail, but also can't check if the timestamps align with number of frames in external file
videos2 = ImageSeries(
    name='Videos',
    description="Videos recorded by TIS camera.",
    format="external",
    external_file=video_file_path_list,
    timestamps=np.concatenate(video_timestamps)
)
nwbfile.add_acquisition(videos2)

cc @bendichter

oruebel commented 2 years ago

Obviously, I can simply concatenate all the timestamps, but I'm just trying to seek verification that that is the correct approach to take,

Looking at the schema, I believe that is indeed the correct approach.

especially in the event that there is perhaps a mismatch between the number of timestamps and the number of frames for a given file (flattening the timestamps would make it much harder to debug such as issue).

When using external_file I believe you should always also supply the starting_frame parameter, which is missing in the example above. The starting_frame serves as an index to indicate which frames each file contains, to facilitate random access. See the here for details on how to us it: https://github.com/NeurodataWithoutBorders/nwb-schema/blob/761a0d7838304864643f8bc3ab88c93bfd437f2a/core/nwb.image.yaml#L88-L104 . With the combination of starting_frame plus timestamps should also help with debugging.

Looking at the current implementation of ImageSeries in PyNWB, I think we should probably add a check to ImageSeries.__init__ to a) require that starting_frame is provided if external_file is used and b) that starting_frame has the same length as external_file. It looks like currently starting_frame is set to [0] by default, which is fine if you only have a single external file but is incorrect for multiple external files. I would suggest to create a separate issue for this.

weiglszonja commented 2 years ago

@oruebel, I can take on this task, also created an issue for it.

CodyCBakerPhD commented 2 years ago

@oruebel Thanks for the clarification!

@weiglszonja has raised fix #1510 to ensure that starting_frame structure is indeed followed.

oruebel commented 2 years ago

I can take on this task, also created an issue for it.

Thanks @weiglszonja for your help with addressing this issue