NeurodataWithoutBorders / nwb-schema

Data format specification schema for the NWB neurophysiology data format
http://nwb-schema.readthedocs.io
Other
53 stars 16 forks source link

modify timestamps specification to allow storing of different time formats #49

Closed ajtritt closed 6 years ago

ajtritt commented 7 years ago

Users would like to store timestamps in both absolute and relative time. To achieve this, the following modification is proposed:

Proposal Add an additional top-level field called _timestampsoffset - the time difference by which timestamps are offset from being relative to the file's global starting time i.e. session_start_time. This value is 0 by default i.e. timestamps are directly relative to session_start_time. If this value is not 0, then to obtain timestamps that are directly relative to session_start_time, one must subtract this offset.

Advantages

Disadvantages

Modifications

neuromusic commented 7 years ago

doesn't this also necessitate a units field and expanding the allowed dtype?

neuromusic commented 7 years ago

"since this will explicitly specify the offset to obtain relative timestamps, eliminating the need for APIs to guess based on range." can you clarify this point?

instead of adding a new field, wouldn't it be more simple to deprecate session_start_time and replace it with timestamps_reference_time as an optional top-level field? however, if people want to use POSIX timestamps (the use case that motivated this), this still necessitates expanding the dtype to include int64 & adding a units field for timestamps, no?

nicain commented 7 years ago

Fwiw, I think that every dimension should have a required unit spec, including time

ajtritt commented 7 years ago

@neuromusic timestamps already has a unit field. Right now it is sent to the constant 'Seconds'. That would need to be changed to support this proposal. We would need to allow ints as well.

Getting rid of session_start_time does not preserve backwards compatibility. By adding this field, we preserve the theme of having timestamps relative to a global start.

Regarding my statement about API's guessing.... storing the offset provides a mechanism for explicitly specifying what stored timestamps mean. I am trying to strike the balance of keeping relative timestamps as default, but allowing users to specifying absolute time. By specifying timestamps_offset, users can store absolute time, which they may want to do queries or analyses on. However, this breaks the concept of storing timestamps relative to a start time. By storing timestamps_offset, a user is effectively saying "subtract this number from all timestamps to get timestamps relative to the global start time". By explicitly storing this number, we avoid the situation where a users or API has to inspect timestamps to determine which frame they are in. Under the rules I have laid out, It will always be the case that timestamps - _timestampsoffset will always produce timestamps that are relative to session_start_time.

neuromusic commented 7 years ago

"Under the rules I have laid out, It will always be the case that timestamps - timestamps_offset will always produce timestamps that are relative to session_start_time."

Currently, timestamps will always produce timestamps that are relative to session_start_time. This proposal adds the overhead of an additional field to resolve this data.

We are already in the middle backwards compatibility breaking changes to the schema, no? ElectrodeTable, etc. I'm happy to break backwards compatibility for increased clarity and generalization.

And to clarify, all timestamps are relative, always. The need here is not absolute timestamps per se but timestamps relative to an explicit session-agnostic date.

my updated proposal:

  1. we keep session_start_time
  2. we add timestamps_reference_time as an optional top-level field, which all timestamps in the file are referenced to

this has the added benefit that a user using POSIX timestamps does not need to compute the delta between Jan 1970 and session_start_time simply to write a file, as they would need to in your proposal.... they can just drop the POSIX reference time into the relevant field and put an experimentally-relevant value into session_start_time

ajtritt commented 7 years ago

I am a little unclear about what you are proposing. Can you tell me how one would store posix timestamps using this scheme?

neuromusic commented 7 years ago

minimal example

session_start_time = None
timestamps_reference_time = None

# then, for a given timeseries...
timestamps = <array of float64>
timestamps.dtype = float64
timestamps.unit = 'second'

POSIX example

session_start_time = None
timestamps_reference_time = '1970-01-01T00:00:00Z'

# then for a given TimeSeries....

timestamps = <an array of int64>
timestamps.datatype = int64
timestamps.unit = 'POSIX64'

non-POSIX, but "absolute" reference which is different from session start

session_start_time = '2017-08-31T10:40:01-07:00'
timestamps_reference_time = '2017-08-07T00:00:00Z'

# then for a given TimeSeries....

timestamps = <an array of float64>
timestamps.datatype = float64
timestamps.unit = 'second'

in this case, even though we designated that the session started on 8/31 using local time, all timestamps are relative to GMT at an earlier date (perhaps the animal's birth date? or first day of training? or some other experimentally-relevant time that the experimenter wants to reference their timestamps to)

oruebel commented 7 years ago

In the current format, time encodings that are not relative to the global clock would be stored in the sync group of TimeSeries http://nwb-schema.readthedocs.io/en/latest/format.html#id94 . As such, I think it might make sense to define these structures within sync. In this way TimeSeries itself would not change (i.e.., always have relative times) and a specific spec would be added to the sync folder to describe absolute times. To make the sync group easier to extend it would probably be useful to add a neurodata_type: LabTime group that people can extend to describe lab-specific time encodings.

neuromusic commented 7 years ago

@oruebel I don't understand. sync addresses a totally separate issue of tracking provenance information related to synchronizing timestamps between timeseries acquired from different hardware. AFTER sync'ing you need to put everything in the same time reference, which what this issue is concerned with.

in my proposal, all timestamps are still relative to the global clock, we're just being more explicit about the the global clock's reference.

ajtritt commented 7 years ago

@neuromusic Thanks for clarifying. So then session_start_time and timestamps_reference_time are both optional, and there are three rules for how they get interpreted?

neuromusic commented 7 years ago

yes, both optional.

three rules?

ajtritt commented 7 years ago

Yeah, the three scenarios you listed and their meanings

neuromusic commented 7 years ago

rule 1: session_start_time is metadata. while machine-readable, the specification offers no guidance on how it should be interpreted beyond what the experimenter describes, however, by convention, this should be the datetime of the experiment. rule 2: timestamps_reference_time is interpreted as the ISO8601 time where timestamp=0. rule 3: if timestamps_refernece_time is empty, then the experiment has no ISO8601 reference & its timestamps can only be referenced internally to the NWB file

n.b. I keep saying "file" but I know we want to break free from HDF5 eventually... should I say "container"? "session"?

oruebel commented 7 years ago

What I like about Andrews proposal is that it does not unnecessarily break backwards compatibility. If a user does not specify timestamps_offset then all times are relative to session_start_time as they have always been. It only adds additional functionality without breaking existing functionality.

Refining the specification of session_start_time in terms of defining exactly how it should be stored is I think in general helpful and has been raised in #50

ajtritt commented 7 years ago

After thinking about this and doing some math, these two proposals are one and the same in terms of how they get used. Here's my explanation for why I think that...

To explain this, let me define some short hands I will use below. abs := absolute timestamps rel := relative timestamps ts_off = timestamps_offset trt := timestamps_reference_time sst := session_start_time ts := timestamps POSIX(xyz) := the posix time representation of an ISO8601 encoded timestamps

Proposal 1: store offset of timestamps in field timestamps_offset

abs = ts + sst - ts_off
rel = ts - ts_off

If a user has saved absolute timestamps i.e. ts == abs and ts_off == POSIX(sst), then

rel = ts - tsoff

If a users has saved relative timestamps i.e. ts == rel and ts_off == 0, then

abs = ts + sst

Proposal 2: store reference time for timestamps in field timestamps_reference_time

abs = ts + trt
rel = ts + trt - sst

If a users has saved absolute timestamps i.e. ts == abs and trt == 1970-01-01T00:00:00Z, then

rel = ts - POSIX(sst)

If a users has saved relative timestamps i.e. ts == rel and trt=sst, then

abs = ts + trt

With that said, the issue becomes one of human interpretability, so, I think the decision is up to those who will be interpreting.

oruebel commented 7 years ago

@ajtritt that makes sense.

After looking at it a bit more, I think I like the timestamps_offset approach better, mainly because the meaning/behavior of session_start_time does not change compared to the current approach if ``timestamps_offset=0. In contrast, withtimestamps_reference_timewe change the name of the reference and, hence, require folks that are used to the concept of times relative tosession_start_timeto change to usetimestamps_reference_timeand settimestamps_reference_time=session_start_time```

Either way, I think session_start_time should remain a required field.

neuromusic commented 7 years ago

Regarding backwards compatibility, in the current schema, there is only an implicit link between session_start_time and the reference time for timestamps: "Timestamps here have all been corrected to the common experiment master-clock. Time is stored as seconds and all timestamps are relative to experiment start time." It is not clear that session_start_time is what is being referenced here. The implicit link between the two is simply a convention that I suspect is largely unused.

So the change I propose does not actually break backwards compatability but instead makes what has been implicit explicit.

The key disadvantage of Proposal 1 is that two values are necessary to represent "absolute" time (both session_start_time and timestamps_offset). Further, the offset must be computed. To represent POSIX time, this means that there is now a burden on the person creating the file to compute the time delta (in POSIX timestamps) between Jan 1 1970 and session_start_time. e.g.

session_start_time = '2017-08-31T10:40:01-07:00'
timestamps_offset = <some computed int64 representing the delta between sst & Jan 1 1970>

# then for a given TimeSeries....

timestamps = <an array of int64>
timestamps.datatype = int64
timestamps.unit = 'POSIX64'

While both proposals represent the data, if I want to save my file with a different session_start_time, then I need to recompute my timestamps_offset.

Under both proposals, on reading, I can't assume that my timestamps are in POSIX64... I need to check what they are referenced to. Under proposal 1, I need to subtract (or add?) the timestamps_offset from session_start_time and check it against '1970-01-01T00:00:00Z'. Under Proposal 2, I just have to match the string saved in timestamps_reference_time against '1970-01-01T00:00:00Z'

neuromusic commented 7 years ago

in general, NWB needs fewer required fields, but I'm willing to save that argument for another day

lfrank commented 7 years ago

Just to add my final two cents, I don't feel at all strongly about this. I'm perfectly happy for my lab to effectively ignore session_start_time by setting it to '1970-01-01T00:00:00Z' and using the (implicit) assumption (which probably should be made explicit) that session start time is the reference time for all timestamps. But if adding a field is helpful to others, that's fine with me, and I don't have a preference as to how it is done.

My final comment is that I think that if any of us were presented with code that implemented either solution, we would quickly adjust to it and then not think much more about it. As soon as the Technical Advisory Board for NWB is fully formed they can deal with these issues, but until then my suggestion would be for Andrew and Oliver to make a decision that makes sense to them and to move on so we can keep the development going.

t-b commented 7 years ago

Time is stored as seconds and all timestamps are relative to experiment start time." It is not clear that session_start_time is what is being referenced here. The implicit link between the two is simply a convention that I suspect is largely unused.

I understood it like that and also create NWB files following this convention. It was a quite a hassle to base all measurement data on this common time zero of experiment start.

neuromusic commented 7 years ago

how do I get on this technical advisory board?

lfrank commented 7 years ago

Right now the plan is for Lydia to the the Allen representative (I think), so I'd suggest touching base with her or Christoph.

nicain commented 7 years ago

My understanding of the issue at play here is that there are two distinct temporal frames of reference under discussion: 1) RealTime (i.e. our shared experience of the world, and ignoring any effects from special relativity), and ExperimentTime. The “timestamps” field of the nwb file provides a coordinate system for ExperimentTime. The original spec requires that a required point, the beginning of the experiment (session_start_time), provide a common reference point to relate this ExperimentTime coordinate system (timestamps[0]) with RealTime, and there has been discussion on this thread about using either ISO_8601 or POSIX as the coordinate system for RealTime.

More controversy arises from the modeling community that claims that this RealTime/ExperimentTime relationship is not relevant for their data, and therefore want to make this session_start_time attribute optional. I agree with this.

Also, I want to point out that the units on the “timestamps“ field are only necessary if 1) there is no RealTime/ExperimentTime correspondence AND 2) there is not other common reference point between the RealTime coordinate system (either ISO_8601 or POSIX) and “timestamps”. This means that no units are required if both session_start_time and something like session_end_time are provided (and of course the value of timestamps[0] and timestamps[-1] coincide with these two fields).

My preference is to adopt ISO_8601 as the coordinate system for RealTime via session_start_time, but make its inclusion optional, as well as a new optional session_stop_time. Also, I would be fine making the “timestamps” unit optional, however if it is not specified then both session_start_time and session_stop_time must be provided. This is a backwards compatible change that increases flexibility for building the files, while ensuring that “timestamps” is always a unitful quantity.

If we cannot agree on a coordinate system for RealTime, then an additional piece of metadata that specifies either ISO_8601 or POSIX will be necessary, required any time session_start_time is required.

neuromusic commented 7 years ago

@nicain to clarify, the discussion on whether to use POSIX or ISO8601 for "RealTime" reference is #50.

This issue was prompted by the feature request by @lfrank (now retracted, I guess?) to support POSIX values (instead of seconds relative to experiment start) on the "ExperimentTime" timestamps themselves.

neuromusic commented 7 years ago

@ajtritt you are absolutely right that the two proposals are the same wrt to the data they represent. in fact, if I were implementing this in a python class, I would hide the implementation in hidden attributes and expose BOTH approaches as properties of the class, use the math you outlined to query either, and have setters and getters that either compute the timestamp_offset if timestamps_reference_time is passed or compute timestamps_reference_time if timestamp_offset is passed.

so practically, pynwb should be able to both proposals, regardless of what is decided here w/r/t to the schema.

so this discussion is about the

  1. there are cases in which an experimenter does not want to use session_start_time as a the reference to map "experiment time" to "real world time"
  2. sometimes an experimenter wants to define an alternative reference to "real world time"
  3. sometimes an experimenter does not want a reference to "real world time"

from the perspective of someone who needs to write these files and coerce my data into the spec, I find the explicit approach of Proposal 2 to be far more intuitive. Further, Proposal 1 only meets needs 1 & 2.

again, we've already made a decision to break backward compatibility with the schema, hence the "2.0" designation. if there is ever an appropriate time to break more things in the pursuit of clarity and improvement, this is it.

tjd2002 commented 6 years ago

With NWB 2.x spec being finalized, and now that solid ISO8601 support is in pynwb ( via https://github.com/NeurodataWithoutBorders/pynwb/pull/641 ), it seems like a good time to try to reach a decision on this issue!

I would like to throw my support to @neuromusic 's proposal ("proposal 2") for an additional, optional timestamps_reference_time field that encodes the timestamp zero time as an ISO8601 string. (As opposed to "Proposal 1": an optional field representing an offset (in seconds) that is subtracted from the session_start_time to obtain the reference time).

I believe we can achieve backwards-compatibility by declaring that: "if timestamps_reference_time is not specified, then the timestamp reference time is taken to be the session_start_time".

I (quite strongly) prefer the specification of a reference time (rather than an offset from session_start_time) as being simpler to use and more intuitive during both file write and read. For instance:

# For a file that uses POSIX timestamps

from datetime import datetime
from dateutil import tz

session_start_time        = datetime(2017, 10, 31,  9, 59,  0, tzinfo=tz.gettz('US/Pacific'))
timestamps_reference_time = datetime(1970,  1,  1,  0,  0,  0, tzinfo=tz.tzutc())
file_create_date          = datetime.now(tz.tzlocal())

t-b commented 6 years ago

@tjd2002 Proposal 2 from Andrew's post in https://github.com/NeurodataWithoutBorders/nwb-schema/issues/49#issuecomment-326391755?

tjd2002 commented 6 years ago

@t-b Yes.

Proposal 1 = "Timestamps_offset" Proposal 2 = "Timestamps_reference_time"

t-b commented 6 years ago

I can live with both proposals and would also be fine with Proposal 2. ISO8601 strings to the rescue!

ajtritt commented 6 years ago

Okay, proposal 2 it is... timestamps_reference_time as an ISO8601 string!

Any feelings on what timestamps_reference_time defaults to? I'm guessing most people will just want to store relative (to session_start_time) timestamps, so timestamps_reference_time can default to session_start_time.

@t-b @tjd2002 thoughts?

oruebel commented 6 years ago

Okay, proposal 2 it is.

I agree.

timestamps_reference_time can default to session_start_time.

Yes, based on the discussion in https://github.com/NeurodataWithoutBorders/nwb-schema/issues/50

tjd2002 commented 6 years ago

Great that we seem to have a consensus!

timestamps_reference_time can default to session_start_time.

This behavior at the API level sounds works for us and affords a level of backwards compatibility.

Am I correct that this means that timestamps_reference_time will become a required field in the spec, with a default value set by the API at write time? For those who wanted to make session_start_time optional, this might be unwelcome news (though it's not a lot worse for them than the current situation).

t-b commented 6 years ago

@tjd2002 Just a FYI. According to http://neurodatawithoutborders.github.io/development_plan this monday (11/05/2018) is the cut off date for schema changes.

tjd2002 commented 6 years ago

Thanks @t-b! I will take this on and make my best effort to have a PR ready for review on Monday.

Sent from my phone

On Nov 3, 2018, at 7:41 AM, Thomas Braun notifications@github.com wrote:

@tjd2002 Just a FYI. According to http://neurodatawithoutborders.github.io/development_plan this monday (11/05/2018) is the cut off date for schema changes.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.