Clarify format of `session_start_time` and `file_creation_date`

t-b commented 7 years ago

The NWB spec says "Date + time, Use ISO format (eg, ISO 8601) or a format that is easy to read and unambiguous."

https://github.com/NeurodataWithoutBorders/nwb-schema/blob/4dfe947265fda43dd3e70f1cde098361b8451828/core/nwb.file.yaml#L11

https://github.com/NeurodataWithoutBorders/nwb-schema/blob/4dfe947265fda43dd3e70f1cde098361b8451828/core/nwb.file.yaml#L31)

This definition is too vague for interaction across different APIs.

I therefore propose to change the wording toStrict ISO8601 format including timezone information and T separator: “2017-08-28T23:24:47+02:00". Using a 64bit POSIX timestamp has the drawback that the reader needs to account for leap seconds, but would be equally fine for me.

Note: This issue is different than #49 as it does not change the specification but just clarifies it.

neuromusic commented 7 years ago

+1

jonc125 commented 7 years ago

I'd prefer to go with the human readable ISO8601 format for this, as it's immediately interpretable by someone viewing the file with generic HDF5 tools, e.g. HdfView. If we want a POSIX timestamp, which I can see value of for software use, I think that should be a separate field, and possibly the new field proposed in #49.

neuromusic commented 7 years ago

note @jonc125 that this issue (indeed, this entire repo) deals only with the schema and does not (necessarily) determine how the data is stored in HDF5. one of the major (and much needed) changes that @ajtritt & @oruebel have implemented in this refactoring of the NWB format is to separate the concerns of data storage, schema, and API

neuromusic commented 7 years ago

in other words, even if we decide here that the schema should be in ISO8601, the backend (whether HDF5 or PostgreSQL or whatever) could still store the data as a POSIX timestamp.

similarly, even if we decide here that the schema should be in ISO8601, the API (pynwb) could still accept POSIX timestamps and coerce them to ISO8601

t-b commented 7 years ago

@neuromusic But that would imply that we have a separate specification for the backend, or? Or how do you want otherwise exchange HDF5 files between NWB APIs without knowing how the file looks like?

neuromusic commented 7 years ago

yep. I don't understand this layer of the architecture very well myself, but FORM is the package that interprets the specification and implements it in the backend. currently, this lives in the pynwb repo (https://github.com/NeurodataWithoutBorders/pynwb/tree/e2d2f55d2df8bd7cd6f309d1a515c08925aefd76/src/form/), with the supported backends (currently only HDF5) in backends.

I suspect that FORM will get spun off into a separate package once development has stabilized. perhaps @ajtritt & @oruebel can clarify?

oruebel commented 7 years ago

In general, how the spec maps to specific entities in HDF5 has been somewhat implicit in NWB since the schema uses the same basic concepts that HDF5 is using (Group, Datasets, Attributes, Links etc.). As we keep refining NWB, I think one of the items that needs to happen is that we need to fill in the gaps where it is not clear or ambiguous how a particular item from the spec maps to HDF5. To start to address this problem I created a separate documentation for this here: http://nwb-storage.readthedocs.io/en/latest/storage_hdf5.html . The sources for this are in this repo in the folder docs/storage. All of this still needs refinement, but I hope it's at least a good starting point for this. In general my goal was to have for each main part of NWB a dedicated documentation. The sources for all of these docs are in the docs folder of this repo (i.e, aside from the PyNWB docs which live in the PyNWB repo). The following page gives an overview of the current set of items and has links to all the different documents http://nwb-overview.readthedocs.io/en/latest/nwbintro.html

@neuromusic I agree that whether FORM should be spun off into a separate repo is something we can revisit once developments have stabilized. Originally, what is now FORM was part of PyNWB itself and was then split out as a separate package so that FORM implements all the generic functionality and PyNWB implements the specifics and UI for NWB.

neuromusic commented 7 years ago

thank you for the clarification. so de facto my statement "even if we decide here that the schema should be in ISO8601, the backend (whether HDF5 or PostgreSQL or whatever) could still store the data as a POSIX timestamp." is not true, but theoretically could be (e.g. if FORM decided to coerce any ISO8601 strings to POSIX timestamps)?

oruebel commented 7 years ago

@neuromusic with regard to ISO8601 vs. POSIX etc., I agree that we have some flexibility here in terms of what the API can accept, since we can translate to a common format when we store them to HDF5. For the NWB format itself however, we probably want to decide on one (or at least a limit set of) allowed ways to store this in HDF5. Regarding your last question, in general, you could theoretically do something like define ISO8601 in the schema and then say that the storage should use something else, but for the official NWB format I view this as more of a theoretical problem since what we say in the schema should be what we actually do. This kind of thing is more relevant once someone might develop their own custom backend for NWB, e.g., to integrate with their lab data management system and to be able to easily translate between their custom backend and the official NWB HDF5 data backend. For the purpose of this discussion, whatever we define in the schema should be what is used in the HDF5 backend.

t-b commented 7 years ago

For the purpose of this discussion, whatever we define in the schema should be what is used in the HDF5 backend.

Thanks for this clarification.

neuromusic commented 7 years ago

in that case, I vote that we only support ISO8601-strict, per @t-b's original suggestion

t-b commented 7 years ago

What is the procedure now? Do we all agree and I make a PR?

t-b commented 6 years ago

@oruebel @ajtritt @neuromusic I would like to push that change further.

My goal is:

Define in the nwb-schema that the only valid timestamp format is ISO8601 with timezone information. (https://github.com/NeurodataWithoutBorders/nwb-schema/pull/71)
Change pynwb to accept ISO8601 timestamps and store it correctly including timezone information. For that we would need a custom serialization to string for datetime objects. A hint where I have to add that would be nice.

Can I go further?

bendichter commented 6 years ago

Why would this require a custom serialization?

t-b commented 6 years ago

Both use the standard serialization of datetime objects

In [1]: import datetime

In [2]: datetime.datetime.now()
Out[2]: datetime.datetime(2018, 9, 3, 11, 50, 39, 513509)

In [4]: str(datetime.datetime.now())
Out[4]: '2018-09-03 11:50:52.222236'

but for strict ISO8601 compliance we need a T instead of (space).

bendichter commented 6 years ago

Oh ok I did not realize that. My preference would be

import pendulum
session_start_time = pendulum.now().to_iso8601_string()

'2018-09-03T18:22:52-07:00'

and

import datetime
dt = datetime.datetime(2018, 9, 3, 11, 50, 39, 513509)
file_creation_time = pendulum.instance(dt).in_timezone(pendulum.now().timezone).to_iso8601_string()

oruebel commented 6 years ago

It would be great if we could avoid adding a third-party dependence for this

bendichter commented 6 years ago

I agree, but getting a machine's local timezone is pretty involved and changes by OS: https://github.com/sdispater/pendulum/blob/master/pendulum/tz/local_timezone.py. Is there an easier way?

oruebel commented 6 years ago

Maybe I'm missing something, but isn't the issue only how datetime.datetime.now() is being serialized. Can't we just use this:

https://docs.python.org/2/library/datetime.html#datetime.datetime.isoformat

bendichter commented 6 years ago

That requires the user to enter the timezone information:

 class TZ(tzinfo):
     def utcoffset(self, dt): return timedelta(minutes=-399)

I'd like to avoid this.

oruebel commented 6 years ago

Why not always use UTC timestamps?

bendichter commented 6 years ago

I think this is a good example of where computer scientists and neuroscientists would have different expectations. In CS applications it often makes sense to have a global clock because you frequently deal with systems that have components physically located all over the world. It rarely matters whether something happened at morning or at night, so the logical thing is to store UTC.

For neuroscientists, I would argue the opposite is true. When a neuroscientist asks when something occurred, they are usually interested in local time. It's much more likely that they would care whether something occurred in the day or night, since this has real neurological implications. If you store data as UTC, you lose this information and it is not recoverable. Our current solution offers no way of storing timezone information, which is not a workable solution for many neuroscience labs.

It's rare that you would use components in different timezones in data collection, but it does sometimes occur. That's why it's worth storing the timezone information so that we can recover UTC if we need to. I think this solution covers our use-cases much better than just UTC without timezone information.

t-b commented 6 years ago

Okay I've recently had a theory-hits-reality situation regarding this task. I would like to force users entering a timezone when creating a pynwb file, but if I need to convert data, in my case patchmaster/pclamp data, this might not have timezone information attached to it.

It's rare that you would use components in different timezones in data collection, but it does sometimes occur.

The file_create_date holds one entry for each modification, and having data collection and data evaluation in different timezones can happen quite often.

If local time zone offset is important to keep (vs converting that to UTC) I'm fine with that as well.

So let's do the following:

Adapt schema to allow only ISO8601 timestamps and strongly suggest to use timezone information as well.
Always use local time with timezone info inside pynwb (NWBFile)
Change serialization of datetime to use isoformat().
Add tests to ensure that it works and stays like that
Document changes in release notes

Agreed?

bendichter commented 6 years ago

Your patchmaster/pclamp data is in UTC?

The file_create_date holds one entry for each modification

OK well that is a misleading name.

and having data collection and data evaluation in different timezones can happen quite often.

Yeah I don't see a problem here if you include the timezone with every entry.

How is the user going to use timezone information? What do the pynwb commands look like?

oruebel commented 6 years ago

* Always use local time with timezone info inside pynwb (`NWBFile`)

I don't think it is a good idea to have the software automatically behave differently depending on what timezone the system is set to. I think timezone information needs to be provided by the user, even if allow the user to set a default timezone once.

More generally speaking. We should keep it as simple as possible to bring data to a common timezone. As such, if we store local times, I think we should always include the "offset to UTC" in the time.

How is the user going to use timezone information? What do the pynwb commands look like?

Currently PyNWB does not deal with converting timestamps between timezones. I can see how this is useful, but I'm also wondering whether that is something we want to make part of a separate utility class, rather than making part of the container classes itself.

mgrauer commented 6 years ago

As a point of comparison, Postgres stores timestamps in UTC, so it is up to the query inserting data to make any adjustments from some other time zone as an offset to UTC.

oruebel commented 6 years ago

@t-b @bendichter just to summarize and clarify the action items on this:

Adapt schema to allow only ISO8601 timestamps and require use of timezone offsets. In the schema this is currently just a comment on in the doc of the fields as we currently can't enforce a particular string formatting
In PyNWB change serialization of datetime to use isoformat()
In PyNWB require in NWBFile.init that the user provide a datetime object with timezone information set. This means: 1) don't allow str as input for session_start_time, 2) Check in NWBFile.init that the timezone is set, 3) if timezone is not set show a warning that should explain to users how to set the timezone information (and set UTC+0) as timezone (alternatively create ValueError instead).
Add tests to ensure that it works and stays like that
Document changes in release notes

@t-b since you have been spearheading this discussion, can you implement these changes? Ideally we should do this within the next 3 week timeframe to make sure this makes it into the NWB2 full release. Please implement those changes (including those to the schema) in PyNWB (the schema is in src/pynwb/data in PyNWB). We'll then merge the schema from PyNWB over to the schema repo. @t-b I'll close #71 here, those changes should be subsumed by the above changes.

t-b commented 6 years ago

@oruebel Sounds good. I'll come up with something.

NeurodataWithoutBorders / nwb-schema

Clarify format of `session_start_time` and `file_creation_date` #50