Closed t-b closed 6 years ago
+1
I'd prefer to go with the human readable ISO8601 format for this, as it's immediately interpretable by someone viewing the file with generic HDF5 tools, e.g. HdfView. If we want a POSIX timestamp, which I can see value of for software use, I think that should be a separate field, and possibly the new field proposed in #49.
note @jonc125 that this issue (indeed, this entire repo) deals only with the schema and does not (necessarily) determine how the data is stored in HDF5. one of the major (and much needed) changes that @ajtritt & @oruebel have implemented in this refactoring of the NWB format is to separate the concerns of data storage, schema, and API
in other words, even if we decide here that the schema should be in ISO8601, the backend (whether HDF5 or PostgreSQL or whatever) could still store the data as a POSIX timestamp.
similarly, even if we decide here that the schema should be in ISO8601, the API (pynwb) could still accept POSIX timestamps and coerce them to ISO8601
@neuromusic But that would imply that we have a separate specification for the backend, or? Or how do you want otherwise exchange HDF5 files between NWB APIs without knowing how the file looks like?
yep. I don't understand this layer of the architecture very well myself, but FORM is the package that interprets the specification and implements it in the backend. currently, this lives in the pynwb repo (https://github.com/NeurodataWithoutBorders/pynwb/tree/e2d2f55d2df8bd7cd6f309d1a515c08925aefd76/src/form/), with the supported backends (currently only HDF5) in backends
.
I suspect that FORM will get spun off into a separate package once development has stabilized. perhaps @ajtritt & @oruebel can clarify?
In general, how the spec maps to specific entities in HDF5 has been somewhat implicit in NWB since the schema uses the same basic concepts that HDF5 is using (Group, Datasets, Attributes, Links etc.). As we keep refining NWB, I think one of the items that needs to happen is that we need to fill in the gaps where it is not clear or ambiguous how a particular item from the spec maps to HDF5. To start to address this problem I created a separate documentation for this here: http://nwb-storage.readthedocs.io/en/latest/storage_hdf5.html . The sources for this are in this repo in the folder docs/storage. All of this still needs refinement, but I hope it's at least a good starting point for this. In general my goal was to have for each main part of NWB a dedicated documentation. The sources for all of these docs are in the docs folder of this repo (i.e, aside from the PyNWB docs which live in the PyNWB repo). The following page gives an overview of the current set of items and has links to all the different documents http://nwb-overview.readthedocs.io/en/latest/nwbintro.html
@neuromusic I agree that whether FORM should be spun off into a separate repo is something we can revisit once developments have stabilized. Originally, what is now FORM was part of PyNWB itself and was then split out as a separate package so that FORM implements all the generic functionality and PyNWB implements the specifics and UI for NWB.
thank you for the clarification. so de facto my statement "even if we decide here that the schema should be in ISO8601, the backend (whether HDF5 or PostgreSQL or whatever) could still store the data as a POSIX timestamp." is not true, but theoretically could be (e.g. if FORM decided to coerce any ISO8601 strings to POSIX timestamps)?
@neuromusic with regard to ISO8601 vs. POSIX etc., I agree that we have some flexibility here in terms of what the API can accept, since we can translate to a common format when we store them to HDF5. For the NWB format itself however, we probably want to decide on one (or at least a limit set of) allowed ways to store this in HDF5. Regarding your last question, in general, you could theoretically do something like define ISO8601 in the schema and then say that the storage should use something else, but for the official NWB format I view this as more of a theoretical problem since what we say in the schema should be what we actually do. This kind of thing is more relevant once someone might develop their own custom backend for NWB, e.g., to integrate with their lab data management system and to be able to easily translate between their custom backend and the official NWB HDF5 data backend. For the purpose of this discussion, whatever we define in the schema should be what is used in the HDF5 backend.
For the purpose of this discussion, whatever we define in the schema should be what is used in the HDF5 backend.
Thanks for this clarification.
in that case, I vote that we only support ISO8601-strict, per @t-b's original suggestion
What is the procedure now? Do we all agree and I make a PR?
@oruebel @ajtritt @neuromusic I would like to push that change further.
My goal is:
Can I go further?
Why would this require a custom serialization?
Both use the standard serialization of datetime objects
In [1]: import datetime
In [2]: datetime.datetime.now()
Out[2]: datetime.datetime(2018, 9, 3, 11, 50, 39, 513509)
In [4]: str(datetime.datetime.now())
Out[4]: '2018-09-03 11:50:52.222236'
but for strict ISO8601 compliance we need a T
instead of
(space).
Oh ok I did not realize that. My preference would be
import pendulum
session_start_time = pendulum.now().to_iso8601_string()
'2018-09-03T18:22:52-07:00'
and
import datetime
dt = datetime.datetime(2018, 9, 3, 11, 50, 39, 513509)
file_creation_time = pendulum.instance(dt).in_timezone(pendulum.now().timezone).to_iso8601_string()
It would be great if we could avoid adding a third-party dependence for this
I agree, but getting a machine's local timezone is pretty involved and changes by OS: https://github.com/sdispater/pendulum/blob/master/pendulum/tz/local_timezone.py. Is there an easier way?
Maybe I'm missing something, but isn't the issue only how datetime.datetime.now()
is being serialized. Can't we just use this:
https://docs.python.org/2/library/datetime.html#datetime.datetime.isoformat
That requires the user to enter the timezone information:
class TZ(tzinfo):
def utcoffset(self, dt): return timedelta(minutes=-399)
I'd like to avoid this.
Why not always use UTC timestamps?
I think this is a good example of where computer scientists and neuroscientists would have different expectations. In CS applications it often makes sense to have a global clock because you frequently deal with systems that have components physically located all over the world. It rarely matters whether something happened at morning or at night, so the logical thing is to store UTC.
For neuroscientists, I would argue the opposite is true. When a neuroscientist asks when something occurred, they are usually interested in local time. It's much more likely that they would care whether something occurred in the day or night, since this has real neurological implications. If you store data as UTC, you lose this information and it is not recoverable. Our current solution offers no way of storing timezone information, which is not a workable solution for many neuroscience labs.
It's rare that you would use components in different timezones in data collection, but it does sometimes occur. That's why it's worth storing the timezone information so that we can recover UTC if we need to. I think this solution covers our use-cases much better than just UTC without timezone information.
Okay I've recently had a theory-hits-reality situation regarding this task. I would like to force users entering a timezone when creating a pynwb file, but if I need to convert data, in my case patchmaster/pclamp data, this might not have timezone information attached to it.
It's rare that you would use components in different timezones in data collection, but it does sometimes occur.
The file_create_date
holds one entry for each modification, and having data collection and data evaluation in different timezones can happen quite often.
If local time zone offset is important to keep (vs converting that to UTC) I'm fine with that as well.
So let's do the following:
NWBFile
)isoformat()
.Agreed?
Your patchmaster/pclamp data is in UTC?
The file_create_date holds one entry for each modification
OK well that is a misleading name.
and having data collection and data evaluation in different timezones can happen quite often.
Yeah I don't see a problem here if you include the timezone with every entry.
How is the user going to use timezone information? What do the pynwb
commands look like?
* Always use local time with timezone info inside pynwb (`NWBFile`)
I don't think it is a good idea to have the software automatically behave differently depending on what timezone the system is set to. I think timezone information needs to be provided by the user, even if allow the user to set a default timezone once.
More generally speaking. We should keep it as simple as possible to bring data to a common timezone. As such, if we store local times, I think we should always include the "offset to UTC" in the time.
How is the user going to use timezone information? What do the
pynwb
commands look like?
Currently PyNWB does not deal with converting timestamps between timezones. I can see how this is useful, but I'm also wondering whether that is something we want to make part of a separate utility class, rather than making part of the container classes itself.
As a point of comparison, Postgres stores timestamps in UTC, so it is up to the query inserting data to make any adjustments from some other time zone as an offset to UTC.
@t-b @bendichter just to summarize and clarify the action items on this:
doc
of the fields as we currently can't enforce a particular string formatting@t-b since you have been spearheading this discussion, can you implement these changes? Ideally we should do this within the next 3 week timeframe to make sure this makes it into the NWB2 full release. Please implement those changes (including those to the schema) in PyNWB (the schema is in src/pynwb/data in PyNWB). We'll then merge the schema from PyNWB over to the schema repo. @t-b I'll close #71 here, those changes should be subsumed by the above changes.
@oruebel Sounds good. I'll come up with something.
The NWB spec says "Date + time, Use ISO format (eg, ISO 8601) or a format that is easy to read and unambiguous."
https://github.com/NeurodataWithoutBorders/nwb-schema/blob/4dfe947265fda43dd3e70f1cde098361b8451828/core/nwb.file.yaml#L11
https://github.com/NeurodataWithoutBorders/nwb-schema/blob/4dfe947265fda43dd3e70f1cde098361b8451828/core/nwb.file.yaml#L31)
This definition is too vague for interaction across different APIs.
I therefore propose to change the wording to
Strict ISO8601 format including timezone information and T separator: “2017-08-28T23:24:47+02:00"
. Using a 64bit POSIX timestamp has the drawback that the reader needs to account for leap seconds, but would be equally fine for me.Note: This issue is different than #49 as it does not change the specification but just clarifies it.