asdf-format / asdf-standard

Standards document describing ASDF, Advanced Scientific Data Format
http://asdf-standard.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

adding datetime-like dtypes to ndarray #270

Open CagtayFabry opened 4 years ago

CagtayFabry commented 4 years ago

In light of the discussions around version 2.0 of the asdf-standard (and the version bump of all schemas), I would be interested to hear some opinions about extending the supported dtypes of ndarray. Specifically, I am interested in adding support for datetime- and timedelta-like dtypes directly to the ndarray schema.

I am aware of the existing time/time-1.1.0 schema, which, while versatile and complex, seems rather specific to astropy use cases in some regards. I think working with POSIX/Unix datetimes at high (ns) precision is common in many scientific applications.

Currently core/ndarray-1.0.0 supports the basic (u)int, float and complex dtypes defined in the schema: https://github.com/asdf-format/asdf-standard/blob/29d34109e88a746abad5f9e85857133c39f45321/schemas/stsci.edu/asdf/core/ndarray-1.0.0.yaml#L190-L191

The asdf python library handles the corresponding numpy mappings here:

_datatype_names = {
    'int8'       : 'i1',
    'int16'      : 'i2',
    'int32'      : 'i4',
    'int64'      : 'i8',
    'uint8'      : 'u1',
    'uint16'     : 'u2',
    'uint32'     : 'u4',
    'uint64'     : 'u8',
    'float32'    : 'f4',
    'float64'    : 'f8',
    'complex64'  : 'c8',
    'complex128' : 'c16',
    'bool8'      : 'b1'
}

Looking at numpy datetime arrays, they are basically just integers interpreted as POSIX timestamps or timedeltas. Unfortunately, we cannot store these in an ndarray directly without casting back to integers:

import numpy as np
import asdf

tree = {"times":np.arange(0,3,dtype="datetime64[ns]")}
with asdf.AsdfFile(
    tree,
) as ff:
    ff.write_to("datetimes.asdf")

>>> ValueError: cannot include dtype 'M' in a buffer
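
As a point of reference, here is a minimal sketch of the integer round-trip that does work today; the tree keys times_int and times_dtype are just an ad-hoc convention for this example, not part of any schema:

import numpy as np
import asdf

times = np.arange(0, 3, dtype="datetime64[ns]")

# workaround: store the underlying int64 view and keep the dtype string out of band
tree = {"times_int": times.view("int64"), "times_dtype": str(times.dtype)}
asdf.AsdfFile(tree).write_to("datetimes_workaround.asdf")

with asdf.open("datetimes_workaround.asdf") as af:
    # reinterpret the stored integers with the recorded dtype on the way back in
    restored = np.asarray(af.tree["times_int"]).view(af.tree["times_dtype"])
    assert np.array_equal(restored, times)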

This makes handling of numpy datetime arrays somewhat irritating (I noticed this when working with pandas and xarray objects in asdf).

I think natively supporting numpy datetime dtypes would simplify a lot of things when using asdf with other libraries that make use of numpy's datetime dtypes, possibly helping asdf see wider use (at least throughout the Python/SciPy ecosystem).

In principle, supporting more dtypes should be as easy as extending the dtype list in the asdf-standard ndarray schema as well as the Python mapping (it seems to work, but I have not looked into it in detail):

enum: [int8, uint8, int16, uint16, int32, uint32, int64, uint64,
       float32, float64, complex64, complex128, bool8, "timedelta64[ns]", "datetime64[ns]"]
_datatype_names = {
    'int8'       : 'i1',
    'int16'      : 'i2',
    'int32'      : 'i4',
    'int64'      : 'i8',
    'uint8'      : 'u1',
    'uint16'     : 'u2',
    'uint32'     : 'u4',
    'uint64'     : 'u8',
    'float32'    : 'f4',
    'float64'    : 'f8',
    'complex64'  : 'c8',
    'complex128' : 'c16',
    'bool8'      : 'b1',
    'timedelta64[ns]': 'm8[ns]',
    'datetime64[ns]': 'M8[ns]'
}

Of course, one issue with adding dtypes to the core ndarray schema is that all libraries implementing the asdf-standard (asdf-cpp?) would have to add support for these specific datetime dtypes. Honestly, I am not aware of how many asdf implementations exist for other languages or how difficult this would be to implement (probably not as easy as with Python/numpy).

Another option could be to somehow allow an extension to add support for specific dtypes to ndarray. However, I don't know whether this can be done in the current implementation of the asdf-standard.
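
For what it's worth, a rough sketch of how a converter-based extension might handle this today without touching the core ndarray schema, using a hypothetical wrapper class and placeholder URIs (everything under asdf://example.org/ is made up for illustration and a real extension would also ship a schema):

import numpy as np
from asdf.extension import Converter, Extension

class DatetimeArray:
    """Hypothetical wrapper that marks an array as datetime64 data."""
    def __init__(self, values):
        self.values = np.asarray(values)

class DatetimeArrayConverter(Converter):
    tags = ["asdf://example.org/tags/datetime_array-1.0.0"]  # placeholder tag
    types = [DatetimeArray]

    def to_yaml_tree(self, obj, tag, ctx):
        # record the unit/step alongside the raw integers
        unit, count = np.datetime_data(obj.values.dtype)
        return {
            "data": obj.values.view("int64"),  # serialized by the normal ndarray machinery
            "unit": unit,
            "count": int(count),
        }

    def from_yaml_tree(self, node, tag, ctx):
        spec = node["unit"] if node["count"] == 1 else f"{node['count']}{node['unit']}"
        return DatetimeArray(np.asarray(node["data"]).view(f"datetime64[{spec}]"))

class DatetimeArrayExtension(Extension):
    extension_uri = "asdf://example.org/extensions/datetime_array-1.0.0"  # placeholder
    tags = ["asdf://example.org/tags/datetime_array-1.0.0"]
    converters = [DatetimeArrayConverter()]

# registration would look something like:
# asdf.get_config().add_extension(DatetimeArrayExtension())

The obvious drawback is that such a converter only covers its own wrapper type; it does not make plain numpy datetime64 arrays inside arbitrary trees serializable, which is essentially the limitation discussed further down in this thread.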

eslavich commented 4 years ago

I like this idea. I don't see this as being an undue burden on other languages, since they're free to deserialize the timestamps into a regular integer array. ASDF implementations would need to "remember" that the type was a timestamp, but there are already other properties of ndarrays that need to be tracked and handled.

Some alternative implementation ideas: for example, instead of adding new datatype values, a separate subtype property could be added to ndarray.

CagtayFabry commented 4 years ago

I also thought about something like subtype (basically how we handle numpy datetime dtypes in our own classes currently). It might prevent bloating the supported dtypes and end up more flexible (similar to how big/little endian encoding is handled now).

When introducing http://asdf-format.org/schemas/ndarray_timedelta I could see this leading to cases of

anyOf:
  - tag: http://asdf-format.org/schemas/ndarray
  - tag: http://asdf-format.org/schemas/ndarray_timedelta

in other schemas where both cases should be allowed. Could this be prevented? (same for using quantity)

eslavich commented 4 years ago

That's a good point, subtype is more convenient for that sort of thing. We could add another custom validator to allow schema authors to restrict the subtype value.

It is possible to create http://asdf-format.org/schemas/ndarray_all for easy access to that anyOf structure, but that seems more complicated than subtype.

eslavich commented 4 years ago

@perrygreenfield do you have any thoughts on this one?

CagtayFabry commented 4 years ago

I guess implementing subtype should also make it very easy to define http://asdf-format.org/schemas/ndarray_timedelta using allOf without implementing a validator (it could also be defined in custom extensions if necessary).

CagtayFabry commented 3 years ago

quick reminder that

  tag: http://asdf-format.org/schemas/ndarray*

should also be possible now

braingram commented 10 months ago

Thanks for mentioning this issue! I'll read through it and start taking a look.

braingram commented 10 months ago

I spent some time looking into this today. One complication (that I don't yet have a solution for) is that the unit associated with a np.datetime64 can take a number of possible values, which means the datatype will need to encode not only datetime64 but also the unit required to interpret the bytes of a datetime64 array. Take the following example:

>>> dt0 = np.datetime64(0xFFFF, ("s", 42))
>>> dt0
numpy.datetime64('1970-02-01T20:34:30','42s')
>>> dt0.tobytes()
b'\xff\xff\x00\x00\x00\x00\x00\x00'
>>> dt1 = np.datetime64(0xFFFF, "D")
>>> dt1
numpy.datetime64('2149-06-06')
>>> dt1.tobytes()
b'\xff\xff\x00\x00\x00\x00\x00\x00'
>>> dt0 == dt1
False
>>> dt0.tobytes() == dt1.tobytes()
True

Converting to a 'standard' unit would mean that some valid datetime64 values that use non-standard units become unrepresentable, since the different units cover different ranges.
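
To make the range issue concrete, a small illustration; the printed dates are just a consequence of the 64-bit integer representation:

import numpy as np

# NaT is stored as the minimum int64, so the maximum int64 is a valid datetime value
imax = np.iinfo(np.int64).max

print(np.datetime64(imax, "ns"))  # 2262-04-11T23:47:16.854775807
print(np.datetime64(imax, "us"))  # roughly the year 294000
print(np.datetime64(imax, "D"))   # an astronomically distant year

# a perfectly valid day-resolution date ...
far_future = np.datetime64("9999-01-01", "D")
# ... has no faithful datetime64[ns] representation, since the ns range ends in 2262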

CagtayFabry commented 10 months ago

True, but that would have to be stored in the dtype information of the asdf file anyway, as I think there is no plain datetime64 dtype without any unit (please correct me if I'm wrong). To be fair, my initial example only listed the 'timedelta64[ns]': 'm8[ns]' and 'datetime64[ns]': 'M8[ns]' pairs; I didn't consider the different timescales back then.

import numpy as np

for u in ["as", "fs", "ps", "ns", "us", "ms", "s", "m", "h", "D", "W", "M", "Y"]:
    dtype = np.datetime64(0xFFFF, u).dtype
    print(f"{dtype!r} : {dtype}")

dtype('<M8[as]') : datetime64[as]
dtype('<M8[fs]') : datetime64[fs]
dtype('<M8[ps]') : datetime64[ps]
dtype('<M8[ns]') : datetime64[ns]
dtype('<M8[us]') : datetime64[us]
dtype('<M8[ms]') : datetime64[ms]
dtype('<M8[s]') : datetime64[s]
dtype('<M8[m]') : datetime64[m]
dtype('<M8[h]') : datetime64[h]
dtype('<M8[D]') : datetime64[D]
dtype('<M8[W]') : datetime64[W]
dtype('<M8[M]') : datetime64[M]
dtype('<M8[Y]') : datetime64[Y]

Of course, it seems impractical to cover every possible "custom" datetime dtype like np.datetime64(0xFFFF, ("s", 42)); frankly, I have no insight into where that functionality is used.

braingram commented 9 months ago

I wanted to update this with something more substantial at this point, but unfortunately all I can say is that I'm still looking into it.

I tried implementing this via an extension and things were complicated by the extension needing to follow every asdf standard version (like the NDArrayConverter in asdf). This seems like too much of a burden to put on an extension (as it needs to effectively take over control of all ndarrays).

Do you have an example of code that works around this limitation (perhaps by converting datetime64 to an int64)? I'm curious to see how much difficulty this issue produces.

The datetime64 and timedelta64 datatypes seem a little out of place in numpy. For example, I was unable to find the unit and increment via any dtype attribute and had to rely on datetime_data. I have yet to sort out how these might fit into one of the ndarray time or quantity schemas.
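
For reference, a quick illustration of np.datetime_data, which exposes the unit and step count as a tuple:

import numpy as np

print(np.datetime_data(np.dtype("datetime64[ns]")))              # ('ns', 1)
print(np.datetime_data(np.dtype("timedelta64[us]")))             # ('us', 1)
print(np.datetime_data(np.datetime64(0xFFFF, ("s", 42)).dtype))  # ('s', 42)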