Open CagtayFabry opened 4 years ago
I like this idea. I don't see this as being an undue burden on other languages, since they're free to deserialize the timestamps into a regular integer array. ASDF implementations would need to "remember" that the type was timestamp, but there are already other properties of ndarrays that need to be tracked and handled.
Some alternative implementation ideas:
- Add a subtype field to ndarray and include timestamp_ns and timedelta_ns as options, so that users can efficiently store small timedeltas using containers smaller than 64 bits.
- Create a separate schema, http://asdf-format.org/schemas/ndarray_timedelta. The schema would just be a simple $ref to the ndarray schema (only possible if we remove the tag property from core schemas, as suggested in #269).
I also thought about something like subtype
(basically how we handle numpy datetime dtypes in our own classes currently). It might prevent bloating the supported dtypes and end up more flexible (similar to how big/little endian encoding is handled now).
When introducing http://asdf-format.org/schemas/ndarray_timedelta I could see this leading to cases of
anyOf:
- tag: http://asdf-format.org/schemas/ndarray
- tag: http://asdf-format.org/schemas/ndarray_timedelta
in other schemas where both cases should be allowed. Could this be prevented? (same for using quantity)
That's a good point, subtype is more convenient for that sort of thing. We could add another custom validator to allow schema authors to restrict the subtype value.
It is possible to create http://asdf-format.org/schemas/ndarray_all for easy access to that anyOf structure, but that seems more complicated than subtype.
@perrygreenfield do you have any thoughts on this one?
I guess implementing subtype should also make it very easy to define http://asdf-format.org/schemas/ndarray_timedelta using allOf without implementing a validator (it could also be defined in custom extensions if necessary).
quick reminder that
tag: http://asdf-format.org/schemas/ndarray*
should also be possible now
Thanks for mentioning this issue! I'll read through it and start taking a look.
I spent some time looking into this today. One complication (that I don't yet have a solution for) is that the unit associated with a np.datetime64 can take a number of possible values. This means the datatype will need to encode not only datetime64 but also the unit in order to interpret the bytes of a datetime64 array. Take the following example:
>>> import numpy as np
>>> dt0 = np.datetime64(0xFFFF, ("s", 42))
>>> dt0
numpy.datetime64('1970-02-01T20:34:30','42s')
>>> dt0.tobytes()
b'\xff\xff\x00\x00\x00\x00\x00\x00'
>>> dt1 = np.datetime64(0xFFFF, "D")
>>> dt1
numpy.datetime64('2149-06-06')
>>> dt1.tobytes()
b'\xff\xff\x00\x00\x00\x00\x00\x00'
>>> dt0 == dt1
False
>>> dt0.tobytes() == dt1.tobytes()
True
Converting everything to a 'standard' unit would make some valid datetime64 values that use non-standard units unusable, because different units cover different ranges.
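The range point can be made concrete with plain arithmetic. The sketch below assumes the values are signed 64-bit counts of the unit on either side of the epoch; the helper name half_range_years and the 365.25-day year are illustrative choices, not anything taken from asdf or numpy.

```python
# Approximate half-range, in years, of a signed 64-bit tick counter.
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, an illustrative assumption

# Seconds per tick for a few datetime64 units.
SECONDS_PER_TICK = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def half_range_years(unit: str) -> float:
    """Years representable on either side of the epoch for the given unit."""
    return 2**63 * SECONDS_PER_TICK[unit] / SECONDS_PER_YEAR


for unit in SECONDS_PER_TICK:
    print(f"{unit}: about +/- {half_range_years(unit):.3g} years")
```

Under these assumptions nanosecond resolution spans only a few hundred years around 1970, so a date that is valid in a coarser unit such as s or D may simply not fit after conversion to ns.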
True, but that would have to be stored in the dtype information of the asdf file anyway, as I think there is no simple datetime64 dtype without any unit (please correct me if I'm wrong).
To be fair, my initial example only listed the 'timedelta64[ns]': 'm8[ns]', 'datetime64[ns]': 'M8[ns]' pairs; I didn't consider the different timescales back then.
import numpy as np

# Print the repr and the string form of the dtype for each standard unit.
for u in ["as", "fs", "ps", "ns", "us", "ms", "s", "m", "h", "D", "W", "M", "Y"]:
    dtype = np.datetime64(0xFFFF, u).dtype
    print(repr(dtype) + " : " + str(dtype))
dtype('<M8[as]') : datetime64[as]
dtype('<M8[fs]') : datetime64[fs]
dtype('<M8[ps]') : datetime64[ps]
dtype('<M8[ns]') : datetime64[ns]
dtype('<M8[us]') : datetime64[us]
dtype('<M8[ms]') : datetime64[ms]
dtype('<M8[s]') : datetime64[s]
dtype('<M8[m]') : datetime64[m]
dtype('<M8[h]') : datetime64[h]
dtype('<M8[D]') : datetime64[D]
dtype('<M8[W]') : datetime64[W]
dtype('<M8[M]') : datetime64[M]
dtype('<M8[Y]') : datetime64[Y]
Of course, it seems impractical to cover every possible "custom" datetime dtype like np.datetime64(0xFFFF, ("s", 42)). Frankly, I have no insight into where this functionality is used.
I wanted to update this with something more substantial at this point, but unfortunately all I can say is that I'm still looking into this.
I tried implementing this via an extension and things were complicated by the extension needing to follow every asdf standard version (like the NDArrayConverter in asdf). This seems like too much of a burden to put on an extension (as it needs to effectively take over control of all ndarrays).
Do you have an example of code that works around this limitation (perhaps by converting datetime64 to an int32)? I'm curious to see how much difficulty this issue produces.
The datetime64 and timedelta64 datatypes seem a little out of place in numpy. For example, I was unable to find the unit and increment via any dtype attribute and had to rely on datetime_data. I have yet to sort out how these might fit into one of the ndarray, time or quantity schemas.
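For reference, a minimal sketch of recovering the unit and increment with np.datetime_data. The helper describe_datetime_dtype and the dict layout it returns are hypothetical and only illustrate what information a serialized datatype would have to carry.

```python
import numpy as np


def describe_datetime_dtype(dtype):
    """Return a plain dict describing a datetime64/timedelta64 dtype.

    The dict layout is illustrative only, not an asdf convention.
    """
    dtype = np.dtype(dtype)
    if dtype.kind not in "mM":
        raise TypeError(f"not a datetime64/timedelta64 dtype: {dtype}")
    # np.datetime_data returns (unit, count), e.g. ('s', 42)
    unit, step = np.datetime_data(dtype)
    return {
        "kind": "datetime64" if dtype.kind == "M" else "timedelta64",
        "unit": unit,
        "step": step,
    }


info = describe_datetime_dtype(np.datetime64(0xFFFF, ("s", 42)).dtype)
print(info)
```

The unit and step are both needed to interpret the stored bytes, so whichever schema ends up holding these arrays would have to record both.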
In the light of discussions around version 2.0 of the asdf-standard (and the version bump of all schemas) I would be interested to hear some opinions about extending the supported dtypes of ndarray. Specifically, I am interested in adding support for datetime and timedelta like dtypes directly to the ndarray schema.
I am aware of the existing time/time-1.1.0 schema, which, while versatile and complex, seems to be rather specific to astropy use cases in some regards. I think working with POSIX/unix datetimes with high (ns) precision is common in many scientific applications.
Currently core/ndarray-1.0.0 supports the basic (u)int, float and complex dtypes defined in the schema: https://github.com/asdf-format/asdf-standard/blob/29d34109e88a746abad5f9e85857133c39f45321/schemas/stsci.edu/asdf/core/ndarray-1.0.0.yaml#L190-L191
The asdf python library handles the corresponding numpy mappings here:
When looking at numpy datetime arrays those are basically just integers interpreted as POSIX timestamps or timedeltas. Unfortunately we cannot store these in an ndarray directly without casting back to integer:
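A minimal sketch of that cast, using only numpy. The metadata dict and its "unit" key are hypothetical and stand in for whatever side channel records the unit next to the stored integers.

```python
import numpy as np

# Two example timestamps with nanosecond resolution.
times = np.array(
    ["2021-03-01T12:00:00", "2021-03-01T12:00:01"], dtype="datetime64[ns]"
)

# Reinterpret the same 8-byte values as plain integers; this is the form
# that fits the integer dtypes ndarray already supports.
as_int = times.view("int64")

# The unit is lost in the integer view, so it has to be kept separately
# (hypothetical metadata layout).
metadata = {"unit": "ns"}

# Reading back: reinterpret the integers using the remembered unit.
restored = as_int.view(f"datetime64[{metadata['unit']}]")

print(as_int)
print(restored)
```

The round trip is lossless, but only because the unit is carried alongside the data by hand.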
This makes handling of numpy datetime arrays somewhat irritating (I noticed this when working with pandas and xarray objects in asdf).
I think natively supporting numpy datetime dtypes would simplify a lot of things when using asdf with other libraries that make use of numpy's datetime dtypes, thus possibly expanding asdf to be used more widely (at least throughout the python/scipy ecosystem).
In principle, supporting more dtypes should be as easy as extending the standard schema and plugin lists for the asdf-standard schema as well as the python mapping (it seems to work but I have not looked into it in detail).
Of course, one issue with adding dtypes to the core ndarray schema is that all libraries implementing the asdf-standard (asdf-cpp?) would have to add support for these specific datetime dtypes. Honestly, I am not aware of how many asdf implementations there are for other languages and how difficult this would be to implement (probably not as easy as with python/numpy).
Another option could be to somehow allow an extension to add support for specific dtypes to ndarray. However, I don't know if this can be done in the current implementation of the asdf-standard.