blaze / datashape

Language defining a data description protocol
BSD 2-Clause "Simplified" License
183 stars 65 forks

Add TimeDelta to datashape #111

Closed cpcloud closed 9 years ago

cpcloud commented 9 years ago

closes #109

mrocklin commented 9 years ago

Cool. Thanks for working on this. Let me play with it for a sec.

mrocklin commented 9 years ago
In [5]: dshape('timedelta')
Out[5]: dshape("TimeDelta(unit='us')")

In [6]: assert eval(repr(_)) == _
  File <nofile>, line 1
    TimeDelta(unit='us')
             ^

DataShapeSyntaxError: Unexpected token in datashape
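The round trip fails because the repr emits the CamelCase constructor form while the parser only accepts the lowercase keyword syntax. A minimal sketch of a repr that stays inside the parseable grammar (class and names here are illustrative stand-ins, not datashape's actual internals):

```python
# Hypothetical sketch: emit the lowercase, parser-friendly form from
# __str__/__repr__ so that dshape(str(ds)) can round-trip.

class TimeDelta(object):
    def __init__(self, unit='us'):
        self.unit = str(unit)  # str() avoids a stray u'' prefix on Python 2

    def __str__(self):
        # The grammar the parser accepts: timedelta[unit='us']
        return "timedelta[unit='%s']" % self.unit

    __repr__ = __str__

    def __eq__(self, other):
        return isinstance(other, TimeDelta) and self.unit == other.unit

assert str(TimeDelta('ns')) == "timedelta[unit='ns']"
```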
mrocklin commented 9 years ago
In [13]: dshape('timedelta["ns"]')
Out[13]: dshape("TimeDelta(unit=u'ns')")

Why the unicode here?

@mwiebe thoughts on requiring quotes around parameters like "ascii" and "ns"? It'd be a slicker user experience to avoid these, at least for a predefined set of special cases.

dshape('timedelta[ns]')

rather than

dshape('timedelta["ns"]') 
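One way to get the quote-free spelling without touching the grammar for every parameter would be a pre-pass that quotes bare identifiers from a small whitelist. This is purely illustrative, not datashape's parser; the token set is an assumption:

```python
import re

# Illustrative pre-pass: for a whitelist of well-known tokens, treat a
# bare identifier inside [] as if it were quoted, so timedelta[ns]
# parses like timedelta["ns"]. Unknown tokens are left untouched.

_BARE_TOKENS = {'ns', 'us', 'ms', 's', 'D', 'ascii', 'utf-8'}

def quote_bare_params(src):
    def repl(m):
        tok = m.group(1)
        return '["%s"]' % tok if tok in _BARE_TOKENS else m.group(0)
    return re.sub(r'\[([A-Za-z0-9_-]+)\]', repl, src)

assert quote_bare_params('timedelta[ns]') == 'timedelta["ns"]'
assert quote_bare_params('timedelta["ns"]') == 'timedelta["ns"]'  # already quoted: unchanged
```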
mrocklin commented 9 years ago
In [14]: dshape('timedelta["foo"]')
Out[14]: dshape("TimeDelta(unit=u'foo')")

In [15]: dshape('string["ascii"]')
Out[15]: dshape("string['A']")

In [16]: dshape('string["foo"]')
ValueError: Unsupported string encoding u'foo'
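The output above implies the string type canonicalizes encodings through a lookup table, with unknown names raising ValueError. A hedged sketch of that shape (the table contents are illustrative, not the real mapping):

```python
# Sketch: canonicalize encoding aliases to a short code ('ascii' -> 'A'),
# rejecting anything unrecognized. Table contents are assumptions.

_canonical_encodings = {
    'ascii': 'A',
    'A': 'A',
    'utf-8': 'U8',
    'U8': 'U8',
}

def canonical_encoding(enc):
    enc = str(enc)  # normalize unicode to str on Python 2
    try:
        return _canonical_encodings[enc]
    except KeyError:
        raise ValueError("Unsupported string encoding %r" % enc)

assert canonical_encoding(u'ascii') == 'A'
```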
mrocklin commented 9 years ago
In [3]: from_numpy((10,), 'm8[D]')
/home/mrocklin/workspace/datashape/datashape/coretypes.pyc in from_numpy(shape, dt)
   1175         measure = Record(rec)
   1176     else:
-> 1177         measure = CType.from_numpy_dtype(dtype)
   1178 
   1179     if shape == ():

/home/mrocklin/workspace/datashape/datashape/coretypes.pyc in from_numpy_dtype(self, dt)
    720         elif np.issubdtype(dt, np.str_) or np.issubdtype(dt, np.bytes_):
    721             return String(dt.itemsize, 'ascii')
--> 722         raise NotImplementedError("NumPy datatype %s not supported" % dt)
    723 
    724     @property

NotImplementedError: NumPy datatype timedelta64[D] not supported
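The traceback shows `from_numpy_dtype` has no branch for timedelta64 dtypes. NumPy exposes the unit via `np.datetime_data`, so the missing branch could look roughly like this (TimeDelta here is a stand-in, not the real class):

```python
import numpy as np

# Sketch of the missing branch: detect timedelta64 dtypes (kind 'm')
# and pull the unit out with np.datetime_data.

def timedelta_unit(dt):
    dt = np.dtype(dt)
    if dt.kind != 'm':  # 'm' is NumPy's kind code for timedelta64
        raise NotImplementedError("NumPy datatype %s not supported" % dt)
    unit, count = np.datetime_data(dt)
    return unit

assert timedelta_unit('m8[D]') == 'D'
```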
cpcloud commented 9 years ago

ah yes the enigmatic foo unit

mrocklin commented 9 years ago

Let me know if you want help with this. I'm happy to contribute by raising errors, adding code, or both.

cpcloud commented 9 years ago

very useful to have someone raising errors

cpcloud commented 9 years ago

much appreciated

cpcloud commented 9 years ago

fwiw the unicode is stripped out with a .strip('u') in the string type

cpcloud commented 9 years ago

i'm also +1 on this syntax timedelta[unit=us] and/or timedelta[us]

mrocklin commented 9 years ago

> fwiw the unicode is stripped out with a .strip('u') in the string type

That seems highly non-robust. What about "utf-8"? Seems like calling str would resolve that problem in a more direct way.
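The fragility is easy to demonstrate: `str.strip` removes any leading or trailing characters from the given set, so it corrupts encoding names that legitimately begin with `u`, whereas `str()` just drops the Python 2 `u''` repr prefix without touching the content:

```python
# .strip('u') removes any leading/trailing 'u' characters, mangling
# encodings that genuinely start with u:
assert "utf-8".strip('u') == 'tf-8'   # oops: corrupted the encoding name

# str() converts a unicode ASCII name cleanly (on both Python 2 and 3):
assert str(u'utf-8') == 'utf-8'
```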

mrocklin commented 9 years ago

Ah, everything else is guarded by quotes. Still, str is what was intended there...

Sorry if I drone on a bit on some of these things. Datashape works well but is a bit of a coding mess. It was sort of handed from person to person over time and has a few different conflicting voices.

mrocklin commented 9 years ago

Looks like supporting timedelta[us] would require going into parser.py.

mrocklin commented 9 years ago

git blame shows that @mwiebe is the expert there. Let's wait to see what his thoughts are.

cpcloud commented 9 years ago

alrighty @mrocklin ready for another look i think

i removed the xfail on discovery of time-like strings and made fancy timedelta strings parse

cpcloud commented 9 years ago

@mrocklin i used your normalize_time_unit with a small change: basically inverted the dict to get numpy like names out

let me know if that works for you, or i can add the others back if you like

the same inputs still work as in your truncate branch, just now the output is numpy style units
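An inverted-dict normalization along the lines described might look like this sketch (names and table contents are illustrative, not the actual branch's code):

```python
# Hedged sketch of a normalize_time_unit that maps spelled-out names
# to NumPy-style unit codes by inverting a code -> aliases table.

_unit_aliases = {
    'ns': ('ns', 'nanosecond', 'nanoseconds'),
    'us': ('us', 'microsecond', 'microseconds'),
    'ms': ('ms', 'millisecond', 'milliseconds'),
    's':  ('s', 'second', 'seconds'),
    'D':  ('D', 'day', 'days'),
}

# Invert: every lowercase alias points back at its NumPy-style code.
_normalize = {alias.lower(): code
              for code, aliases in _unit_aliases.items()
              for alias in aliases}

def normalize_time_unit(s):
    return _normalize[s.strip().lower()]

assert normalize_time_unit('microseconds') == 'us'
```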

jreback commented 9 years ago

FYI you need to add numpy >= 1.7 as a dep (rather than accepting any version) as it was horribly broken prior to that

jreback commented 9 years ago

also you can steal parsing routines from pandas for string-to-timedelta conversion (tseries/timedeltas.py)

cpcloud commented 9 years ago

+1 on the numpy version

hm i don't think we want pandas as a dep and the complexity of that code looks like a bit much to copypaste

maybe try to import pandas, and if that fails, import the copy-pasted version

crazy idea: how about a small library for parsing timedeltas?

@jreback @mrocklin @mwiebe thoughts here?

fwiw pandas supports a lot more fancy strings than i've currently implemented here.

jreback commented 9 years ago

well pandas always wanted to take the datetime string parsing out anyway

so string-> datetime like would be +1

for scalars and arrays would be nice

cpcloud commented 9 years ago

maybe another library is overkill.... just want to avoid a maintenance nightmare

mwiebe commented 9 years ago

Some quick comments:

cpcloud commented 9 years ago

@mwiebe

> To be consistent with the datetime previously made, timedelta would just have one unit, not be parameterized by unit

is there any issue with upgrading datetime to be parameterized by a unit?

> Putting the timeparse and timedeltaparse functions in datashape seems inconsistent with the library so far, wouldn't that be more in scope for blaze because it's about the values rather than the types?

those functions work the same way that dateparse (this is an alias for dateutil.parser.parse) does: if a value is successfully returned, then pass that value to another discover that's been dispatched on that type. i explicitly tried to be consistent there.

> I support the convenience of datashapes like "string[utf-8]", but it does require some context-sensitive parsing.

this would be a nice to have, and is kind of low on my priority list, and if i did do it, then it would be in a separate PR.

mrocklin commented 9 years ago

I'm not opposed to a pandas dependency. I also wouldn't mind a tiny library for datetime parsing.

Also curious on Mark's thoughts on the benefits/drawbacks of parametrizing datetime.

> Putting the timeparse and timedeltaparse functions in datashape seems inconsistent with the library so far, wouldn't that be more in scope for blaze because it's about the values rather than the types?

For better or worse, discover(str) currently lives in datashape. At the time I felt it was valuable enough for the datashape library to expand scope to include this. We could also move this to Blaze if we felt it was too much.

Punting on removing quotes around parameters seems reasonable. Good to know that it's not a bad idea though. I'll raise an issue.

mwiebe commented 9 years ago

I put my thoughts on the datetime units parameter in a blaze design document, which has now migrated here:

https://github.com/ContinuumIO/libdynd/blob/master/docs/datetime-design.md#no-unit-parameter

I think a nice approach might be to split it into two types: datetime (64-bit) and high precision datetime (128-bit), roughly similar to what's in Bloomberg's libraries here:

https://github.com/bloomberg/blpapi-node/blob/master/deps/blpapi/include-3.8.1.1/blpapi_datetime.h#L63

cpcloud commented 9 years ago

alright, i'll make the unit default to us

mrocklin commented 9 years ago

Seems like a reasonable choice when designing a datetime implementation. If datashape is trying to represent types found in foreign systems then it seems like it might be dangerous to be this opinionated.

As an example. Imagine that we want to translate types between a SQL database and a NumPy array. How do our choices on datashape affect our ability to translate meaningfully?

mrocklin commented 9 years ago

I think that either we have a variety of units or we have no units (as with the current datetime).

If we're going the support-foreign-systems route (which I usually support) then we should probably look around at which systems actually parametrize their timedeltas/intervals:

Pandas: No
NumPy: Yes
DyND: No
Python: No
SQL: ??
Mongo: ??

cpcloud commented 9 years ago

maybe i'm going about this all wrong.

>>> 3 * nd.units.second

where does this live?

jreback commented 9 years ago

pandas and NumPy keep the values internally in nanoseconds while Python keeps them as microseconds

you don't need to keep units except for display purposes, that is, frequency conversion

FYI http://pandas.pydata.org/pandas-docs/stable/timedeltas.html
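The fixed-internal-unit approach described above amounts to storing integer nanoseconds everywhere and converting only at the edges. A sketch of that frequency conversion with exact integer scale factors:

```python
# Store everything as integer nanoseconds; convert at the boundaries.
# Scale factors are exact, so round trips through a coarser unit only
# lose the truncated remainder.

NS_PER = {'ns': 1, 'us': 10**3, 'ms': 10**6, 's': 10**9,
          'm': 60 * 10**9, 'h': 3600 * 10**9, 'D': 86400 * 10**9}

def to_ns(value, unit):
    return value * NS_PER[unit]

def from_ns(ns, unit):
    return ns // NS_PER[unit]

assert to_ns(5, 'us') == 5000
assert from_ns(to_ns(2, 'D'), 'h') == 48
```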

cpcloud commented 9 years ago
In [17]: nd.units.second
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-17-b86df4a47799> in <module>()
----> 1 nd.units.second

AttributeError: 'module' object has no attribute 'units'
mwiebe commented 9 years ago

@cpcloud That's the part of the design I mentioned before in a separate channel ("My thought earlier was to begin a unit type, but didn't get to it"). It's unfortunately not implemented yet, sorry.

cpcloud commented 9 years ago

ah ok

cpcloud commented 9 years ago

http://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_timediff

kind of complex the way things are handled in mysql

  1. fractional exists but isn't stored in the db
  2. when diffing datetimes you have to specify a unit
cpcloud commented 9 years ago

postgres looks fairly sane:

http://www.postgresql.org/docs/9.1/static/datatype-datetime.html

scroll down to intervals

they have units

cpcloud commented 9 years ago

looks like mssql also has units

http://msdn.microsoft.com/en-us/library/ms189794.aspx

cpcloud commented 9 years ago

oracle is similar to postgres

oracle has a sort of pseudo units: years/months and day/hours/minutes/seconds lumped into two categories

"You can specify these differences in terms of years and months, or in terms of days, hours, minutes, and seconds."

https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements003.htm#i38598

cpcloud commented 9 years ago

mongo seems to think only in milliseconds

http://docs.mongodb.org/manual/reference/operator/aggregation/subtract/

mrocklin commented 9 years ago

Need to add pandas to requirements.txt and meta.yaml. @mwiebe are you ok with the added dependency?

cpcloud commented 9 years ago

@mrocklin Right. I'll fix that.

@mrocklin @mwiebe

Re unit vs. no unit. I'm in favor of units. Here's why.

Since datashape and blaze are designed to work in tandem, we have to answer the question "How do I stick a timedelta column from backend A into backend B?" Different backends have different representations of timedeltas and within certain backends there are multiple ways to specify the representation. This isn't so much a problem for blaze expressions (I could imagine e.g., t.diff.minutes giving back a dshape of timedelta[m]) as it is for into.

In the case of MongoDB -> Postgres (for example) how do we specify to Postgres that you want to store the timedelta from mongo as a day interval vs. as a seconds interval? As you can see in the above links the SQL implementations of timedelta are varied in their representation. Instead of forcing backends to use our chosen unit, we can make users specify the unit, which IMO is less fragile. There's more user facing complexity, but fewer assumptions.