Cool. Thanks for working on this. Let me play with it for a sec.
```python
In [5]: dshape('timedelta')
Out[5]: dshape("TimeDelta(unit='us')")

In [6]: assert eval(repr(_)) == _
  File <nofile>, line 1
    TimeDelta(unit='us')
    ^
DataShapeSyntaxError: Unexpected token in datashape
```
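(What's going wrong: `repr` embeds the Python constructor `TimeDelta(unit='us')` inside the `dshape("...")` call, so `eval` hands the parser something that isn't datashape syntax. A minimal sketch of what round-tripping would require, assuming the `timedelta[unit=us]` spelling floated later in this thread:)

```python
# sketch only: for eval(repr(ds)) == ds to hold, str(ds) must itself be
# valid datashape syntax; 'timedelta[unit=us]' is the spelling proposed
# below, not what the library emitted at the time
class TimeDelta(object):
    def __init__(self, unit='us'):
        self.unit = unit

    def __str__(self):
        return "timedelta[unit=%s]" % self.unit   # parseable by dshape()

    def __repr__(self):
        return 'dshape("%s")' % self              # eval-able, given dshape in scope

    def __eq__(self, other):
        return isinstance(other, TimeDelta) and self.unit == other.unit
```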
```python
In [13]: dshape('timedelta["ns"]')
Out[13]: dshape("TimeDelta(unit=u'ns')")
```

Why the unicode here?
@mwiebe thoughts on requiring quotes around parameters like `"ascii"` and `"ns"`? It'd be a slicker user experience to avoid these, at least for a predefined set of special cases:

`dshape('timedelta[ns]')`

rather than

`dshape('timedelta["ns"]')`
```python
In [14]: dshape('timedelta["foo"]')
Out[14]: dshape("TimeDelta(unit=u'foo')")

In [15]: dshape('string["ascii"]')
Out[15]: dshape("string['A']")

In [16]: dshape('string["foo"]')
ValueError: Unsupported string encoding u'foo'
```
```python
In [3]: from_numpy((10,), 'm8[D]')
/home/mrocklin/workspace/datashape/datashape/coretypes.pyc in from_numpy(shape, dt)
   1175         measure = Record(rec)
   1176     else:
-> 1177         measure = CType.from_numpy_dtype(dtype)
   1178
   1179     if shape == ():

/home/mrocklin/workspace/datashape/datashape/coretypes.pyc in from_numpy_dtype(self, dt)
    720         elif np.issubdtype(dt, np.str_) or np.issubdtype(dt, np.bytes_):
    721             return String(dt.itemsize, 'ascii')
--> 722         raise NotImplementedError("NumPy datatype %s not supported" % dt)
    723
    724     @property

NotImplementedError: NumPy datatype timedelta64[D] not supported
```
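For reference, a hedged sketch of the missing branch, assuming `TimeDelta` lives where the traceback suggests and takes a `unit` keyword as in the reprs above; `np.datetime_data` pulls the unit out of the dtype:

```python
import numpy as np

from datashape.coretypes import TimeDelta  # assumed location, per the traceback

def timedelta_from_numpy_dtype(dt):
    """Sketch: map a NumPy timedelta64 dtype to a TimeDelta measure."""
    if np.issubdtype(dt, np.timedelta64):
        unit, count = np.datetime_data(dt)  # e.g. ('D', 1) for 'm8[D]'
        return TimeDelta(unit=unit)
    raise NotImplementedError("NumPy datatype %s not supported" % dt)
```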
ah yes, the enigmatic `foo` unit
Let me know if you want help with this. I'm happy to contribute by raising errors, adding code, or both.
very useful to have someone raising errors
much appreciated
fwiw the unicode is stripped out with a `.strip('u')` in the `string` type
i'm also +1 on this syntax: `timedelta[unit=us]` and/or `timedelta[us]`
> fwiw the unicode is stripped out with a `.strip('u')` in the `string` type
That seems highly non-robust. What about `"utf-8"`? Seems like calling `str` would resolve that problem in a more direct way.
Ah, everything else is guarded by quotes. Still, `str` is what was intended there...
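A quick illustration of the fragility: `str.strip` removes matching characters from both ends of the value, not just a `u'...'` repr prefix.

```python
>>> "utf-8".strip('u')  # eats the leading 'u' of the encoding name itself
'tf-8'
>>> "mu".strip('u')     # and trailing ones too
'm'
```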
Sorry if I drone on a bit on some of these things. Datashape works well but is a bit of a coding mess. It was sort of handed from person to person over time and has a few different conflicting voices.
Looks like supporting `timedelta[us]` would require going into `parser.py`. `git blame` shows that @mwiebe is the expert there. Let's wait to see what his thoughts are.
alrighty @mrocklin, ready for another look i think

i removed the `xfail` on discovery of `time`-like strings and made fancy timedelta strings parse
@mrocklin i used your `normalize_time_unit` with a small change: basically inverted the dict to get numpy-like names out. let me know if that works for you, or i can add the others back if you like. the same inputs still work as in your `truncate` branch, just now the output is numpy-style units.
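Roughly the inversion being described, as a sketch; the alias table here is illustrative, not the actual contents of `normalize_time_unit`:

```python
# many human spellings -> one canonical NumPy-style unit code
_aliases = {
    's':  ['s', 'sec', 'second', 'seconds'],
    'ms': ['ms', 'millisecond', 'milliseconds'],
    'us': ['us', 'microsecond', 'microseconds'],
    'ns': ['ns', 'nanosecond', 'nanoseconds'],
}

# inverted so lookups map alias -> NumPy-style unit
normalize_time_unit = {
    alias: unit for unit, names in _aliases.items() for alias in names
}

assert normalize_time_unit['seconds'] == 's'
assert normalize_time_unit['us'] == 'us'
```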
FYI you need to add numpy >= 1.7 as a dep (rather than accepting any version), as datetime/timedelta support was horribly broken prior to that
also you can steal parsing routines from pandas for string -> timedelta (`tseries/timedeltas.py`)
+1 on the numpy version
hm, i don't think we want pandas as a dep, and the complexity of that code looks like a bit much to copy-paste

maybe try importing pandas, and if that fails, import the copy-pasted version
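i.e., something like this; the pandas path follows the file mentioned above, and the fallback module name is purely hypothetical:

```python
# prefer pandas' parser when it's installed; otherwise fall back to a
# vendored copy (the fallback module name below is hypothetical)
try:
    from pandas.tseries.timedeltas import to_timedelta
except ImportError:
    from datashape._vendored_timedeltas import to_timedelta
```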
crazy idea: how about a small library for parsing timedeltas?
@jreback @mrocklin @mwiebe thoughts here?
fwiw pandas supports a lot more fancy strings than i've currently implemented here.
well, pandas always wanted to take the datetime string parsing out anyway, so string -> datetime-like would be +1

support for scalars and arrays would be nice
maybe another library is overkill.... just want to avoid a maintenance nightmare
Some quick comments:
@mwiebe

> To be consistent with the datetime previously made, timedelta would just have one unit, not be parameterized by unit
is there any issue with upgrading `datetime` to be parameterized by a unit?
> Putting the timeparse and timedeltaparse functions in datashape seems inconsistent with the library so far, wouldn't that be more in scope for blaze because it's about the values rather than the types?
those functions work the same way that `dateparse` (this is an alias for `dateutil.parser.parse`) does: if a value is successfully returned, then pass that value to another `discover` that's been dispatched on that type. i explicitly tried to be consistent there.
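Roughly, the pattern is (a minimal sketch, not datashape's actual implementation):

```python
from dateutil.parser import parse as dateparse  # the alias mentioned above
from datashape import discover, string

def discover_string(s):
    try:
        value = dateparse(s)       # try to parse the string as a datetime
    except (ValueError, OverflowError):
        return string              # not datetime-like: it's just a string
    return discover(value)         # re-dispatch discover on the parsed value
```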
> I support the convenience of datashapes like "string[utf-8]", but it does require some context-sensitive parsing.
this would be a nice-to-have, and is kind of low on my priority list; if i did do it, it would be in a separate PR.
I'm not opposed to a pandas dependency. I also wouldn't mind a tiny library for datetime parsing.
Also curious on Mark's thoughts on the benefits/drawbacks of parametrizing `datetime`.
> Putting the timeparse and timedeltaparse functions in datashape seems inconsistent with the library so far, wouldn't that be more in scope for blaze because it's about the values rather than the types?
For better or worse, `discover(str)` currently lives in `datashape`. At the time I felt it was valuable enough for the `datashape` library to expand scope to include this. We could also move this to Blaze if we felt it was too much.
Punting on removing quotes around parameters seems reasonable. Good to know that it's not a bad idea though. I'll raise an issue.
I put my thoughts on the datetime units parameter in a blaze design document, which has now migrated here:
https://github.com/ContinuumIO/libdynd/blob/master/docs/datetime-design.md#no-unit-parameter
I think a nice approach might be to split it into two types: datetime (64-bit) and high precision datetime (128-bit), roughly similar to what's in Bloomberg's libraries here:
alright, i'll make the unit default to `us`
Seems like a reasonable choice when designing a datetime implementation. If datashape is trying to represent types found in foreign systems then it seems like it might be dangerous to be this opinionated.
As an example, imagine that we want to translate types between a SQL database and a NumPy array. How do our choices in datashape affect our ability to translate meaningfully?
I think that either we have a variety of units or we have no units (as with the current datetime).
If we're going the support-foreign-systems route (which I usually support) then we should probably look around at what systems actually parametrize their timedeltas/intervals:

- Pandas: No
- NumPy: Yes
- DyND: No
- Python: No
- SQL: ??
- Mongo: ??
maybe i'm going about this all wrong.

```python
>>> 3 * nd.units.second
```

where does this live?
pandas and NumPy keep the values internally in nanoseconds, while Python keeps them as microseconds

you don't need to keep units except for display purposes, that is, frequency conversion
FYI http://pandas.pydata.org/pandas-docs/stable/timedeltas.html
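e.g., in NumPy a unit change is just a cast, so the stored unit only matters at the boundary:

```python
>>> import numpy as np
>>> np.timedelta64(1, 'D').astype('m8[ns]')  # one day, viewed in nanoseconds
numpy.timedelta64(86400000000000,'ns')
```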
```python
In [17]: nd.units.second
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-17-b86df4a47799> in <module>()
----> 1 nd.units.second

AttributeError: 'module' object has no attribute 'units'
```
@cpcloud That's the part of the design I mentioned before in a separate channel ("My thought earlier was to begin a unit type, but didn't get to it"). It's unfortunately not implemented yet, sorry.
ah ok
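As a toy sketch of what a `3 * nd.units.second` style API could look like once it exists (purely illustrative; as shown above, DyND has no `nd.units` yet):

```python
import datetime

class Unit(object):
    """Toy unit object: right-multiplying by a count yields a timedelta."""
    def __init__(self, **field):
        self._field = field  # e.g. {'seconds': 1}

    def __rmul__(self, count):
        return datetime.timedelta(
            **{k: count * v for k, v in self._field.items()})

second = Unit(seconds=1)
assert 3 * second == datetime.timedelta(seconds=3)
```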
http://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_timediff
kind of complex the way things are handled in mysql
postgres looks fairly sane: http://www.postgresql.org/docs/9.1/static/datatype-datetime.html

scroll down to intervals; they have units
looks like mssql also has units
oracle is similar to postgres
oracle has a sort of pseudo-units: years/months and days/hours/minutes/seconds, lumped into two categories
"You can specify these differences in terms of years and months, or in terms of days, hours, minutes, and seconds."
https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements003.htm#i38598
mongo seems to think only in milliseconds
http://docs.mongodb.org/manual/reference/operator/aggregation/subtract/
Need to add pandas to `requirements.txt` and `meta.yaml`. @mwiebe are you ok with the added dependency?
@mrocklin Right. I'll fix that.
@mrocklin @mwiebe

Re unit vs. no unit: I'm in favor of units. Here's why.

Since `datashape` and `blaze` are designed to work in tandem, we have to answer the question "How do I stick a `timedelta` column from backend A into backend B?" Different backends have different representations of `timedelta`s, and within certain backends there are multiple ways to specify the representation. This isn't so much a problem for `blaze` expressions (I could imagine, e.g., `t.diff.minutes` giving back a `dshape` of `timedelta[m]`) as it is for `into`.

In the case of MongoDB -> Postgres (for example), how do we specify to Postgres that you want to store the `timedelta` from Mongo as a day interval vs. as a seconds interval? As you can see in the above links, the SQL implementations of `timedelta` vary in their representation. Instead of forcing backends to use our chosen unit, we can make users specify the unit, which IMO is less fragile. There's more user-facing complexity, but fewer assumptions.
closes #109