Closed waitingkuo closed 2 years ago
Thanks @waitingkuo -- I plan to review this proposal more carefully later today or tomorrow. cc @avantgardnerio and @andygrove and @ovr who I think have been thinking about /working on timestamp related things as well
(I will respond individually to the parts of the proposal)
Timestamp Proposal
- make timestamp xxx work like postgresql does
- add timestamp with time zone, i believe there're lots of works and discussions to do:...
- make the local time zone as UTC by default
- add set time zone to xxx to change the local time zone
I think this sounds like a great idea 👍
Date Proposal
- I don't think there're any issues for Date
- We could consider add another 2 ISO 8601 formats (i.e. 19990108 and 990108) Chrono strictly follows ISO 8601. I think supporting all 8601 date formats makes sense.
I agree
Time Proposal
- I personally never used time with time zone. I have no clue when we need it. arrow-rs's time datatype contains no timezone. Perhaps we need not to implement this.
- let's wait for https://github.com/apache/arrow-datafusion/pull/3010
I agree -- until there is a compelling reason for time (rather than timestamp) it seems reasonable to postpone additional work on this
Interval Proposal
- Consider make INTERVAL xxx outputs Interval(MonthDayNano) instead of Interval(DayTime) as it's easier to align with our Timestamp(NanoSecond, None)
Here is the spec in case that is interesting: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L354-L372
I think this is a good idea -- Interval(MonthDayNano) is a relatively new addition to the arrow standard, so I think DataFusion might default to Interval(DayTime)
is because that was all that was available at the time of initial implementation.
- Carefully design the outputs for operators like what postgresql has
Agree
- we could think about whether we really need timestamp with time zone - timestamp ... this is what postgresql has
I don't quite understand the question about "really need timestamp with time zone - timestamp" -- as I thought your proposal for timezones was to fill out the feature set for timestamp with timezone 🤔
- we could think about whether we really need timestamp with time zone - timestamp ... this is what postgresql has
I don't quite understand the question about "really need timestamp with time zone - timestamp" -- as I thought your proposal for timezones was to fill out the feature set for timestamp with timezone 🤔
i originally thought that comparing timestamp with time zone
and timestamp without time zone
is ambiguous. it turns out that postgresql simply drop the time zone and keep the native timestamp. this is quite straightforward for our implementation as well (i.e. Drop the time zone from Timestamp(TimeUnit, Optinal(Timezone))
. i'll update my proposal for this
@waitingkuo what do you think about making a list on this ticket of all the various timestamp / time related open items? I think there are a non trivial number of them, such as https://github.com/apache/arrow-datafusion/issues/194, https://github.com/apache/arrow-datafusion/issues/3103 and several others? Alternately, we can make another ticket that collects the subtasks
perhaps the reason here is the difference of resolution
There is no reason, it just isn't implemented. I implemented add for Date32 & Date64, but not Timestamp yet.
it breaks while we have hour
I believe a Date has the resolution of 24 hours, so adding 1 or 23 hours should have no effect?
INTERVAL xxx outputs Interval(MonthDayNano)
I'm wary of this because it changes Interval
from being a true Duration (fixed unit of wall clock time, e.g. milliseconds elapsed) into a Gregorian calendar unit (due to supporting Month).
2. add another 2 ISO 8601 formats
I personally lean towards only supporting one, especially if Chrono only supports one. https://xkcd.com/1179/ https://xkcd.com/927/
@waitingkuo I think 95% of what you propose is very sensible. The other 5% I don't think is incorrect, but does raise cause for consideration.
Some random thoughts:
@waitingkuo what do you think about making a list on this ticket of all the various timestamp / time related open items? I think there are a non trivial number of them, such as #194, #3103 and several others? Alternately, we can make another ticket that collects the subtasks
@alamb sure! i'll do it
it breaks while we have hour
I believe a Date has the resolution of 24 hours, so adding 1 or 23 hours should have no effect?
this issue is fixed, it works now
- add another 2 ISO 8601 formats
I personally lean towards only supporting one, especially if Chrono only supports one. https://xkcd.com/1179/ https://xkcd.com/927/
agree, the current one is the most widely used
Due to all of the above I've found it useful in OLTP systems to keep all timestamps as UTC and convert to the user locale when displaying.
I think this is best practice for all new OLTP and OLAP systems (always store data as UTC and then render into user locale as desired).
I think what makes it tougher in OLAP system is that the data being processed is often created outside of Datafusion (aka in parquet or CSV files from some other system) which can and unforuntaely do write dates / times / timestamps other than UTC
In general, I would be a big fan of trying if possible of having datafusion normlize all data on input to UTC prior to processing but I worry this might not be a good idea for performance reasons.
In general, I would be a big fan of trying if possible of having datafusion normlize all data on input to UTC prior to processing but I worry this might not be a good idea for performance reasons.
Postgrseql deal with this in the similar way https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-TIMEZONES
All timezone-aware dates and times are stored internally in UTC. They are converted to local time in the zone specified by the TimeZone configuration parameter before being displayed to the client.
e.g.
willy=# create table test02 (c1 timestamptz);
CREATE TABLE
willy=# insert into test02 (c1) values ('2000-01-01T00:00:00+08:00');
INSERT 0 1
willy=# insert into test02 (c1) values ('2000-01-01T00:00:00+07:00');
INSERT 0 1
willy=# insert into test02 (c1) values ('2000-01-01T23:00:00+07:00');
INSERT 0 1
willy=# select * from test02;
c1
------------------------
2000-01-01 00:00:00+08
2000-01-01 01:00:00+08
2000-01-02 00:00:00+08
(3 rows)
But it does use local time while we try to extract the day e.g.
willy=# select extract(day from c1), COUNT(*) from test02 group by extract(day from c1);
extract | count
---------+-------
1 | 2
2 | 1
(2 rows)
When we are done with the discussion, perhaps we can close this issue and use https://github.com/apache/arrow-datafusion/issues/3148 as the coordination point for development
When we are done with the discussion, perhaps we can close this issue and use #3148 as the coordination point for development
i think we could close it for now and move the discussion to #3148 and submit new ticket to discuss some big topics (e.g. timezone)
!!! Please correct me if i'm wrong !!!
Intro
Design Principle
Let's Begin with Postgresql's Date/Time
Let's Start to Compare
Timestamp
Postgresql
timestamp
andtimestamp with time zone
. (note that time zone is included in these 8 bytes)1970-01-01T00:00:00
timestamp(0)
totimestamp(6)
timestamp(0)
rounds to secondstimestamp(3)
rounds to millisecondstimestamp(6)
rounds to microsecondstimestamp 'xxx'
outputtimestamp
xxx
does't contain time zone info, it just works as what you thinkxxx
contains time zone info, time zone is just ignored. (i believe that this is a surprise for some people) e.g.timestamp with time zone 'xxx'
outputtimestamp with time
1 ifxxx
contains no time zone, it assume it's local time2 if
xxx
contains time zone, it'll be converted to your local time zoneDatafusion
Timestamp(TimeUnit, Option<String>)
TimeUnit::Second
TimeUnit::MilliSecond
TImeUnit::MicroSecond
TimeUnit::NanoSecond
1970-01-01T00:00:00
Timestamp(TimeUnit::NanoSecond, None)
timestamp
literal but notimestamp with time zone
timestamp xxx
outputsTimestamp(TimeUnit::NanoSecond, None)
xxx
contains no time zone, it automatically applies local time, parse it, convert it to utc time zone, and then drop the time zone #3080xxx
contains time zone, it's parsed correctly, then converted to utc time zone, and then drop the time zoneProposal
timestamp xxx
work like postgresql doestimestamp with time zone
, i believe there're lots of works and discussions to do: apache/arrow-rs#1936 apache/arrow-rs#1380 apache/arrow-rs#597set time zone to xxx
to change the local time zoneDate
postgresql image
Posgresql
Datafusion
1999-01-08
Proposal
Date
19990108
and990108
) Chrono strictly follows ISO 8601. I think supporting all 8601 date formats makes sense.Time
Postgresql
time xxx
outputtime
that requires 8 bytestime xxx with time zone
that requires 12 bytes, I have no idea why we need 4 more bytes here sincetimestamp with time zone
only requires 8 bytesDatafusion
time
literal for now, let's wait for #3010Proposal
time with time zone
. I have no clue when we need it.arrow-rs
'stime
datatype contains no timezone. Perhaps we need not to implement this.Interval
Postgresql
Datafusion
reference: https://github.com/apache/arrow-rs/blob/master/arrow/src/datatypes/datatype.rs#L237
Interval(IntervalUnit)
we have following units
IntervalUnit::YearMonth
IntervalUnit::DayTime
IntervalUnit::MonthDayNano
interval xxx
outputInterval(DayTime)
interval xxx
support floating number secondsif it's less than microsecond, it'll truncated
we cannot add
interval(DayTime)
toTimestamp(NanoSecond, None)
, perhaps the reason here is the difference of resolutionwe can add
interval(DayTime)
toDate
it breaks while we have hour (or other smaller units) interval #3093 this is solved
We don't have
Time
now, let's wait for #3010Proposal
INTERVAL xxx
outputs Interval(MonthDayNano) instead of Interval(DayTime) as it's easier to align with ourTimestamp(NanoSecond, None)
we could think about whether we really needWhile comparingtimestamp with time zone - timestamp
...Timestamp(TimeUnit, TimeZone)
Timestamp(TimeUnit, None)
. the one with time zone will be converted to utc and drop the timezone (it simply drop the timezone internally).this is what postgresql has