apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.48k stars 1.02k forks source link

Support `date_bin` on timestamps with timezone, properly accounting for Daylight Savings Time #10602

Open alamb opened 1 month ago

alamb commented 1 month ago

Is your feature request related to a problem or challenge?

Summary

Given this data in UTC timestamps

> select * from t_utc;
+----------------------+
| column1              |
+----------------------+
| 2024-04-30T21:30:00Z |
| 2024-04-30T22:30:00Z |
| 2024-04-30T23:30:00Z |
| 2024-05-01T00:00:00Z |
| 2024-05-01T00:30:00Z |
| 2024-05-01T10:30:00Z |
| 2024-05-01T20:30:00Z |
+----------------------+
7 row(s) fetched.
Elapsed 0.002 seconds.
How to create the data

```sql -- Dates right around the first of May create or replace table t AS VALUES ('2024-04-30T21:30:00'), ('2024-04-30T22:30:00'), ('2024-04-30T23:30:00'), ('2024-05-01T00:00:00'), ('2024-05-01T00:30:00'), ('2024-05-01T10:30:00'), ('2024-05-01T20:30:00') ; create or replace view t_utc as select column1::timestamp AT TIME ZONE 'UTC' as "column1" from t; ```

I would like to be able to get this output (bin the data by their day in Brussels, properly accounting for daylight savings time).

+---------------------------+---------------------+
| time in Brussels          | date_bin            |
+---------------------------+---------------------+
| 2024-04-30T23:30:00+02:00 | 2024-04-30T00:00:00 | <-- in 2024-04-01
| 2024-05-01T00:30:00+02:00 | 2024-05-01T00:00:00 | <-- note this is in 2024-05-01
| 2024-05-01T01:30:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T02:00:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T02:30:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T12:30:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T22:30:00+02:00 | 2024-05-01T00:00:00 |
+---------------------------+---------------------+
7 row(s) fetched.
Elapsed 0.004 seconds.

We can't figure out how to do this at the moment

Details

Broken out from @Abdullahsab3's great ticket https://github.com/apache/datafusion/issues/10368

We would like to apply date_bin and have it bin the dates in local time, properly including daylight savings time calculations

For example, given the following data at 12:01AM in the morning on the first day of each month,I would like to bin it into the first day of each month.

create or replace table t AS
VALUES
  ('2024-01-01T00:00:01'),
  ('2024-02-01T00:00:01'),
  ('2024-03-01T00:00:01'),
  ('2024-04-01T00:00:01'),
  ('2024-05-01T00:00:01'),
  ('2024-06-01T00:00:01'),
  ('2024-07-01T00:00:01'),
  ('2024-08-01T00:00:01'),
  ('2024-09-01T00:00:01'),
  ('2024-10-01T00:00:01'),
  ('2024-11-01T00:00:01'),
  ('2024-12-01T00:00:01')
;

This works as expected when there is no timestamp (all timestamps end up in the correct day)

> select arrow_typeof(column1::timestamp) as arrow_type, column1::timestamp, date_bin(interval '1 day', column1::timestamp) as "date_bin" from t;
+-----------------------------+---------------------+---------------------+
| arrow_type                  | t.column1           | date_bin            |
+-----------------------------+---------------------+---------------------+
| Timestamp(Nanosecond, None) | 2024-01-01T23:59:00 | 2024-01-01T00:00:00 | <-- binned correctly
| Timestamp(Nanosecond, None) | 2024-02-01T23:59:00 | 2024-02-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-03-01T23:59:00 | 2024-03-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-04-01T23:59:00 | 2024-04-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-05-01T23:59:00 | 2024-05-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-06-01T23:59:00 | 2024-06-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-07-01T23:59:00 | 2024-07-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-08-01T23:59:00 | 2024-08-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-09-01T23:59:00 | 2024-09-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-10-01T23:59:00 | 2024-10-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-11-01T23:59:00 | 2024-11-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-12-01T23:59:00 | 2024-12-01T00:00:00 |
+-----------------------------+---------------------+---------------------+
12 row(s) fetched.
Elapsed 0.004 seconds.

However, once the timestamp is in a timezone it has problems:

> create or replace view t_timezone 
as 
select column1::timestamp  AT TIME ZONE 'Europe/Brussels' as "column1" 
from t;
0 row(s) fetched.
Elapsed 0.001 seconds.

The timestamps now binned at incorrectly (the are binned based on UTC midnight, not midnight in Brussels) so for example a time in 2024-01-01 goes into a 2023-12-31 bucket:

> select
   arrow_typeof(column1) as arrow_type,
   column1,
   date_bin(interval '1 day', column1) as "date_bin"
  from t_timezone;
+------------------------------------------------+---------------------------+---------------------------+
| arrow_type                                     | column1                   | date_bin                  |
+------------------------------------------------+---------------------------+---------------------------+
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-01-01T00:00:01+01:00 | 2023-12-31T01:00:00+01:00 | <-- binned in previous day
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-02-01T00:00:01+01:00 | 2024-01-31T01:00:00+01:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-03-01T00:00:01+01:00 | 2024-02-29T01:00:00+01:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-04-01T00:00:01+02:00 | 2024-03-31T01:00:00+01:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-05-01T00:00:01+02:00 | 2024-04-30T02:00:00+02:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-06-01T00:00:01+02:00 | 2024-05-31T02:00:00+02:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-07-01T00:00:01+02:00 | 2024-06-30T02:00:00+02:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-08-01T00:00:01+02:00 | 2024-07-31T02:00:00+02:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-09-01T00:00:01+02:00 | 2024-08-31T02:00:00+02:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-10-01T00:00:01+02:00 | 2024-09-30T02:00:00+02:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-11-01T00:00:01+01:00 | 2024-10-31T01:00:00+01:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-12-01T00:00:01+01:00 | 2024-11-30T01:00:00+01:00 |
+------------------------------------------------+---------------------------+---------------------------+

The third argument to date_bin allows changing the starting offset for a day, so let's set it to midnight in Europe/Brussels time. This almost works but does not take into account daylight savings time so the timestamps are binned incorrectly during DST:

> select
   arrow_typeof(column1) as arrow_type,
   column1,
   date_bin(interval '1 day', column1, '2020-01-01T00:00:00Z'::timestamp AT TIME ZONE 'Europe/Brussels') as "date_bin"
  from t_timezone;
+------------------------------------------------+---------------------------+---------------------------+
| arrow_type                                     | column1                   | date_bin                  |
+------------------------------------------------+---------------------------+---------------------------+
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-01-01T00:00:01+01:00 | 2024-01-01T00:00:00+01:00 | <-- correct!
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-02-01T00:00:01+01:00 | 2024-02-01T00:00:00+01:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-03-01T00:00:01+01:00 | 2024-03-01T00:00:00+01:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-04-01T00:00:01+02:00 | 2024-03-31T00:00:00+01:00 | <-- DST! Binned into previous day
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-05-01T00:00:01+02:00 | 2024-04-30T01:00:00+02:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-06-01T00:00:01+02:00 | 2024-05-31T01:00:00+02:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-07-01T00:00:01+02:00 | 2024-06-30T01:00:00+02:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-08-01T00:00:01+02:00 | 2024-07-31T01:00:00+02:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-09-01T00:00:01+02:00 | 2024-08-31T01:00:00+02:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-10-01T00:00:01+02:00 | 2024-09-30T01:00:00+02:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-11-01T00:00:01+01:00 | 2024-11-01T00:00:00+01:00 | <-- DST over, now correct again
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-12-01T00:00:01+01:00 | 2024-12-01T00:00:00+01:00 |
+------------------------------------------------+---------------------------+---------------------------+
12 row(s) fetched.
Elapsed 0.003 seconds.

Describe the solution you'd like

Describe alternatives you've considered

No response

Additional context

No response

alamb commented 1 month ago

The way you can perform this binning in postgres is somewhat paradoxically to convert a timestamp with a timezone back to a timestamp without timezone and then apply date_bin.

The syntax to convert a timestamp to something without timezone is 🤯 : you apply AT TIME ZONE to a timestamp with a timezone and then it has no timezone) as @mhilton notes in https://github.com/apache/datafusion/issues/10368#issuecomment-2111684373


postgres=# select pg_typeof('2024-05-21T12:00:00'::timestamp AT TIME ZONE 'America/New_York');
        pg_typeof
--------------------------
 timestamp with time zone
(1 row)

postgres=# select pg_typeof('2024-05-21T12:00:00'::timestamp AT TIME ZONE 'America/New_York' AT TIME ZONE 'Europe/Brussels'
);
          pg_typeof
-----------------------------
 timestamp without time zone. <-- WTF no timezone as the result of AT TIME ZONE
(1 row)
alamb commented 1 month ago

If we cast using arrow_cast back to Timestamp(Nanosecond, None) the binning does appear to work correctly

> create or replace view t_roundtrip as select arrow_cast(column1, 'Timestamp(Nanosecond, None)') as "column1" from t;
0 row(s) fetched.
Elapsed 0.002 seconds.

> select * from t_roundtrip;
+---------------------+
| column1             |
+---------------------+
| 2024-01-01T00:00:01 | <--- timestamps now have no timezone
| 2024-02-01T00:00:01 |
| 2024-03-01T00:00:01 |
| 2024-04-01T00:00:01 |
| 2024-05-01T00:00:01 |
| 2024-06-01T00:00:01 |
| 2024-07-01T00:00:01 |
| 2024-08-01T00:00:01 |
| 2024-09-01T00:00:01 |
| 2024-10-01T00:00:01 |
| 2024-11-01T00:00:01 |
| 2024-12-01T00:00:01 |
+---------------------+
12 row(s) fetched.
Elapsed 0.001 seconds.

> select arrow_typeof(column1) as arrow_type, column1, date_bin(interval '1 day', column1) as "date_bin" from t_roundtrip;
+-----------------------------+---------------------+---------------------+
| arrow_type                  | column1             | date_bin            |
+-----------------------------+---------------------+---------------------+
| Timestamp(Nanosecond, None) | 2024-01-01T00:00:01 | 2024-01-01T00:00:00 | <-- bins are as desired
| Timestamp(Nanosecond, None) | 2024-02-01T00:00:01 | 2024-02-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-03-01T00:00:01 | 2024-03-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-04-01T00:00:01 | 2024-04-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-05-01T00:00:01 | 2024-05-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-06-01T00:00:01 | 2024-06-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-07-01T00:00:01 | 2024-07-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-08-01T00:00:01 | 2024-08-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-09-01T00:00:01 | 2024-09-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-10-01T00:00:01 | 2024-10-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-11-01T00:00:01 | 2024-11-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-12-01T00:00:01 | 2024-12-01T00:00:00 |
+-----------------------------+---------------------+---------------------+
12 row(s) fetched.
Elapsed 0.003 seconds.
alamb commented 1 month ago

Given the statement in the description, here is the best I can come up with using arrow_cast

-- Times in brussels
WITH t_brussels
AS (
  SELECT
    column1 AT TIME ZONE 'Europe/Brussels' as ts -- timestamp in specified timezone
  FROM t_utc
)
SELECT
  ts as "time in Brussels",
  date_bin(
    interval '1 day',
    arrow_cast(ts, 'Timestamp(Nanosecond, None)'),
    '2020-01-01T00:00:00Z'::timestamp
  ) as date_bin
FROM
  t_brussels;

+---------------------------+---------------------+
| time in Brussels          | date_bin            |
+---------------------------+---------------------+
| 2024-04-30T23:30:00+02:00 | 2024-04-30T00:00:00 |
| 2024-05-01T00:30:00+02:00 | 2024-04-30T00:00:00 | <-- this is in the wrong bin
| 2024-05-01T01:30:00+02:00 | 2024-04-30T00:00:00 | <-- this in in the wrong bin 
| 2024-05-01T02:00:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T02:30:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T12:30:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T22:30:00+02:00 | 2024-05-01T00:00:00 |
+---------------------------+---------------------+
7 row(s) fetched.
Elapsed 0.004 seconds.
alamb commented 1 month ago

@mhilton and I agree that if we had the functionality suggested by @Abdullahsab3's on https://github.com/apache/datafusion/issues/10368#issue-2277903243

given a UTC timestamp, I would like to have that timestamp in local time of a given timezone

We think we could get the right value out of date_bin.

For example, If we had a function like remove_timezone that behaved like

select remove_timezone("time in Brussels")

+---------------------------+---------------------+
| time in Brussels          | remove_timezone            |
+---------------------------+---------------------+
| 2024-04-30T23:30:00+02:00 | 2024-04-30T23:30:00 |
| 2024-05-01T00:30:00+02:00 | 2024-05-01T00:30:00 | <-- timezone removed (no +02:00, but value not adjusted)
| 2024-05-01T01:30:00+02:00 | 2024-05-01T01:30:00 |
| 2024-05-01T02:00:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T02:30:00+02:00 | 2024-05-01T02:30:00 |
| 2024-05-01T12:30:00+02:00 | 2024-05-01T12:30:00 |
| 2024-05-01T22:30:00+02:00 | 2024-05-01T22:30:00 |
+---------------------------+---------------------+

Note that this is different than arrow_cast("time in Brussels", 'Timestamp(Nanosecond, None)') because arrow_cast will adjust the timestamp values

-- Times in brussels
WITH t_brussels
AS (
  SELECT
    column1 AT TIME ZONE 'Europe/Brussels' as ts -- timestamp in specified timezone
  FROM t_utc
)
SELECT
  ts as "time in Brussels",
  arrow_cast(ts, 'Timestamp(Nanosecond, None)') as arrow_cast
FROM
  t_brussels;

+---------------------------+---------------------+
| time in Brussels          | arrow_cast          |
+---------------------------+---------------------+
| 2024-04-30T23:30:00+02:00 | 2024-04-30T21:30:00 |
| 2024-05-01T00:30:00+02:00 | 2024-04-30T22:30:00 | <-- note this is now in 2024-04-30 not 2024-05-01
| 2024-05-01T01:30:00+02:00 | 2024-04-30T23:30:00 |
| 2024-05-01T02:00:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T02:30:00+02:00 | 2024-05-01T00:30:00 |
| 2024-05-01T12:30:00+02:00 | 2024-05-01T10:30:00 |
| 2024-05-01T22:30:00+02:00 | 2024-05-01T20:30:00 |
+---------------------------+---------------------+
7 row(s) fetched.
Elapsed 0.003 seconds.
alamb commented 1 month ago

My suggested next steps for this ticket:

  1. Someone prototype the "strip_timezone" function as a ScalarUDF and verify that we can in fact we can achieve the expected result from date_bin
  2. If we can achieve the expected results, then file another ticket / proposal with how to integrate this into DataFusion (e.g. strip_timezone function, and/or update the arrow_cast kernel, and/or add syntax like WITHOUT TIME ZONE)
appletreeisyellow commented 1 month ago

I'd like to work on this issue 🙋‍♀️

tustvold commented 1 month ago

The challenge with an approach that relies on stripping the timezone, is you end up back with the ambiguous timestamp issue. The UX of relying on users to strip the timezone is also rather unfortunate. I'm curious as to why we wouldn't want to just make date_bin timezone aware? This would appear to have a lot less potential edge cases

alamb commented 1 month ago

I'm curious as to why we wouldn't want to just make date_bin timezone aware? This would appear to have a lot less potential edge cases

I believe @mhilton had thought about this and maybe has some thoughts

mhilton commented 1 month ago

I'm curious as to why we wouldn't want to just make date_bin timezone aware?

I'm not sure that it's obvious how a time-zone aware date_bin would behave. For example On the day in october when many time zones repeat an hour would the two hours count into a single bin? What about if you had hour long bins that start 30 minutes into the hour? Given that the two hours are easily distinguishable the answer is probably not, and if you really want that behviour then stripping time zone information would be a reasonable way to get it.

If we restricted timezone awareness to intervals of unit days then that would be much more reasonable. We could make it so that date_bin understands the number of hours in any given day in a time zone and bins a time accordingly. Anyone needing a strict 24 hour binning could specify the interval in hour units.

tustvold commented 1 month ago

zones repeat an hour would the two hours count into a single bin

This would be the intuitive thing for it to do IMO, if people don't want this, they should cast to a timezone that doesn't have DST

appletreeisyellow commented 1 month ago

To make date_bin timezone aware, there are some edge cases we need to consider when design it:

  1. Daylight Saving Time (DST) Transitions:
    • Spring Forward: When the clocks move forward, there is a "missing" hour.
    • Fall Back: When the clocks move backward, there is an "extra" hour. For example, in US central time zone, when DST ends at 2:00 AM, the clocks are set back to 1:00 AM. This means that there are two 1:20 AM in a day. If a user do a date_bin with 10 min interval, how to handle the returned data? Aggregate them? Return two sets of data?
  2. Crossing Midnight Boundaries:
    • Ensure that timestamps are correctly binned at the right local day and hour boundaries, especially when converting from UTC to a local timezone. This edge case is what this issue is trying to solve.
  3. Timezone Offsets:
    • Different timezones have different offsets from UTC, and these can change over time (e.g., due to DST).
    • For certain timezones, ensure the offset works correctly at the 30-min mark when DST happens. e.g.:
      • Lord Howe Island Time Zone (LHST) in Australia shifts time by 30 min during DST
      • Nepal (NPT) shifts by 45 min
      • Chatham Islands, New Zealand (CHAST) shifts by 12 hr and 45 min.
    • Ensure the function correctly applies the current offset for the given timestamp, considering historical changes in timezone rules. e.g.:
      • America/Sao_Paulo stoped doing DST in 2019
      • Russia set the clocks ahead permanently in 2011, then was reversed in 2014
      • Iceland, Turkey, Egypt, India, and China
  4. Leap Years and Leap Seconds:
    • Leap Years: February 29th is handled correctly in leap years.
    • Leap Seconds: this is less common. e.g. the most recent one was on Dec. 31, 2016, a leap second was introduced at 23:59:60 UTC
  5. Handling Null and Invalid Timestamps:
    • Ensure that null or invalid timestamps are handled gracefully, either by ignoring them or providing a default bin. e.g. '2021-03-28T02:30:00' AT TIME ZONE 'Europe/Brussels' does not exist
  6. Time Precision:
    • Ensure that the function handles different precisions of timestamps correctly (e.g., seconds, milliseconds, nanoseconds).
appletreeisyellow commented 1 month ago
  • Fall Back: When the clocks move backward, there is an "extra" hour. For example, in US central time zone, when DST ends at 2:00 AM, the clocks are set back to 1:00 AM. This means that there are two 1:20 AM in a day. If a user do a date_bin with 10 min interval, how to handle the returned data? Aggregate them? Return two sets of data?

I want to highlight this open-ended question and hear how others think

Abdullahsab3 commented 1 month ago

Thanks for filing the ticket and for all the detailed explanations! very enriching

I wonder whether the Postgres behavior is actually that bad. Though it looks weird, it still is generic enough to make it widely applicable. The issue with making date_bin specifically timezone-aware in my opinion is the fact that future or similar grouping functions (or rolling windows functions) will also have to be implemented in a timezone-aware way, accounting for similar pitfalls and edge cases. The same problem will also occur in other functionalities; for example if you would like to return local time. I personally think that the problem might be better tackled as an additional time functionality, which may be the same way that Postgres does it.

The postgres way of converting UTC to local time of a given timezone is:

select '2024-05-21T12:00:00Z'::timestamp AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels';

timezone
--
2024-05-21 14:00:00

As already mentioned by Andrew, the AT TIME ZONE operator in postgres converts a timezone-aware timestamp to local time (with no offset or timezone information), and local time to a timezone-aware timestamp. Though the overloaded functionalities of the AT TIME ZONE operator in Postgres are weird, they are definitely usable in my opinion. I also like the idea of having a separate to_local_time/strip_timezone function, though this would break Datafusions compatibility with Postgres.

alamb commented 1 month ago

Thank you @tustvold and @Abdullahsab3 and @mhilton and @appletreeisyellow for the thoughts.

From my perspective, the current (non timezone aware) date_bin function has the benefit of being (relatively) simple to implement and fast (as it is some arithmetic calculation on the underlying integer value without having to do DST conversions per row)

Given the differences in underlying timestamp representation between arrow and postgres I do think some difference is likely inevitable and thus likely not a deal breaker.

Here are my suggested next steps @appletreeisyellow tries to prototype one or both proposals and see if we can get it to produce the desired results:

  1. create a to_local_time function
  2. Modify to the date_bin function to make it timezone aware

I think the to_local_time might be the simplest approach.

Abdullahsab3 commented 1 month ago

I was looking into this issue again. I am still validating this, but I found that using to_char(time AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '%F %X') also results in correct local time. (see https://docs.rs/chrono/latest/chrono/format/strftime/index.html)

appletreeisyellow commented 1 month ago

but I found that using to_char(time AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '%F %X') also results in correct local time.

@Abdullahsab3 Nice find! Seems like using to_char() matches what you described in https://github.com/apache/datafusion/issues/10368 for to_local_time(). I'm curious if you are satisfied with to_char() solution? Or is there anything that to_char() cannot achieve?

I had been prototyping what @alamb suggested above since yesterday. I was able to do step 1. Note that the to_local_time function here is different from the one suggested in https://github.com/apache/datafusion/issues/10368. I was about to do step 2, and saw your message and want to check if we need to re-evaluate this issue

appletreeisyellow commented 1 month ago

I am able to verified that to_char() works in date_bin() on timestamp with timezone and daylight saving time 🎉

select
   arrow_typeof(column1) as arrow_type,
   column1,
   date_bin(interval '1 day', to_char(column1 AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '%F %X')) as "date_bin"
from t_timezone;
+------------------------------------------------+---------------------------+---------------------+
| arrow_type                                     | column1                   | date_bin            |
+------------------------------------------------+---------------------------+---------------------+
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-01-01T00:00:01+01:00 | 2024-01-01T00:00:00 | <-- binned correctly
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-02-01T00:00:01+01:00 | 2024-02-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-03-01T00:00:01+01:00 | 2024-03-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-04-01T00:00:01+02:00 | 2024-04-01T00:00:00 |  <-- binned correctly on daylight savings time
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-05-01T00:00:01+02:00 | 2024-05-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-06-01T00:00:01+02:00 | 2024-06-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-07-01T00:00:01+02:00 | 2024-07-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-08-01T00:00:01+02:00 | 2024-08-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-09-01T00:00:01+02:00 | 2024-09-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-10-01T00:00:01+02:00 | 2024-10-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-11-01T00:00:01+01:00 | 2024-11-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-12-01T00:00:01+01:00 | 2024-12-01T00:00:00 |
+------------------------------------------------+---------------------------+---------------------+
12 row(s) fetched.
Elapsed 0.024 seconds.
select
  column1 as "time in UTC",
  column1 AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels' as "time in Brussels",
  date_bin(interval '1 day', to_char(column1 AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '%F %X')) as "date_bin"
from t_utc;
+----------------------+---------------------------+---------------------+
| time in UTC          | time in Brussels          | date_bin            |
+----------------------+---------------------------+---------------------+
| 2024-04-30T21:30:00Z | 2024-04-30T23:30:00+02:00 | 2024-04-30T00:00:00 |
| 2024-04-30T22:30:00Z | 2024-05-01T00:30:00+02:00 | 2024-05-01T00:00:00 | <-- binned correctly
| 2024-04-30T23:30:00Z | 2024-05-01T01:30:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T00:00:00Z | 2024-05-01T02:00:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T00:30:00Z | 2024-05-01T02:30:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T10:30:00Z | 2024-05-01T12:30:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T20:30:00Z | 2024-05-01T22:30:00+02:00 | 2024-05-01T00:00:00 |
+----------------------+---------------------------+---------------------+
7 row(s) fetched.
Elapsed 0.023 seconds.
How to create the data
The following code were given by Andrew from [here](https://github.com/apache/datafusion/issues/10602#issue-2308485729). Copy here again for easier review. ```sql -- Dates right around the first of May create or replace table t AS VALUES ('2024-04-30T21:30:00'), ('2024-04-30T22:30:00'), ('2024-04-30T23:30:00'), ('2024-05-01T00:00:00'), ('2024-05-01T00:30:00'), ('2024-05-01T10:30:00'), ('2024-05-01T20:30:00') ; create or replace view t_utc as select column1::timestamp AT TIME ZONE 'UTC' as "column1" from t; ``` ```sql create or replace table t AS VALUES ('2024-01-01T00:00:01'), ('2024-02-01T00:00:01'), ('2024-03-01T00:00:01'), ('2024-04-01T00:00:01'), ('2024-05-01T00:00:01'), ('2024-06-01T00:00:01'), ('2024-07-01T00:00:01'), ('2024-08-01T00:00:01'), ('2024-09-01T00:00:01'), ('2024-10-01T00:00:01'), ('2024-11-01T00:00:01'), ('2024-12-01T00:00:01') ; create or replace view t_timezone as select column1::timestamp AT TIME ZONE 'Europe/Brussels' as "column1" from t; ```
Abdullahsab3 commented 1 month ago

Very nice! Thanks for verifying @appletreeisyellow. I was able to do some minimal validations today on our systems and it seems indeed to be binning correctly. Happy to see this as an alternative to what I was doing as a fix for this issue

My main concerns with the use of to_char are:

alamb commented 1 month ago

@appletreeisyellow and I just had a chat about what to do about this ticket. It is great that we seem to have found a workaround there are at least two potential issues we see with the workaround:

  1. It is non obvious and likely would not be obvious to other future users
  2. It may have performance issues, as @Abdullahsab3 notes in https://github.com/apache/datafusion/issues/10602#issuecomment-2140461400

We think it would be great to improve DataFusion so future users (both of InfluxDB and other systems) who use date_bin have fast queries and are not confused.

Ideal Behavior

We think the ideal user experience would be as @tustvold suggests for date_bin to be timezone aware and just "do the right thing and bin dates respecting the timezone of the timestamp".

While this might be different than what postgres does, the underlying treatment if timestamps is already different, so I don't think we need to be quirk for quirk compatible.

Proposal

Thus we propose @appletreeisyellow will try and complete her prototype work to make date_bin do the intuitive thing with timestamps with timezones

It is likely we can't devote a large amount of time to this project, so if the implementation is taking too long we may have to abandon this effort or go with a less optimal route (e.g. a function to strip off timezones like Postgres)

alamb commented 1 month ago

I filed https://github.com/apache/arrow-rs/pull/5826 to document more precisely what is going on with casting

appletreeisyellow commented 3 weeks ago

I wrote up a high-level doc on how to implement the date_bin with timezone awareness. If you cannot access the document, feel free to request access and I'll give you the permission to view/comment.

Abdullahsab3 commented 3 weeks ago

Another thing that needs to be taken into consideration: I think the returned timestamp from date_bin needs to be in the timezone that was provided as an argument (in the internal Arrow represenation). For example: The returned timestamp is in correct local time, but the underlying Arrow kernel has no timezone information about it (which I think means that the timestamp will be interpreted as being in UTC)

❯ select to_char('2024-04-30T02:00:00' at time zone 'UTC' at time zone 'Europe/Brussels', '%F %X')::timestamp;
+----------------------------------------------------+
| to_char(Utf8("2024-04-30T02:00:00"),Utf8("%F %X")) |
+----------------------------------------------------+
| 2024-04-30T04:00:00                                |
+----------------------------------------------------+
1 row in set. Query took 0.002 seconds.

❯ select arrow_typeof(to_char('2024-04-30T02:00:00' at time zone 'UTC' at time zone 'Europe/Brussels', '%F %X')::timestamp);
+------------------------------------------------------------------+
| arrow_typeof(to_char(Utf8("2024-04-30T02:00:00"),Utf8("%F %X"))) |
+------------------------------------------------------------------+
| Timestamp(Nanosecond, None)                                      |
+------------------------------------------------------------------+
1 row in set. Query took 0.003 seconds.

When we add the timezone information:

❯ select to_char('2024-04-30T02:00:00' at time zone 'UTC' at time zone 'Europe/Brussels', '%F %X') at time zone 'Europe/Brussels';
+----------------------------------------------------+
| to_char(Utf8("2024-04-30T02:00:00"),Utf8("%F %X")) |
+----------------------------------------------------+
| 2024-04-30T04:00:00+02:00                          |
+----------------------------------------------------+
1 row in set. Query took 0.001 seconds.

❯ select arrow_typeof(to_char('2024-04-30T02:00:00' at time zone 'UTC' at time zone 'Europe/Brussels', '%F %X') at time zone 'Europe/Brussels');
+------------------------------------------------------------------+
| arrow_typeof(to_char(Utf8("2024-04-30T02:00:00"),Utf8("%F %X"))) |
+------------------------------------------------------------------+
| Timestamp(Nanosecond, Some("Europe/Brussels"))                   |
+------------------------------------------------------------------+
1 row in set. Query took 0.001 seconds.

The timezone information there is important, as the timestamp will be otherwise interpreted by external integrations (such as Grafana) as being in UTC.

Note that applying at time zone 'Europe/Brussels' on the results of to_char will never fail during daylight savings hours, as the returned value from to_char is always in localtime, meaning that the 'missing/additional hour' should not be an issue for the at time zone operator.

TLDR; if you're using the to_char workaround now for localisation, you should add the timezone information to the results

appletreeisyellow commented 3 weeks ago

I think the returned timestamp from date_bin needs to be in the timezone that was provided as an argument (in the internal Arrow represenation)

Thanks for highlighting it. Yes, this is the expected behavior. The timezone representation of the returned timestamp will be the same as the input timestamp. For example:

select arrow_typeof(date_bin(interval '1 day', '2024-03-30T00:00:20')); +--------------------------------------------------------------------------------------------------+ | arrow_typeof(date_bin(IntervalMonthDayNano("18446744073709551616"),Utf8("2024-03-30T00:00:20"))) | +--------------------------------------------------------------------------------------------------+ | Timestamp(Nanosecond, None) | +--------------------------------------------------------------------------------------------------+


- if the input has a timezone in `UTC`, then the date_bin returns timezone as `Some('UTC')`
```sql
select arrow_typeof(date_bin(interval '1 day', '2024-03-30T00:00:20Z' AT TIME ZONE 'UTC'));
+---------------------------------------------------------------------------------------------------+
| arrow_typeof(date_bin(IntervalMonthDayNano("18446744073709551616"),Utf8("2024-03-30T00:00:20Z"))) |
+---------------------------------------------------------------------------------------------------+
| Timestamp(Nanosecond, Some("UTC"))                                                                |
+---------------------------------------------------------------------------------------------------+

select arrow_typeof(date_bin(interval '1 day', '2024-03-30T00:00:20Z' AT TIME ZONE 'Europe/Brussels')); +---------------------------------------------------------------------------------------------------+ | arrow_typeof(date_bin(IntervalMonthDayNano("18446744073709551616"),Utf8("2024-03-30T00:00:20Z"))) | +---------------------------------------------------------------------------------------------------+ | Timestamp(Nanosecond, Some("Europe/Brussels")) | +---------------------------------------------------------------------------------------------------+