Open alamb opened 6 months ago
The way you can perform this binning in postgres is somewhat paradoxically to convert a timestamp with a timezone back to a timestamp without timezone and then apply date_bin
.
The syntax to convert a timestamp to something without timezone is 🤯 : you apply AT TIME ZONE
to a timestamp with a timezone and then it has no timezone) as @mhilton notes in https://github.com/apache/datafusion/issues/10368#issuecomment-2111684373
postgres=# select pg_typeof('2024-05-21T12:00:00'::timestamp AT TIME ZONE 'America/New_York');
pg_typeof
--------------------------
timestamp with time zone
(1 row)
postgres=# select pg_typeof('2024-05-21T12:00:00'::timestamp AT TIME ZONE 'America/New_York' AT TIME ZONE 'Europe/Brussels'
);
pg_typeof
-----------------------------
timestamp without time zone. <-- WTF no timezone as the result of AT TIME ZONE
(1 row)
If we cast using arrow_cast back to Timestamp(Nanosecond, None)
the binning does appear to work correctly
> create or replace view t_roundtrip as select arrow_cast(column1, 'Timestamp(Nanosecond, None)') as "column1" from t;
0 row(s) fetched.
Elapsed 0.002 seconds.
> select * from t_roundtrip;
+---------------------+
| column1 |
+---------------------+
| 2024-01-01T00:00:01 | <--- timestamps now have no timezone
| 2024-02-01T00:00:01 |
| 2024-03-01T00:00:01 |
| 2024-04-01T00:00:01 |
| 2024-05-01T00:00:01 |
| 2024-06-01T00:00:01 |
| 2024-07-01T00:00:01 |
| 2024-08-01T00:00:01 |
| 2024-09-01T00:00:01 |
| 2024-10-01T00:00:01 |
| 2024-11-01T00:00:01 |
| 2024-12-01T00:00:01 |
+---------------------+
12 row(s) fetched.
Elapsed 0.001 seconds.
> select arrow_typeof(column1) as arrow_type, column1, date_bin(interval '1 day', column1) as "date_bin" from t_roundtrip;
+-----------------------------+---------------------+---------------------+
| arrow_type | column1 | date_bin |
+-----------------------------+---------------------+---------------------+
| Timestamp(Nanosecond, None) | 2024-01-01T00:00:01 | 2024-01-01T00:00:00 | <-- bins are as desired
| Timestamp(Nanosecond, None) | 2024-02-01T00:00:01 | 2024-02-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-03-01T00:00:01 | 2024-03-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-04-01T00:00:01 | 2024-04-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-05-01T00:00:01 | 2024-05-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-06-01T00:00:01 | 2024-06-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-07-01T00:00:01 | 2024-07-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-08-01T00:00:01 | 2024-08-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-09-01T00:00:01 | 2024-09-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-10-01T00:00:01 | 2024-10-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-11-01T00:00:01 | 2024-11-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-12-01T00:00:01 | 2024-12-01T00:00:00 |
+-----------------------------+---------------------+---------------------+
12 row(s) fetched.
Elapsed 0.003 seconds.
Given the statement in the description, here is the best I can come up with using arrow_cast
-- Times in brussels
WITH t_brussels
AS (
SELECT
column1 AT TIME ZONE 'Europe/Brussels' as ts -- timestamp in specified timezone
FROM t_utc
)
SELECT
ts as "time in Brussels",
date_bin(
interval '1 day',
arrow_cast(ts, 'Timestamp(Nanosecond, None)'),
'2020-01-01T00:00:00Z'::timestamp
) as date_bin
FROM
t_brussels;
+---------------------------+---------------------+
| time in Brussels | date_bin |
+---------------------------+---------------------+
| 2024-04-30T23:30:00+02:00 | 2024-04-30T00:00:00 |
| 2024-05-01T00:30:00+02:00 | 2024-04-30T00:00:00 | <-- this is in the wrong bin
| 2024-05-01T01:30:00+02:00 | 2024-04-30T00:00:00 | <-- this in in the wrong bin
| 2024-05-01T02:00:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T02:30:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T12:30:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T22:30:00+02:00 | 2024-05-01T00:00:00 |
+---------------------------+---------------------+
7 row(s) fetched.
Elapsed 0.004 seconds.
@mhilton and I agree that if we had the functionality suggested by @Abdullahsab3's on https://github.com/apache/datafusion/issues/10368#issue-2277903243
given a UTC timestamp, I would like to have that timestamp in local time of a given timezone
We think we could get the right value out of date_bin
.
For example, If we had a function like remove_timezone
that behaved like
select remove_timezone("time in Brussels")
+---------------------------+---------------------+
| time in Brussels | remove_timezone |
+---------------------------+---------------------+
| 2024-04-30T23:30:00+02:00 | 2024-04-30T23:30:00 |
| 2024-05-01T00:30:00+02:00 | 2024-05-01T00:30:00 | <-- timezone removed (no +02:00, but value not adjusted)
| 2024-05-01T01:30:00+02:00 | 2024-05-01T01:30:00 |
| 2024-05-01T02:00:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T02:30:00+02:00 | 2024-05-01T02:30:00 |
| 2024-05-01T12:30:00+02:00 | 2024-05-01T12:30:00 |
| 2024-05-01T22:30:00+02:00 | 2024-05-01T22:30:00 |
+---------------------------+---------------------+
Note that this is different than arrow_cast("time in Brussels", 'Timestamp(Nanosecond, None)')
because arrow_cast
will adjust the timestamp values
-- Times in brussels
WITH t_brussels
AS (
SELECT
column1 AT TIME ZONE 'Europe/Brussels' as ts -- timestamp in specified timezone
FROM t_utc
)
SELECT
ts as "time in Brussels",
arrow_cast(ts, 'Timestamp(Nanosecond, None)') as arrow_cast
FROM
t_brussels;
+---------------------------+---------------------+
| time in Brussels | arrow_cast |
+---------------------------+---------------------+
| 2024-04-30T23:30:00+02:00 | 2024-04-30T21:30:00 |
| 2024-05-01T00:30:00+02:00 | 2024-04-30T22:30:00 | <-- note this is now in 2024-04-30 not 2024-05-01
| 2024-05-01T01:30:00+02:00 | 2024-04-30T23:30:00 |
| 2024-05-01T02:00:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T02:30:00+02:00 | 2024-05-01T00:30:00 |
| 2024-05-01T12:30:00+02:00 | 2024-05-01T10:30:00 |
| 2024-05-01T22:30:00+02:00 | 2024-05-01T20:30:00 |
+---------------------------+---------------------+
7 row(s) fetched.
Elapsed 0.003 seconds.
My suggested next steps for this ticket:
date_bin
strip_timezone
function, and/or update the arrow_cast
kernel, and/or add syntax like WITHOUT TIME ZONE
) I'd like to work on this issue 🙋♀️
The challenge with an approach that relies on stripping the timezone, is you end up back with the ambiguous timestamp issue. The UX of relying on users to strip the timezone is also rather unfortunate. I'm curious as to why we wouldn't want to just make date_bin timezone aware? This would appear to have a lot less potential edge cases
I'm curious as to why we wouldn't want to just make date_bin timezone aware? This would appear to have a lot less potential edge cases
I believe @mhilton had thought about this and maybe has some thoughts
I'm curious as to why we wouldn't want to just make date_bin timezone aware?
I'm not sure that it's obvious how a time-zone aware date_bin would behave. For example On the day in october when many time zones repeat an hour would the two hours count into a single bin? What about if you had hour long bins that start 30 minutes into the hour? Given that the two hours are easily distinguishable the answer is probably not, and if you really want that behviour then stripping time zone information would be a reasonable way to get it.
If we restricted timezone awareness to intervals of unit days then that would be much more reasonable. We could make it so that date_bin understands the number of hours in any given day in a time zone and bins a time accordingly. Anyone needing a strict 24 hour binning could specify the interval in hour units.
zones repeat an hour would the two hours count into a single bin
This would be the intuitive thing for it to do IMO, if people don't want this, they should cast to a timezone that doesn't have DST
To make date_bin
timezone aware, there are some edge cases we need to consider when design it:
date_bin
with 10 min interval, how to handle the returned data? Aggregate them? Return two sets of data?'2021-03-28T02:30:00' AT TIME ZONE 'Europe/Brussels'
does not exist
- Fall Back: When the clocks move backward, there is an "extra" hour. For example, in US central time zone, when DST ends at 2:00 AM, the clocks are set back to 1:00 AM. This means that there are two 1:20 AM in a day. If a user do a
date_bin
with 10 min interval, how to handle the returned data? Aggregate them? Return two sets of data?
I want to highlight this open-ended question and hear how others think
Thanks for filing the ticket and for all the detailed explanations! very enriching
I wonder whether the Postgres behavior is actually that bad. Though it looks weird, it still is generic enough to make it widely applicable. The issue with making date_bin
specifically timezone-aware in my opinion is the fact that future or similar grouping functions (or rolling windows functions) will also have to be implemented in a timezone-aware way, accounting for similar pitfalls and edge cases. The same problem will also occur in other functionalities; for example if you would like to return local time. I personally think that the problem might be better tackled as an additional time functionality, which may be the same way that Postgres does it.
The postgres way of converting UTC to local time of a given timezone is:
select '2024-05-21T12:00:00Z'::timestamp AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels';
timezone
--
2024-05-21 14:00:00
As already mentioned by Andrew, the AT TIME ZONE
operator in postgres converts a timezone-aware timestamp to local time (with no offset or timezone information), and local time to a timezone-aware timestamp. Though the overloaded functionalities of the AT TIME ZONE
operator in Postgres are weird, they are definitely usable in my opinion. I also like the idea of having a separate to_local_time
/strip_timezone
function, though this would break Datafusions compatibility with Postgres.
Thank you @tustvold and @Abdullahsab3 and @mhilton and @appletreeisyellow for the thoughts.
From my perspective, the current (non timezone aware) date_bin
function has the benefit of being (relatively) simple to implement and fast (as it is some arithmetic calculation on the underlying integer value without having to do DST conversions per row)
Given the differences in underlying timestamp representation between arrow and postgres I do think some difference is likely inevitable and thus likely not a deal breaker.
Here are my suggested next steps @appletreeisyellow tries to prototype one or both proposals and see if we can get it to produce the desired results:
to_local_time
function date_bin
function to make it timezone awareI think the to_local_time
might be the simplest approach.
Signature
that takes Timestamp(.., *)
Timestamp(.., None)
. invoke
function would do the oppositte of whatever cast(Timestamp(.., None) --> Timestamp(.., TZ))
doesI was looking into this issue again. I am still validating this, but I found that using to_char(time AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '%F %X')
also results in correct local time. (see https://docs.rs/chrono/latest/chrono/format/strftime/index.html)
but I found that using
to_char(time AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '%F %X')
also results in correct local time.
@Abdullahsab3 Nice find! Seems like using to_char()
matches what you described in https://github.com/apache/datafusion/issues/10368 for to_local_time()
. I'm curious if you are satisfied with to_char()
solution? Or is there anything that to_char()
cannot achieve?
I had been prototyping what @alamb suggested above since yesterday. I was able to do step 1. Note that the to_local_time
function here is different from the one suggested in https://github.com/apache/datafusion/issues/10368. I was about to do step 2, and saw your message and want to check if we need to re-evaluate this issue
I am able to verified that to_char()
works in date_bin()
on timestamp with timezone and daylight saving time 🎉
select
arrow_typeof(column1) as arrow_type,
column1,
date_bin(interval '1 day', to_char(column1 AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '%F %X')) as "date_bin"
from t_timezone;
+------------------------------------------------+---------------------------+---------------------+
| arrow_type | column1 | date_bin |
+------------------------------------------------+---------------------------+---------------------+
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-01-01T00:00:01+01:00 | 2024-01-01T00:00:00 | <-- binned correctly
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-02-01T00:00:01+01:00 | 2024-02-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-03-01T00:00:01+01:00 | 2024-03-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-04-01T00:00:01+02:00 | 2024-04-01T00:00:00 | <-- binned correctly on daylight savings time
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-05-01T00:00:01+02:00 | 2024-05-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-06-01T00:00:01+02:00 | 2024-06-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-07-01T00:00:01+02:00 | 2024-07-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-08-01T00:00:01+02:00 | 2024-08-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-09-01T00:00:01+02:00 | 2024-09-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-10-01T00:00:01+02:00 | 2024-10-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-11-01T00:00:01+01:00 | 2024-11-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-12-01T00:00:01+01:00 | 2024-12-01T00:00:00 |
+------------------------------------------------+---------------------------+---------------------+
12 row(s) fetched.
Elapsed 0.024 seconds.
select
column1 as "time in UTC",
column1 AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels' as "time in Brussels",
date_bin(interval '1 day', to_char(column1 AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '%F %X')) as "date_bin"
from t_utc;
+----------------------+---------------------------+---------------------+
| time in UTC | time in Brussels | date_bin |
+----------------------+---------------------------+---------------------+
| 2024-04-30T21:30:00Z | 2024-04-30T23:30:00+02:00 | 2024-04-30T00:00:00 |
| 2024-04-30T22:30:00Z | 2024-05-01T00:30:00+02:00 | 2024-05-01T00:00:00 | <-- binned correctly
| 2024-04-30T23:30:00Z | 2024-05-01T01:30:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T00:00:00Z | 2024-05-01T02:00:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T00:30:00Z | 2024-05-01T02:30:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T10:30:00Z | 2024-05-01T12:30:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T20:30:00Z | 2024-05-01T22:30:00+02:00 | 2024-05-01T00:00:00 |
+----------------------+---------------------------+---------------------+
7 row(s) fetched.
Elapsed 0.023 seconds.
Very nice! Thanks for verifying @appletreeisyellow. I was able to do some minimal validations today on our systems and it seems indeed to be binning correctly. Happy to see this as an alternative to what I was doing as a fix for this issue
My main concerns with the use of to_char
are:
to_char
will parse the given timestamp in the provided format, and return a string, which will be coerced to a (localtime) timestamp (which I assume includes another parsing step).%F %X
will always produce the correct localtime. My concern is mostly the %F part@appletreeisyellow and I just had a chat about what to do about this ticket. It is great that we seem to have found a workaround there are at least two potential issues we see with the workaround:
We think it would be great to improve DataFusion so future users (both of InfluxDB and other systems) who use date_bin
have fast queries and are not confused.
We think the ideal user experience would be as @tustvold suggests for date_bin
to be timezone aware and just "do the right thing and bin dates respecting the timezone of the timestamp".
While this might be different than what postgres does, the underlying treatment if timestamps is already different, so I don't think we need to be quirk for quirk compatible.
Thus we propose @appletreeisyellow will try and complete her prototype work to make date_bin
do the intuitive thing with timestamps with timezones
It is likely we can't devote a large amount of time to this project, so if the implementation is taking too long we may have to abandon this effort or go with a less optimal route (e.g. a function to strip off timezones like Postgres)
I filed https://github.com/apache/arrow-rs/pull/5826 to document more precisely what is going on with casting
I wrote up a high-level doc on how to implement the date_bin
with timezone awareness. If you cannot access the document, feel free to request access and I'll give you the permission to view/comment.
Another thing that needs to be taken into consideration: I think the returned timestamp from date_bin
needs to be in the timezone that was provided as an argument (in the internal Arrow represenation). For example:
The returned timestamp is in correct local time, but the underlying Arrow kernel has no timezone information about it (which I think means that the timestamp will be interpreted as being in UTC)
❯ select to_char('2024-04-30T02:00:00' at time zone 'UTC' at time zone 'Europe/Brussels', '%F %X')::timestamp;
+----------------------------------------------------+
| to_char(Utf8("2024-04-30T02:00:00"),Utf8("%F %X")) |
+----------------------------------------------------+
| 2024-04-30T04:00:00 |
+----------------------------------------------------+
1 row in set. Query took 0.002 seconds.
❯ select arrow_typeof(to_char('2024-04-30T02:00:00' at time zone 'UTC' at time zone 'Europe/Brussels', '%F %X')::timestamp);
+------------------------------------------------------------------+
| arrow_typeof(to_char(Utf8("2024-04-30T02:00:00"),Utf8("%F %X"))) |
+------------------------------------------------------------------+
| Timestamp(Nanosecond, None) |
+------------------------------------------------------------------+
1 row in set. Query took 0.003 seconds.
When we add the timezone information:
❯ select to_char('2024-04-30T02:00:00' at time zone 'UTC' at time zone 'Europe/Brussels', '%F %X') at time zone 'Europe/Brussels';
+----------------------------------------------------+
| to_char(Utf8("2024-04-30T02:00:00"),Utf8("%F %X")) |
+----------------------------------------------------+
| 2024-04-30T04:00:00+02:00 |
+----------------------------------------------------+
1 row in set. Query took 0.001 seconds.
❯ select arrow_typeof(to_char('2024-04-30T02:00:00' at time zone 'UTC' at time zone 'Europe/Brussels', '%F %X') at time zone 'Europe/Brussels');
+------------------------------------------------------------------+
| arrow_typeof(to_char(Utf8("2024-04-30T02:00:00"),Utf8("%F %X"))) |
+------------------------------------------------------------------+
| Timestamp(Nanosecond, Some("Europe/Brussels")) |
+------------------------------------------------------------------+
1 row in set. Query took 0.001 seconds.
The timezone information there is important, as the timestamp will be otherwise interpreted by external integrations (such as Grafana) as being in UTC.
Note that applying at time zone 'Europe/Brussels'
on the results of to_char
will never fail during daylight savings hours, as the returned value from to_char
is always in localtime, meaning that the 'missing/additional hour' should not be an issue for the at time zone
operator.
TLDR; if you're using the to_char
workaround now for localisation, you should add the timezone information to the results
I think the returned timestamp from
date_bin
needs to be in the timezone that was provided as an argument (in the internal Arrow represenation)
Thanks for highlighting it. Yes, this is the expected behavior. The timezone representation of the returned timestamp will be the same as the input timestamp. For example:
None
:
select arrow_typeof(date_bin(interval '1 day', '2024-03-30T00:00:20Z'));
+---------------------------------------------------------------------------------------------------+
| arrow_typeof(date_bin(IntervalMonthDayNano("18446744073709551616"),Utf8("2024-03-30T00:00:20Z"))) |
+---------------------------------------------------------------------------------------------------+
| Timestamp(Nanosecond, None) |
+---------------------------------------------------------------------------------------------------+
select arrow_typeof(date_bin(interval '1 day', '2024-03-30T00:00:20')); +--------------------------------------------------------------------------------------------------+ | arrow_typeof(date_bin(IntervalMonthDayNano("18446744073709551616"),Utf8("2024-03-30T00:00:20"))) | +--------------------------------------------------------------------------------------------------+ | Timestamp(Nanosecond, None) | +--------------------------------------------------------------------------------------------------+
- if the input has a timezone in `UTC`, then the date_bin returns timezone as `Some('UTC')`
```sql
select arrow_typeof(date_bin(interval '1 day', '2024-03-30T00:00:20Z' AT TIME ZONE 'UTC'));
+---------------------------------------------------------------------------------------------------+
| arrow_typeof(date_bin(IntervalMonthDayNano("18446744073709551616"),Utf8("2024-03-30T00:00:20Z"))) |
+---------------------------------------------------------------------------------------------------+
| Timestamp(Nanosecond, Some("UTC")) |
+---------------------------------------------------------------------------------------------------+
Europe/Brussels
, then the date_bin returns timezone as Some('Europe/Brussels')
select arrow_typeof(date_bin(interval '1 day', '2024-03-30T00:00:20Z' AT TIME ZONE 'Europe/Brussels')); +---------------------------------------------------------------------------------------------------+ | arrow_typeof(date_bin(IntervalMonthDayNano("18446744073709551616"),Utf8("2024-03-30T00:00:20Z"))) | +---------------------------------------------------------------------------------------------------+ | Timestamp(Nanosecond, Some("Europe/Brussels")) | +---------------------------------------------------------------------------------------------------+
I probably missed some prior discussion. Do we want date_bin
syntax?
Can we support more standard date_trunc('day', timestamp_tz)
?
I think we already have date_trunc docs: https://datafusion.apache.org/user-guide/sql/scalar_functions.html#date-trunc code: https://github.com/apache/datafusion/blob/dce77db316beb5bf5dc2326c350b2d642edbfed8/datafusion/functions/src/datetime/date_trunc.rs#L361
I think the difference between date_trunc
and date_bin
is that date_bin can bin to dates that are not aligned with the start/stop of the interval (e.g. you could bin days at noon rather than at midnight)
However it is a good question if date_trunc
would be a better way to get the desired behavior
Update: @appletreeisyellow merged an workaround -- to_local_time
in https://github.com/apache/datafusion/pull/11401
She has no further current concrete plans to work on this ticket at this time
FYI - I came across the Jiff crate the other day that looks to have some nice support for tz aware calendar handling. From the benchmarks I've run it's not quite as quick as chrono for some things and about equivalent to others. I would recommend anyone who wants to tackle this issue to have a close look at it to see if it solves some of the intricacies around tz conversion and DST arithmetic and rounding.
Author of Jiff here. I don't have a ton of context on the specific problem in this issue, but I'd be happy to field questions about DST safe arithmetic. Jiff in particular is heavily inspired by Temporal, and Temporal (and Jiff) both intend to match RFC 5545 for duration handling.
(It's still early days and I think there is a fair bit of room for improvement for Jiff in terms of perf. Both in micro-optimizing its existing code and in adding new APIs to facilitate faster ops when the use case demands it.)
Is your feature request related to a problem or challenge?
Summary
Given this data in UTC timestamps
How to create the data
```sql -- Dates right around the first of May create or replace table t AS VALUES ('2024-04-30T21:30:00'), ('2024-04-30T22:30:00'), ('2024-04-30T23:30:00'), ('2024-05-01T00:00:00'), ('2024-05-01T00:30:00'), ('2024-05-01T10:30:00'), ('2024-05-01T20:30:00') ; create or replace view t_utc as select column1::timestamp AT TIME ZONE 'UTC' as "column1" from t; ```
I would like to be able to get this output (bin the data by their day in Brussels, properly accounting for daylight savings time).
We can't figure out how to do this at the moment
Details
Broken out from @Abdullahsab3's great ticket https://github.com/apache/datafusion/issues/10368
We would like to apply date_bin and have it bin the dates in local time, properly including daylight savings time calculations
For example, given the following data at 12:01AM in the morning on the first day of each month,I would like to bin it into the first day of each month.
This works as expected when there is no timestamp (all timestamps end up in the correct day)
However, once the timestamp is in a timezone it has problems:
The timestamps now binned at incorrectly (the are binned based on UTC midnight, not midnight in Brussels) so for example a time in
2024-01-01
goes into a2023-12-31
bucket:The third argument to
date_bin
allows changing the starting offset for a day, so let's set it to midnight in Europe/Brussels time. This almost works but does not take into account daylight savings time so the timestamps are binned incorrectly during DST:Describe the solution you'd like
Describe alternatives you've considered
No response
Additional context
No response