Open jeffreyssmith2nd opened 3 months ago
I'd be happy to take this if it's not already being worked on.
take
i'd expect ::timestamp
/ CAST(x AS timestamp)
to cast to a timestamp without time zone type.
Is this what None
zone indicates?
@findepi It appears so
The SQL spec compliant behavior when casting timestamp with time zone
to timestamp
is to strip time zone information and retain year/month/day/hour/minute/second fields.
You can observe this behavior in Trino
trino> SELECT x "twtz", CAST(x AS timestamp) AS "t" FROM (VALUES (CAST('2024-08-31 11:47:58.977899 Europe/Warsaw' AS timestamp with time zone))) t(x);
twtz | t
---------------------------------------+-------------------------
2024-08-31 11:47:58.978 Europe/Warsaw | 2024-08-31 11:47:58.978
(1 row)
In a sense PostgreSQL's timestamp with time zone
to timestamp
cast follows the spec because PostgreSQL's timestamp with time zone
value doesn't include the time zone information. It's "a point in time in UTC.
postgres=# SELECT x "twtz", CAST(x AS timestamp) AS "t" FROM (VALUES (CAST('2024-08-31 11:47:58.977899 Europe/Warsaw' AS timestamp with time zone))) t(x);
twtz | t
-------------------------------+----------------------------
2024-08-31 09:47:58.977899+00 | 2024-08-31 09:47:58.977899
Now, DF behavior when casting timestamp with time zone
to timestamp
seems to be to convert timestamp with time zone to UTC (retaining point in time) and then strip zone
> SELECT x "twtz", CAST(x AS timestamp) AS "t" FROM (VALUES (CAST('2024-08-31 11:47:58.977899 Europe/Warsaw' AS timestamp with time zone))) t(x);
+----------------------------------+----------------------------+
| twtz | t |
+----------------------------------+----------------------------+
| 2024-08-31T11:47:58.977899+02:00 | 2024-08-31T09:47:58.977899 |
+----------------------------------+----------------------------+
IMO we should change DF cast from timestamp with time zone
to timestamp
so that it retains year/month/day/hour/minute/second fields and strips the zone information.
cc @alamb
https://github.com/apache/arrow-rs/issues/5827 seems related.
So would the expected outcome be that timestamp
will always have UTC time (which it seems to already be currently doing)?
I wouldn't say that timestamp has UTC time. timestamp is local date/time. You can think of it as a struct with fields year/month/day/hour/minute/seconds(fractional) and it has no relationship with UTC. However, when doing math on timestamp values using a date/time library it's often to represent it as UTC-based.
I wouldn't say that timestamp has UTC time. timestamp is local date/time. You can think of it as a struct with fields year/month/day/hour/minute/seconds(fractional) and it has no relationship with UTC. However, when doing math on timestamp values using a date/time library it's often to represent it as UTC-based.
Makes sense, so the expectation is for the date-time calculation to be correct for that TZ but no offset?
So for the example above:
> select (now() AT TIME ZONE 'America/Denver')::timestamp;
+----------------------------+
| now() |
+----------------------------+
| 2024-08-28T17:08:20.042920 |
+----------------------------+
Should actually be:
> select (now() AT TIME ZONE 'America/Denver')::timestamp;
+----------------------------+
| now() |
+----------------------------+
| 2024-08-28T11:08:13.578025 |
+----------------------------+
@devanbenz Agreed
if now() AT TIME ZONE 'America/Denver'
is 2024-08-28T11:08:13.578025-06:00
(or 2024-08-28 11:08:13.578025 America/Denver
)
then (now() AT TIME ZONE 'America/Denver')::timestamp
should be 2024-08-28[T]11:08:13.578025
Okay sounds good @findepi -- I do see: https://github.com/apache/arrow-rs/pull/5831/files which appears to be relevant. I think this change would be better suited for arrow vs. datafusion I suppose?
@devanbenz i think so, yes
@findepi @alamb It actually appears that datafusion (and in regards apache arrow) seems to be casting as expected afaik. It looks like the now()
udf in datafusion is inconsistent with how other engines do it. Taking a look at 2 other systems Postgres
and Clickhouse
I'm seeing that when calling now()
without any adjustments it returns the local TZ. It appears that datafusion is returning Zulu time. Not sure if this is by design. But I'm recalling a thread: https://github.com/apache/datafusion/issues/12357#issuecomment-2334108527
It seems to me we also haven't documented anywhere the "the built in SQL dialect tries to follow postgresql semantics when possible"
Should built in functions be following PG as well?
Postgres
postgres=# SELECT now();
now
-------------------------------
2024-09-10 11:22:41.895841-05
(1 row)
postgres=# SELECT now()::timestamp;
now
----------------------------
2024-09-10 11:22:47.294941
(1 row)
postgres=# SELECT now()::timestamptz;
now
-------------------------------
2024-09-10 11:22:52.118688-05
(1 row)
Clickhouse
:) SELECT now();
SELECT now()
Query id: 13a227b8-f867-49ad-9941-8d8bb70b9c92
┌───────────────now()─┐
1. │ 2024-09-10 11:25:49 │
└─────────────────────┘
1 row in set. Elapsed: 0.005 sec.
:) SELECT now()::timestamp;
SELECT CAST(now(), 'timestamp')
Query id: adb042b8-7ffb-4e98-bbb4-21affade7201
┌─CAST(now(), 'timestamp')─┐
1. │ 2024-09-10 11:25:56 │
└──────────────────────────┘
1 row in set. Elapsed: 0.002 sec.
Datafusion
> SELECT now();
+-----------------------------+
| now() |
+-----------------------------+
| 2024-09-10T16:26:39.928247Z |
+-----------------------------+
1 row(s) fetched.
Elapsed 0.007 seconds.
> SELECT now()::timestamp;
+----------------------------+
| now() |
+----------------------------+
| 2024-09-10T16:26:44.623198 |
+----------------------------+
1 row(s) fetched.
Elapsed 0.010 seconds.
> SELECT now()::timestamptz;
+-----------------------------+
| now() |
+-----------------------------+
| 2024-09-10T16:26:52.911837Z |
+-----------------------------+
1 row(s) fetched.
Elapsed 0.006 seconds.
Note: All queries were ran at around 11:30 AM CST. The PG and CH queries are using local TZ where as DF does not and uses Zulu time instead. Clickhouse does not support timestamptz hence why it was not in the query outputs given in the examples.
Taking a look at a similar query with Clickhouse
I see that casting to a timestamp
will strip the timezone information -- since they are using Apache Arrow the functionality should be the same afaik:
:) SELECT toTimeZone(now(), 'America/Denver');
SELECT toTimeZone(now(), 'America/Denver')
Query id: 9a79d47a-8551-4199-8b1a-ae9682259cbc
┌─toTimeZone(now(), 'America/Denver')─┐
1. │ 2024-09-10 10:45:14 │
└─────────────────────────────────────┘
1 row in set. Elapsed: 0.007 sec.
:) SELECT toTimeZone(now(), 'America/Denver')::timestamp;
SELECT CAST(toTimeZone(now(), 'America/Denver'), 'timestamp')
Query id: b4ab2ec2-e361-4d83-8235-2ec9e95a43b5
┌─CAST(toTimeZone(now(), 'America/Denver'), 'timestamp')─┐
1. │ 2024-09-10 11:45:19 │
└────────────────────────────────────────────────────────┘
1 row in set. Elapsed: 0.003 sec.
But instead of stripping it and making it Zulu time it just sets it back to whatever the now()
call would do which is what Datafusion currently is doing... We are just using Zulu instead of the local TZ.
@alamb I think that we shouldn't modify the current casting since it appears to be functioning as expected per the arrow spec https://github.com/apache/arrow-rs/issues/5827#issuecomment-2206798023 (at least for this bug specifically) but we should change the way that now()
works to have it function as expected and falls in line with the postgres spec 🤔
Just for the sake of clarity, my original description of the behaviour holds with times other than now()
.
> select arrow_typeof(('2020-01-01T00:00:00Z' AT TIME ZONE 'America/New_York'));
+-------------------------------------------------+
| arrow_typeof(Utf8("2020-01-01T00:00:00Z")) |
+-------------------------------------------------+
| Timestamp(Nanosecond, Some("America/New_York")) |
+-------------------------------------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.
> select arrow_typeof(('2020-01-01T00:00:00Z' AT TIME ZONE 'America/New_York')::timestamp);
+--------------------------------------------+
| arrow_typeof(Utf8("2020-01-01T00:00:00Z")) |
+--------------------------------------------+
| Timestamp(Nanosecond, None) |
+--------------------------------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.
That being said, if this is the expected behaviour of other SQL dialects then it seems reasonable to me that it would behave the same way in DataFusion as well.
Just for the sake of clarity, my original description of the behaviour holds with times other than
now()
.> select arrow_typeof(('2020-01-01T00:00:00Z' AT TIME ZONE 'America/New_York')); +-------------------------------------------------+ | arrow_typeof(Utf8("2020-01-01T00:00:00Z")) | +-------------------------------------------------+ | Timestamp(Nanosecond, Some("America/New_York")) | +-------------------------------------------------+ 1 row(s) fetched. Elapsed 0.002 seconds. > select arrow_typeof(('2020-01-01T00:00:00Z' AT TIME ZONE 'America/New_York')::timestamp); +--------------------------------------------+ | arrow_typeof(Utf8("2020-01-01T00:00:00Z")) | +--------------------------------------------+ | Timestamp(Nanosecond, None) | +--------------------------------------------+ 1 row(s) fetched. Elapsed 0.002 seconds.
That being said, if this is the expected behaviour of other SQL dialects then it seems reasonable to me that it would be behave the same way in DataFusion as well.
Ah yeah just testing something similar out in Clickhouse
I'm seeing that it strips the timezone information, which is in line with how arrow currently works.
:) SELECT toTimeZone('2024-09-10 11:45:19'::timestamp, 'America/Denver');
SELECT toTimeZone(CAST('2024-09-10 11:45:19', 'timestamp'), 'America/Denver')
Query id: 1ae2e612-51cd-47e1-9cbb-81acb5fae491
┌─toTimeZone(CAST('2024-09-10 11:45:19', 'timestamp'), 'America/Denver')─┐
1. │ 2024-09-10 10:45:19 │
└────────────────────────────────────────────────────────────────────────┘
1 row in set. Elapsed: 0.002 sec.
:) SELECT toTimeZone('2024-09-10 11:45:19'::timestamp, 'America/Denver')::timestamp;
SELECT CAST(toTimeZone(CAST('2024-09-10 11:45:19', 'timestamp'), 'America/Denver'), 'timestamp')
Query id: 2463bbee-363b-4339-91b2-4c9055e064ce
┌─CAST(toTimeZone(CAST('2024-09-10 11:45:19', 'timestamp'), 'America/Denver'), 'timestamp')─┐
1. │ 2024-09-10 11:45:19 │
└───────────────────────────────────────────────────────────────────────────────────────────┘
1 row in set. Elapsed: 0.002 sec.
But taking a look at postgres that is not the case as you had mentioned above:
postgres=# SELECT ('2024-09-10 11:45:19' AT TIME ZONE 'America/Denver');
timezone
---------------------
2024-09-10 10:45:19
(1 row)
postgres=# SELECT ('2024-09-10 11:45:19' AT TIME ZONE 'America/Denver')::timestamp;
timezone
---------------------
2024-09-10 10:45:19
(1 row)
Taking a look at MariaDB I'm seeing a similar output:
MariaDB [(none)]> SELECT TIMESTAMP(CONVERT_TZ('2024-09-10 11:45:19', '-5:00', '-6:00'));
+----------------------------------------------------------------+
| TIMESTAMP(CONVERT_TZ('2024-09-10 11:45:19', '-5:00', '-6:00')) |
+----------------------------------------------------------------+
| 2024-09-10 10:45:19 |
+----------------------------------------------------------------+
1 row in set (0.001 sec)
I think the now()
function should probably be changed as users would most likely expect it to function similar to other engines but as for the current casting methodology I'm going to update https://github.com/apache/arrow-rs/issues/5827 with a proposal thats more in line with how established SQL database systems currently operate with timestamp -> timezone update -> timestamp
conversion. As well as take a look at https://github.com/apache/arrow/issues?q=sort%3Aupdated-desc+is%3Aissue+is%3Aopen to see if there are any issues related to timezone casting in the original arrow spec. Its strange that there appears to be a discrepancy with casting to timestamp in which Arrow is stripping timezones vs how more "traditional" RDBMs' are not. Seeing as Clickhouse uses the original Apache Arrow format vs Datafusion using arrow-rs it does appear to fall in line with the spec.
Continuation from this thread
The TL;DR is that we change nothing since we are in line with the current Apache Arrow expectations and datafusion functions similar to a different system that uses Arrow--Clickhouse it is consistent. This proposal is under the assumption that we desire to be completely in line with the current Apache Arrow specification though.
This investigation did lead me to two other findings in which I think should be action'ed on. They are not arrow-rs
related but Datafusion related.
now()
and the default timestamp value should function similar to how other systems do it. Unless specified it should be using the local system timestamp. Not sure if this is desired but it appears to be how other systems work.
desc (SELECT toTimeZone('2024-09-10 11:45:19'::timestamp, 'America/Denver')::timestamp a);
┌─name─┬─type─────┬
│ a │ DateTime │
└──────┴──────────┴
desc (SELECT toTimeZone('2024-09-10 11:45:19'::timestamp, 'America/Denver')::timestamp('America/Los_Angeles') a); ┌─name─┬─type────────────────────────────┬ │ a │ DateTime('America/Los_Angeles') │ └──────┴─────────────────────────────────┴
*I believe the functionality is there in the code since I see that the `Timestamp` datatype can have an input parameter of `Option(Timezone)` but I do not think the SQL parsing functionality operates as expected at this moment in time.*
Datafusion needs to implement something similar to the following in Clickhouse for casting to timestamp with a specified timezone:
I think you can do it like
select now() AT TIME ZONE 'America/Denver';
Though it would be awesome if that was documented (it doesn't appear to be in the SQL reference): https://datafusion.apache.org/search.html?q=AT+TIMEZONE
Datafusion needs to implement something similar to the following in Clickhouse for casting to timestamp with a specified timezone:
I think you can do it like
select now() AT TIME ZONE 'America/Denver';
Though it would be awesome if that was documented (it doesn't appear to be in the SQL reference): datafusion.apache.org/search.html?q=AT+TIMEZONE
Taking a look at this it appears that casting to a timezone does work in datafusion but it doesn't function to how I would expect (or maybe it needs to be a different type of cast similar to clickhouse syntax?):
Clickhouse:
:) SELECT toTimeZone(now(), 'America/Denver')::timestamp;
SELECT CAST(toTimeZone(now(), 'America/Denver'), 'timestamp')
Query id: 2f766f9f-9723-41dc-9a01-901421793e65
┌─CAST(toTimeZone(now(), 'America/Denver'), 'timestamp')─┐
1. │ 2024-09-13 16:00:10 │
└────────────────────────────────────────────────────────┘
1 row in set. Elapsed: 0.002 sec.
:) SELECT toTimeZone(now(), 'America/Denver')::timestamp('America/Denver');
SELECT CAST(toTimeZone(now(), 'America/Denver'), 'timestamp(\'America/Denver\')')
Query id: db0d0937-89b8-4ea4-93cd-d07c157733b2
┌─CAST(toTimeZone(now(), 'America/Denver'), 'timestamp(\'America/Denver\')')─┐
1. │ 2024-09-13 15:00:21 │
└────────────────────────────────────────────────────────────────────────────┘
1 row in set. Elapsed: 0.002 sec.
Datafusion:
> select (now() at time zone 'America/Chicago');
+----------------------------------+
| now() |
+----------------------------------+
| 2024-09-13T15:45:08.013132-05:00 |
+----------------------------------+
1 row(s) fetched.
Elapsed 0.007 seconds.
> select (now() at time zone 'America/Chicago')::timestamp;
+----------------------------+
| now() |
+----------------------------+
| 2024-09-13T20:45:16.085603 |
+----------------------------+
1 row(s) fetched.
Elapsed 0.007 seconds.
> select (now() at time zone 'America/Chicago')::timestamp at time zone 'America/Chicago';
+----------------------------------+
| now() |
+----------------------------------+
| 2024-09-13T20:45:28.485804-05:00 |
+----------------------------------+
1 row(s) fetched.
Elapsed 0.008 seconds.
Thats more so what I mean -- that when casting to a timestamp with a given timezone the output should abide by the timezone where it appears that AT TIME ZONE
casts from a timestamp
to a timestamptz
(in postgres terminology).
Unfortunately as it stands the now()
function nor any other UDF function has no access to the timezone of the server (though it could likely be retrieved) nor the default timezone set in ExecutionOptions (datafusion.execution.time_zone
). I would like to fix that at some point but I am fairly certain it would mean a breaking change to UDF's as they are currently implemented.
@Omega359 good point we should be able to provide a UDF with some form of environment state while maintaining backwards-compatibility.
I think you could potentially get the server time from the SimplifyInfo
passed to
https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarUDF.html#method.simplify
(aka you could rewrite now()
into something like now('Americas/New_York')
🤔
I looked into that @alamb but I think what is required is access to ConfigOptions (so we can get tz from config, etc). I don't think SimplifyInfo
provides that. I was toying with some approaches this weekend for this and didn't come up with anything simple. The most obvious option to me is to add an InvokeInfo
that provides access to ExecutionProps
and ConfigOptions
to the arguments for invoke_batch, etc. That of course is a breaking change though. Beyond that I toyed with:
Augmenting SimplyInfo
to have ConfigOptions
could work but it's a lot of effort to rewrite calls just to apply tz info (or anything else that a UDF might want from config) . Seems like applying a jackhammer to insert a screw.
Seems like applying a jackhammer to insert a screw.
Yeah I agree. There are other usecases for storing state / customized in UDFs (like pre-compiling regexps once per query rather than once per batch, for example). I keep trying to find some common API / pattern we can put in and follow
Related discussions
Describe the bug
Using the
::timestamp
(or CAST()) function strips timezone information.To Reproduce
Or as seen with actual values:
Expected behavior
I would expect that using the
::timestamp
cast on a timestamp would not impact its timezone. This is how Postgres behaves.Additional context
No response