apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.45k stars 726 forks source link

Propose change timestamp casting with timezone to without timezones (also parsing of timestamps without timezones) #5827

Open alamb opened 3 months ago

alamb commented 3 months ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do. This is in the context of implementing date_bin for timestamps with timezones: https://github.com/apache/datafusion/issues/10602

I made https://github.com/apache/arrow-rs/pull/5826 to document the behavior of casting timestamps and I found it very confusing. Specifically when you cast from Timestamp(None) to Timestamp(Some(tz)) and then back to Timetamp(None) the underlying timestamp values are changed as shown in this example

use arrow_array::Int64Array;
use arrow_array::types::{TimestampSecondType};
use arrow_cast::{cast, display};
use arrow_array::cast::AsArray;
use arrow_schema::{DataType, TimeUnit};
let data_type  = DataType::Timestamp(TimeUnit::Second, None);
let data_type_tz = DataType::Timestamp(TimeUnit::Second, Some("-05:00".into()));
let a = Int64Array::from(vec![1_000_000_000, 2_000_000_000, 3_000_000_000]);
let b = cast(&a, &data_type).unwrap(); // cast to timestamp without timezone
let b = b.as_primitive::<TimestampSecondType>(); // downcast to result type
assert_eq!(2_000_000_000, b.value(1)); // values are still the same

// Convert timestamps without a timezone to timestamps with a timezone
let c = cast(&b, &data_type_tz).unwrap();
let c = c.as_primitive::<TimestampSecondType>(); // downcast to result type
assert_eq!(2_000_018_000, c.value(1)); // value has been adjusted by offset

// Convert from timestamp with timezone back to timestamp without timezone
let d = cast(&c, &data_type).unwrap();
let d = d.as_primitive::<TimestampSecondType>(); // downcast to result type
assert_eq!(2_000_018_000, d.value(1)); // <---- **** THIS VALUE IS DIFFERENT THAN IT WAS INITITALLY
assert_eq!("2033-05-18T08:33:20", display::array_value_to_string(&d, 1).unwrap());

Thus I wanted to discuss if we should change the behavior to make it less surprising or if there was a reason to leave the current behavior

Describe the solution you'd like

I propose making casting timestamp with a timezone to timestamp without a timezone do the inverse of casting timestamp withpit a timezone to timestamp with a timezone

This would mean the final value of d in the above example is 2_000_000_000, not 2_000_018_000

Describe alternatives you've considered Leave existing behavior

Additional context

tustvold commented 3 months ago

This makes sense to me, the arrow specification doesn't specify either way but I think this would be less surprising

tustvold commented 3 months ago

It turns out I actually flagged this on the original PR that altered the timezone casting logic - https://github.com/apache/arrow-rs/pull/4201#discussion_r1192301482

The argument that convinced me in the end is that when parsing from a string we convert to UTC

let array = StringArray::from(vec!["2010-01-01T00:00:00.123456+08:00"]);
let data_type = DataType::Timestamp(TimeUnit::Nanosecond, None);
let cast = cast(&array, &data_type).unwrap();
let value = cast.as_primitive::<TimestampNanosecondType>().value_as_datetime(0).unwrap();
assert_eq!(value.to_string(), "2009-12-31 16:00:00.123456");

But perhaps we want to change this also?

alamb commented 3 months ago

The argument that convinced me in the end is that when parsing from a string we convert to UTC

I don't quite follow this example.

Given 2010-01-01T00:00:00.123456+08:00 as input, I (naively) would expect that when I cast that to a timestamp without timezone it would look like 2010-01-01T00:00:00.123456, (same thing, no timezone)

It seems in your example that it is cast to 2009-12-31 16:00:00.123456 (likely because the underlying timestamp value was not changed)

Thus this seems like the same behavior and that it would be changed with the proposal.

But I may be missing something

tustvold commented 3 months ago

Right so you're also advocating changing the parsing behaviour, i.e. casting from strings to timestamps, and not just between timestamps.

I agree these should be kept consistent, hence why I raised it

alamb commented 2 months ago

I posted this proposal to the dev list https://lists.apache.org/thread/b6hhsthy9pqhwmjjkox2lbt4qz9zvlvw and on slack/discord to raise awareness

joellubi commented 2 months ago

Thanks for raising this. I think part of the issue is that timestamps without a timezone specifically omit semantics that could be used to convert to/from a specific instant. Given these semantics are lost on conversion, I wouldn't expect the values to roundtrip unless some additional information provided when going from tz-naive to tz-aware.

When casting from tz-aware to tz-naive timestamps, absent any other information, it makes sense to me that physical values are unchanged and the tz is stripped. That's what it currently does. Going the other way leaves more room for interpretation as pointed out by the docs.

Currently casting the other way converts physical values to be UTC-normalized. Perhaps this behavior would be better reserved for a dedicated "conversion" function that takes some additional parameter(s) to specify semantics. Then casting could be treated as a much simpler operation that just adds/strips the specified tz without changing physical values. That would allow your example to roundtrip.

findepi commented 2 months ago

I propose making casting timestamp with a timezone to timestamp without a timezone do the inverse of casting timestamp withpit a timezone to timestamp with a timezone

in SQL specification

So in standard SQL, one cast is not exactly inverse of the other.

westonpace commented 2 months ago

I agree with @alamb 's proposal.

I think the question is what the value value=1262332800, unit=seconds, tz="america/los_angeles" represents?

Personally, I think option 1 is the most rational choice. The value represents the UTC instant (always UTC, only one choice) and the tz tells you how functions should interpret the value (e.g. how to convert it to a string if that is desired).

Option 2 is strange. The value represents "a wall clock time" and the tz tells you the "timezone at which the wall clock time should be interpreted". The way a wall clock time is stored is "a timestamp (there may be multiple) where a wall clock in the UTC time zone would show the desired wall clock time".

If we pick option 1 then the only thing casting does is change how we want to interpret the value in cases where the time zone matters:

The value should never change when casting.

tustvold commented 2 months ago

I think the question is what the value value=1262332800, unit=seconds, tz="america/los_angeles" represents?

I don't think this is under question, as this is specified by the arrow format specification - https://github.com/apache/arrow/blob/main/format/Schema.fbs#L276.

The question instead concerns how casting between timestamps should behave where one or other lacks a timezone. The specification has the following to say about this:

/// However, if a Timestamp column has no timezone value, changing it to a
/// non-empty value requires to think about the desired semantics.
/// One possibility is to assume that the original timestamp values are
/// relative to the epoch of the timezone being set; timestamp values should
/// then adjusted to the Unix epoch (for example, changing the timezone from
/// empty to "Europe/Paris" would require converting the timestamp values
/// from "Europe/Paris" to "UTC", which seems counter-intuitive but is
/// nevertheless correct).

Which is consistent with what we implement

let a = StringArray::from_iter_values([
    "2033-05-18T08:33:20",
    "2033-05-18T08:33:20Z",
    "2033-05-18T08:33:20 +01:00",
]);

let no_timezone = cast(&a, &DataType::Timestamp(TimeUnit::Nanosecond, None)).unwrap();
let back = cast(&no_timezone, &DataType::Utf8).unwrap();
assert_eq!(
    back.as_string::<i32>(),
    &StringArray::from_iter_values([
        "2033-05-18T08:33:20",
        "2033-05-18T08:33:20",
        "2033-05-18T07:33:20" <---- this issue is proposing changing this to 2033-05-18T08:33:20
    ])
);

let with_timezone = cast(
    &a,
    &DataType::Timestamp(TimeUnit::Nanosecond, Some("+01:00".into())),
)
.unwrap();
let back = cast(&with_timezone, &DataType::Utf8).unwrap();
assert_eq!(
    back.as_string::<i32>(),
    &StringArray::from_iter_values([
        "2033-05-18T08:33:20+01:00",
        "2033-05-18T09:33:20+01:00",
        "2033-05-18T08:33:20+01:00",
    ])
)

Where the specification is ambiguous is what to do when going in the reverse direction, i.e. from a timestamp with a timezone to one without a timezone. Currently we use the time in the UTC epoch. i.e. 2033-05-18T08:33:20 +01:00 -> 2033-05-18T07:33:20. As @alamb rightly points out this is not an inverse of the opposite direction, where 2033-05-18T08:33:20 -> 2033-05-18T08:33:20 +01:00, as recommended by the arrow specification.

The proposal, as far as I understand it, is for 2033-05-18T08:33:20 +01:00 -> 2033-05-18T08:33:20, making the operations "round-trip".

The value should never change when casting.

The arrow format explicitly calls out that the value should change

Edit: It is perhaps worth noting that in the case of a timestamp on the daylight savings boundary, taking the local timestamp instead of UTC as we do currently loses information, and casting back to that timezone will yield an ambiguous timestamp error. I am not sure if this matters.

Edit edit: If we do opt to change as proposed, the old behaviour could be obtained by first converting to the UTC epoch. This would be a strictly metadata operation, as the values are already UTC when a timezone is present. The proposed behaviour is therefore strictly more expressive.

findepi commented 2 months ago

The proposal, as far as I understand it, is for 2033-05-18T08:33:20 +01:00 -> 2033-05-18T08:33:20,

if indeed proposed so, then it would be in line with SQL spec and so in line with what a few other engines do.

alamb commented 1 month ago

An update here is that we think we have a workaround for our usecase that we have added to DataFusion (a to_local_time function): https://github.com/apache/datafusion/pull/11401

I am not sure there is consensus that this change of behavior is desired (though it is not clear to me that there is consensus it would not be desired either)

Edit: It is perhaps worth noting that in the case of a timestamp on the daylight savings boundary, taking the local timestamp instead of UTC as we do currently loses information, and casting back to that timezone will yield an ambiguous timestamp error. I am not sure if this matters.

This worries me as it sounds like it means it would be impossible to make timestamps roundtripping in all cases

It isn't clear to me that "never error but doesn't round trip" to "round trips but sometimes errors" is an improvement in functionality

Omega359 commented 1 month ago

Postgresql has what I would call 'sane' behaviour in the face of ambiguous timestamps. While I think that PG's has some really odd behaviour with multiple 'at time zone's (see this comment) I think in this case the direction they went with is a solid defensible one. If arrow-rs was to emulate that behaviour then round trips wouldn't error but may possibly result in a different time.

Has anyone looked into what the other arrow implementations do to handle this ambiguous portion of the spec?