apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.55k stars 766 forks source link

Subtracting `Timestamp` from `Timestamp` should produce a `Duration` (not `Timestamp`) #3964

Closed alamb closed 1 year ago

alamb commented 1 year ago

Describe the bug Subtracting two Timestamp columns results in another Timestamp which is not correct

To Reproduce

fn subtract_timestamps() {
    let arr1 = TimestampNanosecondArray::from(vec![
        1_000,
        1_000_000,
        1_000_000_000,
        1_000_000_000_000,
        1_000_000_000_000_000,
        1_000_000_000_000_000_000,
    ]);

    let arr2 = TimestampNanosecondArray::from(vec![
        2_000,
        2_000_000,
        2_000_000_000,
        2_000_000_000_000,
        2_000_000_000_000_000,
        2_000_000_000_000_000_000,
    ]);

    println!(
        "input:\n{}\n{}",
        pretty_format_columns("arr1", &[Arc::new(arr1.clone()) as _]).unwrap(),
        pretty_format_columns("arr2", &[Arc::new(arr2.clone()) as _]).unwrap(),
    );

    let interval = subtract_dyn(&arr2, &arr1).unwrap();

    println!(
        "output{}:\n{}",
        interval.data_type(),
        pretty_format_columns("interval", &[Arc::new(interval.clone()) as _]).unwrap(),
    );
}

Which produces

input:
+----------------------------+
| arr1                       |
+----------------------------+
| 1970-01-01T00:00:00.000001 |
| 1970-01-01T00:00:00.001    |
| 1970-01-01T00:00:01        |
| 1970-01-01T00:16:40        |
| 1970-01-12T13:46:40        |
| 2001-09-09T01:46:40        |
+----------------------------+
+----------------------------+
| arr2                       |
+----------------------------+
| 1970-01-01T00:00:00.000002 |
| 1970-01-01T00:00:00.002    |
| 1970-01-01T00:00:02        |
| 1970-01-01T00:33:20        |
| 1970-01-24T03:33:20        |
| 2033-05-18T03:33:20        |
+----------------------------+
output Timestamp(Nanosecond, None):
+----------------------------+
| interval                   |
+----------------------------+
| 1970-01-01T00:00:00.000001 |
| 1970-01-01T00:00:00.001    |
| 1970-01-01T00:00:01        |
| 1970-01-01T00:16:40        |
| 1970-01-12T13:46:40        |
| 2001-09-09T01:46:40        |
+----------------------------+

Expected behavior ~I expect the output to be an interval of type Interval(MonthDayNano) (not Timestamp)~

Updated (after discussion with @tustvold ) I expect the output to be a duration of type Duration(unit) where unit is the same as the source Timestamp(unit) (not Timestamp)

Additional context Here is what postgres does:

postgres=# create table test as select now() - now();
SELECT 1
postgres=# \d test
                 Table "public.test"
  Column  |   Type   | Collation | Nullable | Default
----------+----------+-----------+----------+---------
 ?column? | interval |           |          |

postgres=#
tustvold commented 1 year ago

I expect the output to be an interval of type Interval(MonthDayNano) (not Timestamp)

Should it be an interval, or should it be a duration? The latter is significantly easier to reason about, although it would appear postgres lacks such a construct.

alamb commented 1 year ago

Should it be an interval, or should it be a duration?

I guess I was thinking MonthDayNano with month=0 and day=0 would be the same as a Duration 🤔 Not sure if that makes sense

alamb commented 1 year ago

I would also be fine with Duration -- changing ticket

alamb commented 1 year ago

I think this is a good first issue because the semantics are well defined and there are some existing examples

leetcode-1533 commented 1 year ago

hi, I would like to take this task

leetcode-1533 commented 1 year ago

it is my first rust related PR, just want to highlight my plan for implement this:

Within function subtract_dyn(), we currently use typed_dict_math_op() and math_op().

I would like to impement antoher function for timestamp types, called timestamp_math_op(), which deals with timestamp type specific operations. And always returns Duration when calculating between two timestamps.

alamb commented 1 year ago

I would like to impement antoher function for timestamp types, called timestamp_math_op(), which deals with timestamp type specific operations. And always returns Duration when calculating between two timestamps.

Sounds good to me

berkaysynnada commented 1 year ago

it is my first rust related PR, just want to highlight my plan for implement this:

Within function subtract_dyn(), we currently use typed_dict_math_op() and math_op().

I would like to impement antoher function for timestamp types, called timestamp_math_op(), which deals with timestamp type specific operations. And always returns Duration when calculating between two timestamps.

Thanks for working on this issue :) After merging that PR in datafusion, timestamp - timestamp operations give results in Interval types, and it is planned to get arithmetics moved to arrow-rs. Can you check that PR to not have an inconsistent behavior?

tustvold commented 1 year ago

After merging that PR in datafusion, timestamp - timestamp operations give results in Interval types,

Can we perhaps revisit this decision in DataFusion, interval types are probably not the correct type to be returning for such operations on timestamps. A duration faithfully represents the difference between two timestamps in absolute time, an interval does not, instead representing a logical quantity the meaning of which depends on the timestamp it is applied to. I would expect timestamp arithmetic to return Duration, with query engines able to insert type coercions if they really need an interval for some reason. Crucially a duration can be converted to an interval, but the reverse transformation is not generally possible

alamb commented 1 year ago

I think it would be fine to support converting to duration in arrow-rs (and we can convert to Interval in datafusion as needed)

The reason we are pushing ahead with interval (rather than interval and duration) in DataFusion is to get something working incrementally without having to sort out all the subtleties with Intervals, Durations, arithmetic and conversions.

Then I think over time we can and will add more sophistication (like making the distinction between Duration and Interval and coercing automatically between them) to DataFusion

Thus, I suggest we get the kernels correct in arrow-rs (and provide the appropriate casting operations) and then we can upgrade DataFusion to use them.

So in this case, let's have timestamp - timestamp produce duration in arrow.rs sounds good.

I will also file a ticket about casting to/from Duration and Interval

alamb commented 1 year ago

I filed https://github.com/apache/arrow-rs/issues/3998 to track casting durations to/from intervals

ozankabak commented 1 year ago

The latter is significantly easier to reason about, although it would appear postgres lacks such a construct.

Yes, the issue is most DB's (and SQL itself) simply use the timestamp - timestamp = interval pattern. However, from our perspective using durations is fine as long as there is casting/coercing mechanism to take care of the transformation cheaply.

alamb commented 1 year ago

Yes, the issue is most DB's (and SQL itself) simply use the timestamp - timestamp = interval pattern. However, from our perspective using durations is fine as long as there is casting/coercing mechanism to take care of the transformation cheaply.

I think the core problem is that the clear distinction between "duration" and "interval" that is made in Arrow and Rust's standard library is not found in SQL (e.g. there is no SQL duration type). The coercion / casting logic in DataFusion I think is the right place to reconcile the various Arrow types (Intervals with different time units, durations with different time units)

So in other words, even though I expect SQL / DataFusion users to mostly work with Intervals, having DataFusion sort out how to call the appropriate interval kernels in arrow-rs with coercion, etc would be the ideal approach

ozankabak commented 1 year ago

SGTM

tustvold commented 1 year ago

label_issue.py automatically added labels {'arrow'} from #4244