apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.57k stars 775 forks source link

Comparison of `DataType::Interval` and `Interval**Array` #5200

Open alamb opened 10 months ago

alamb commented 10 months ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do. Data of type DateTime::Interval (spans of time measured in calendar units like days and months) are tricky to compare because the absolute size (in number of seconds) is not a fixed quantity. For example the 1 month is 28 days for February but 1 month is 31 days in December.

This makes the seemingly simple operation of comparing two intervals quite complicated in practice. For example is 1 month more or less than 30 days? The answer depends on what month you are talking about.

Arrow also includes a type DateTime::Duration that is a fixed width of time and suffers from many fewer challenges in comparisons, but may not be as intuitive for humans to use;

https://github.com/apache/arrow-rs/pull/5180 from @berkaysynnada and @ozankabak noted that while in general it is impossible to determine if any arbitrary Interval is greater/equal/less than another arbitrary Interval it is possible to to order some intervals. For example, 10 days is always less than 30 days, even though it is not possible to know if 1 month is less than 30 days without additional information not encoded in the Interval

Describe the solution you'd like I am not sure.

I filed this ticket to 1. Try and describe the challenge beter and 2. Have a discussion about what, if any, additional semantics are needed

Describe alternatives you've considered

Nothing / Improve Documentation

The current handling of intervals is now documented after https://github.com/apache/arrow-rs/pull/5192 so I hope there is less confusion about the semantics. If downstream crates want to compare intervals they can do so by implementing their own kernels

New kernels

One possibility would be to add new comparison kernels that were specific to intervals and implemented partial order interval comparisons as proposed in https://github.com/apache/arrow-rs/pull/5180

A comparison returns true if it holds certainly. Otherwise, it returns false. Going back to our example, we return false for both 1 month < 30 days and 1 month > 30 days. It is impossible to impose a total order on intervals, but we can impose a consistent partial order with this logic.

Something like

let arr1 = IntervalMonthDayNanoArray::from(...);
let arr2 = IntervalMonthDayNanoArray::from(...);

// compare arr1 and arr2 using interval specific logic / partial order
let res = interval_partial_lt()

This would have the benefit of implementing interval specific semantics and would be very clear it is different than other comparison kernels

Forced cast to Duration

One way to define a deterministic comparsison is to covert months to 30 day increments. That would result in a well defined and fast comparisons, but may lead to intuitive results

Additional context Here is a related discussion in DataFusion: https://github.com/apache/arrow-datafusion/issues/8468

tustvold commented 10 months ago

I think we should establish what other systems do, such as postgres, and go from there

berkaysynnada commented 10 months ago

AFAIK Postgre defines a month as exactly 30 days. While this simplification facilitates certain implementations, it can potentially introduce silent bugs. In our DataFusion fork, we have implemented a PartialOrd with semantics as described in this comment. In short, our implementation treats '1 month < 30 days' as false and '1 month >= 30 days' as false. However, '1 month + 1 day > 1 month' is true, and '1 month > 27 days' is also true. The fix also addresses problematic cases involving negative-signed LSD blocks.

If you're interested in how such an implementation could be integrated into DataFusion, we can help with the process.

tustvold commented 10 months ago

AFAIK Postgre defines a month as exactly 30 days

This was what I had understood as well. What do you think of updating the cast kernel to allow casting interval expressions with non-zero days/months to durations, using the same assumptions as postgres makes?

This would have a number of quite compelling advantages in my opinion:

My main hesitancy with defining our own custom partial ordering is that I am not very confident of all the nuances involved in performing such an operation correctly, how to communicate these to users, and am concerned about adding further edge-cases for people to be wary of. Whilst I agree the postgres approach is making some pretty crude assumptions, it is easy to understand and comes with prior precedent...

alamb commented 10 months ago

What do you think of updating the cast kernel to allow casting interval expressions with non-zero days/months to durations, using the same assumptions as postgres makes?

This makes sense to me