Open alamb opened 10 months ago
I think we should establish what other systems do, such as postgres, and go from there
AFAIK Postgre defines a month as exactly 30 days. While this simplification facilitates certain implementations, it can potentially introduce silent bugs. In our DataFusion fork, we have implemented a PartialOrd with semantics as described in this comment. In short, our implementation treats '1 month < 30 days' as false and '1 month >= 30 days' as false. However, '1 month + 1 day > 1 month' is true, and '1 month > 27 days' is also true. The fix also addresses problematic cases involving negative-signed LSD blocks.
If you're interested in how such an implementation could be integrated into DataFusion, we can help with the process.
AFAIK Postgre defines a month as exactly 30 days
This was what I had understood as well. What do you think of updating the cast kernel to allow casting interval expressions with non-zero days/months to durations, using the same assumptions as postgres makes?
This would have a number of quite compelling advantages in my opinion:
My main hesitancy with defining our own custom partial ordering is that I am not very confident of all the nuances involved in performing such an operation correctly, how to communicate these to users, and am concerned about adding further edge-cases for people to be wary of. Whilst I agree the postgres approach is making some pretty crude assumptions, it is easy to understand and comes with prior precedent...
What do you think of updating the cast kernel to allow casting interval expressions with non-zero days/months to durations, using the same assumptions as postgres makes?
This makes sense to me
Is your feature request related to a problem or challenge? Please describe what you are trying to do. Data of type DateTime::Interval (spans of time measured in calendar units like days and months) are tricky to compare because the absolute size (in number of seconds) is not a fixed quantity. For example the
1 month
is 28 days for February but1 month
is 31 days in December.This makes the seemingly simple operation of comparing two intervals quite complicated in practice. For example is
1 month
more or less than30 days
? The answer depends on what month you are talking about.Arrow also includes a type DateTime::Duration that is a fixed width of time and suffers from many fewer challenges in comparisons, but may not be as intuitive for humans to use;
https://github.com/apache/arrow-rs/pull/5180 from @berkaysynnada and @ozankabak noted that while in general it is impossible to determine if any arbitrary
Interval
is greater/equal/less than another arbitraryInterval
it is possible to to order some intervals. For example,10 days
is always less than30 days
, even though it is not possible to know if1 month
is less than30 days
without additional information not encoded in theInterval
Describe the solution you'd like I am not sure.
I filed this ticket to 1. Try and describe the challenge beter and 2. Have a discussion about what, if any, additional semantics are needed
Describe alternatives you've considered
Nothing / Improve Documentation
The current handling of intervals is now documented after https://github.com/apache/arrow-rs/pull/5192 so I hope there is less confusion about the semantics. If downstream crates want to compare intervals they can do so by implementing their own kernels
New kernels
One possibility would be to add new comparison kernels that were specific to intervals and implemented partial order interval comparisons as proposed in https://github.com/apache/arrow-rs/pull/5180
Something like
This would have the benefit of implementing interval specific semantics and would be very clear it is different than other comparison kernels
Forced cast to
Duration
One way to define a deterministic comparsison is to covert
months
to30 day
increments. That would result in a well defined and fast comparisons, but may lead to intuitive resultsAdditional context Here is a related discussion in DataFusion: https://github.com/apache/arrow-datafusion/issues/8468