Closed alamb closed 1 year ago
@waitingkuo / @liukun4515 / @ozankabak I wonder if you have any thoughts about this proposal (related to reducing the impact of https://github.com/apache/arrow-datafusion/issues/7068)
I will discuss with my team and share our thoughts tomorrow.
I think the way you suggested seems to be the most rational option. Since we do not use year and month units for durations, we can maybe ignore those parts while formatting.
I believe @tustvold plans to handle this next week
Thinking about this a bit more, I think it would be cleanest if we added support for this to FormatOptions, this can then be plumbed through CastOptions.
https://github.com/apache/arrow-rs/pull/4581 adds the necessary upstream changes
Closed by #6832
Is your feature request related to a problem or challenge?
In https://github.com/apache/arrow-datafusion/pull/6832 and https://github.com/apache/arrow-datafusion/issues/7068 , @tustvold proposes to use
Duration
consistently for timestamp arithmetic which makes the code simpler and consistent across DataFusionHowever, one implication of this change, at the time of this writigng, is shown https://github.com/apache/arrow-datafusion/pull/6832#discussion_r1267234606
Specifically, in DataFusion 28.0.0 the output of
timestamp
-timestamp
is a interval and is displayed like this:However, without any additional changes after https://github.com/apache/arrow-datafusion/pull/6832 it is displayed in ISO 8601 duration format
I think this is problematic because the output:
select now() + interval '1 year';
)P198DT72932.972880S
), is not widely used for displaying intervals to humansCasting logic difference will means that the following query in 28.0.0
Would produce this in 29.0.0:
Describe the solution you'd like
I propose adding specific code for formatting DataFusion output, used in datafusion-cli, tests, and sqllogictests that formats Durations consistently with how
Interval
s are displayed in 28.0.0.This change will unblock the arrow upgrade in https://github.com/apache/arrow-datafusion/pull/6832 as well as give us a single location to control formatting in DataFusion making it easier to improve the formatting of intervals / durations (or other types) in the future if we desire.
Anyone using DataFusion programmatically can either choose to use this new formatting routine, or can format Durations / Intervals to their specific need using arrow-rs or custom code (as we do in IOx, for example (source))
While this solution still suffers from the
Duration
-->String
casting problem, I think it is ok because I don't think it is very common and there is a workaround (to cast the duration to an interval):For those people who want even more control of duration printing, it would probably make sense to add a specific
to_char
or similar function: https://www.postgresql.org/docs/current/functions-formatting.htmlDescribe alternatives you've considered
Change the parsing / formattting of Durations in arrow-rs
The idea is to make PRs like https://github.com/apache/arrow-rs/pull/4557 that would change the formatting in arrow-rs
Some challenges with this approach:
Special case String casting / parsing
One way to avoid issues here would be to just special case the casting and parsing (
Duration
<==>String
) in datafusion and instead of calling the arrowcast
function we could calldatafusion_cast
which would have special handling for some type and then fall back to the arrowcast
kernelThe challenges with this approach are:
datafusion_cast
rather thancast
cast
)Additional context
No response