apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.26k stars 1.18k forks source link

Special case Duration display in datafusion-cli and sqllogictest #7070

Closed alamb closed 1 year ago

alamb commented 1 year ago

Is your feature request related to a problem or challenge?

In https://github.com/apache/arrow-datafusion/pull/6832 and https://github.com/apache/arrow-datafusion/issues/7068 , @tustvold proposes to use Duration consistently for timestamp arithmetic which makes the code simpler and consistent across DataFusion

However, one implication of this change, at the time of this writigng, is shown https://github.com/apache/arrow-datafusion/pull/6832#discussion_r1267234606

Specifically, in DataFusion 28.0.0 the output of timestamp - timestamp is a interval and is displayed like this:


❯ select (now() - '2023-01-01'::timestamp);
+-----------------------------------------------------------+
| now() - Utf8("2023-01-01")                                |
+-----------------------------------------------------------+
| 0 years 0 mons 204 days 16 hours 16 mins 8.121077000 secs |
+-----------------------------------------------------------+
1 row in set. Query took 0.004 seconds.

However, without any additional changes after https://github.com/apache/arrow-datafusion/pull/6832 it is displayed in ISO 8601 duration format

❯ select (now() - '2023-01-01'::timestamp);
+----------------------------+
| now() - Utf8("2023-01-01") |
+----------------------------+
| P198DT72932.972880S        |
+----------------------------+
1 row in set. Query took 0.002 seconds.

I think this is problematic because the output:

  1. Looks wildly different than 28.0.0
  2. Looks wildly different than how intervals are expressed in queries (e.g. select now() + interval '1 year';)
  3. Will likely cause significant confusion as the ISO duration format, (e.g. P198DT72932.972880S), is not widely used for displaying intervals to humans
  4. The casting logic to strings will be different (see below)

Casting logic difference will means that the following query in 28.0.0

❯ select (now() - '2023-01-01'::timestamp)::varchar || ' ago';
+----------------------------------------------------------------+
| now() - Utf8("2023-01-01") || Utf8(" ago")                     |
+----------------------------------------------------------------+
| 0 years 0 mons 204 days 16 hours 34 mins 15.407810000 secs ago |
+----------------------------------------------------------------+

Would produce this in 29.0.0:

+----------------------------------------------------------------+
| now() - Utf8("2023-01-01") || Utf8(" ago")                     |
+----------------------------------------------------------------+
| P198DT72932.972880S ago |
+----------------------------------------------------------------+

Describe the solution you'd like

I propose adding specific code for formatting DataFusion output, used in datafusion-cli, tests, and sqllogictests that formats Durations consistently with how Intervals are displayed in 28.0.0.

This change will unblock the arrow upgrade in https://github.com/apache/arrow-datafusion/pull/6832 as well as give us a single location to control formatting in DataFusion making it easier to improve the formatting of intervals / durations (or other types) in the future if we desire.

Anyone using DataFusion programmatically can either choose to use this new formatting routine, or can format Durations / Intervals to their specific need using arrow-rs or custom code (as we do in IOx, for example (source))

While this solution still suffers from the Duration --> String casting problem, I think it is ok because I don't think it is very common and there is a workaround (to cast the duration to an interval):

❯ select (now() - '2023-01-01'::timestamp)::interval::varchar || ' ago';
+----------------------------------------------------------------+
| now() - Utf8("2023-01-01") || Utf8(" ago")                     |
+----------------------------------------------------------------+
| 0 years 0 mons 204 days 16 hours 34 mins 15.407810000 secs ago |
+----------------------------------------------------------------+

For those people who want even more control of duration printing, it would probably make sense to add a specific to_char or similar function: https://www.postgresql.org/docs/current/functions-formatting.html

Describe alternatives you've considered

Change the parsing / formattting of Durations in arrow-rs

The idea is to make PRs like https://github.com/apache/arrow-rs/pull/4557 that would change the formatting in arrow-rs

Some challenges with this approach:

  1. It is not clear if there is an "ideal" canonical format is for durations
  2. It would be a backwards compatible change that would likely impact other projects than DataFusion
  3. Existing data that was serialized with Durations (e.g. JSON or CSV) may be impacted

Special case String casting / parsing

One way to avoid issues here would be to just special case the casting and parsing (Duration <==> String) in datafusion and instead of calling the arrow cast function we could call datafusion_cast which would have special handling for some type and then fall back to the arrow cast kernel

The challenges with this approach are:

  1. It would be brittle (it would be hard to ensure that all new code called datafusion_cast rather than cast
  2. It is not clear that it would be entirely consistent (e.g. if there is code in arrow itself that calls cast)

Additional context

No response

alamb commented 1 year ago

@waitingkuo / @liukun4515 / @ozankabak I wonder if you have any thoughts about this proposal (related to reducing the impact of https://github.com/apache/arrow-datafusion/issues/7068)

ozankabak commented 1 year ago

I will discuss with my team and share our thoughts tomorrow.

berkaysynnada commented 1 year ago

I think the way you suggested seems to be the most rational option. Since we do not use year and month units for durations, we can maybe ignore those parts while formatting.

alamb commented 1 year ago

I believe @tustvold plans to handle this next week

tustvold commented 1 year ago

Thinking about this a bit more, I think it would be cleanest if we added support for this to FormatOptions, this can then be plumbed through CastOptions.

https://github.com/apache/arrow-rs/pull/4581 adds the necessary upstream changes

tustvold commented 1 year ago

Closed by #6832