apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Support timestamp literals with precision specifier #7249

Open westonpace opened 1 year ago

westonpace commented 1 year ago

Is your feature request related to a problem or challenge?

Postgres supports an optional precision specifier in timestamp literals (e.g. timestamp (3) '2021-01-01 00:00:00.123'). The Postgres spec technically only allows 0-6, but given that Arrow timestamps support nanoseconds, it would probably be best to support 0-9.

Describe the solution you'd like

For my purposes, it would be sufficient to support only precision values of 0, 3, 6, and 9 (seconds, milliseconds, microseconds, and nanoseconds). It should still be possible to support values that aren't a multiple of 3, since the expectation is that this value is used only for parsing the literal and is not a constraint on the type at all (e.g. a timestamp(5) could be stored at microsecond resolution as long as the string is parsed correctly).

Ideally, output would look like the following:

❯ select arrow_typeof(timestamp (6) '2021-01-01 00:00:00.123456789');
+--------------------------------------------------+
| arrow_typeof(Utf8("2021-01-01 00:00:00.123456")) |
+--------------------------------------------------+
| Timestamp(Microsecond, None)                     |
+--------------------------------------------------+

❯ select arrow_typeof(timestamp (3) '2021-01-01 00:00:00.123456789');
+-----------------------------------------------+
| arrow_typeof(Utf8("2021-01-01 00:00:00.123")) |
+-----------------------------------------------+
| Timestamp(Millisecond, None)                  |
+-----------------------------------------------+

❯ select arrow_typeof(timestamp (0) '2021-01-01 00:00:00.123456789');
+-------------------------------------------+
| arrow_typeof(Utf8("2021-01-01 00:00:00")) |
+-------------------------------------------+
| Timestamp(Second, None)                   |
+-------------------------------------------+

❯ select arrow_typeof(timestamp (9) '2021-01-01 00:00:00.123456789');
+-----------------------------------------------------+
| arrow_typeof(Utf8("2021-01-01 00:00:00.123456789")) |
+-----------------------------------------------------+
| Timestamp(Nanosecond, None)                         |
+-----------------------------------------------------+

Example postgres output: https://www.db-fiddle.com/f/oiHdDy1v78mC1zKbCFvWdV/0
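
To make the intended behavior concrete, here is a minimal sketch of how a precision of 0-9 could map onto an Arrow-style time unit and how the literal's fractional seconds could be cut down to that precision. Everything in it is an assumption for illustration: the `TimeUnit` enum is a local stand-in rather than the Arrow type, `unit_for_precision` and `truncate_fraction` are hypothetical helpers, not DataFusion APIs, and the literal is assumed to carry no time zone suffix. The sketch truncates rather than rounds, to match the example output above; rounding would be an equally plausible choice.

```rust
/// Local stand-in for Arrow's time units, so the sketch has no dependencies.
#[derive(Debug, PartialEq, Clone, Copy)]
enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Pick the coarsest unit that can hold `precision` fractional digits.
/// Returns None for precisions outside 0..=9.
fn unit_for_precision(precision: u32) -> Option<TimeUnit> {
    match precision {
        0 => Some(TimeUnit::Second),
        1..=3 => Some(TimeUnit::Millisecond),
        4..=6 => Some(TimeUnit::Microsecond),
        7..=9 => Some(TimeUnit::Nanosecond),
        _ => None,
    }
}

/// Keep at most `precision` fractional-second digits of `literal`.
/// Assumes the only '.' in the literal separates the fractional seconds.
fn truncate_fraction(literal: &str, precision: usize) -> String {
    match literal.split_once('.') {
        // No fractional part: nothing to truncate.
        None => literal.to_string(),
        // Precision 0 drops the fraction and the decimal point.
        Some((whole, _)) if precision == 0 => whole.to_string(),
        Some((whole, frac)) => {
            let kept: String = frac.chars().take(precision).collect();
            format!("{}.{}", whole, kept)
        }
    }
}
```

With these helpers, `unit_for_precision(5)` yields `Some(TimeUnit::Microsecond)`, which is the "timestamp(5) stored at microsecond resolution" case, and `truncate_fraction("2021-01-01 00:00:00.123456789", 3)` yields `"2021-01-01 00:00:00.123"`.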

Describe alternatives you've considered

A pretty usable workaround at the moment is to cast:

# These should be equivalent
timestamp (6) '2021-01-01 00:00:00'
arrow_cast(timestamp '2021-01-01 00:00:00', 'Timestamp(Microsecond, None)')

Unfortunately, this requires DataFusion-specific functions (arrow_cast), and it would also break backwards compatibility with Lance's current SQL parsing.

Additional context

No response

fernandocast commented 1 year ago

Hello, I'm new to the arrow-datafusion project and I would like to contribute 😄 Is there any chance I could help with this issue?

Weijun-H commented 1 year ago

Is it possible to alias this feature to arrow_trunc since they share similarities 🤔?

fernandocast commented 1 year ago

Hello, would someone invite me to the DataFusion Slack channel?

alamb commented 1 year ago

@fernandocast -- please let me know what email address you would like to use (either email me at alamb@influxdata.com, or join the Discord channel https://arrow.apache.org/datafusion/contributor-guide/communication.html#slack-and-discord and ask there).

alamb commented 1 year ago

This sounds like a nice feature to me, FWIW

It looks like sqlparser already supports the feature https://docs.rs/sqlparser/0.36.1/sqlparser/ast/enum.DataType.html#variant.Timestamp

So it would be a matter of hooking up DataFusion to it

findepi commented 1 month ago

Postgres supports an optional precision specifier in timestamp literals (e.g. timestamp (3) '2021-01-01 00:00:00.123').

I don't think this should be necessary. From a literal precision inference perspective, TIMESTAMP literals are no different from DECIMAL or varchar literals. The literal's precision should be reflected in the literal's type: TIMESTAMP '2021-01-01 00:00:00.123' clearly has millisecond precision, so requiring the user to add the (3) part is redundant. If a user wants the literal to be parsed with a specific precision, they can use CAST('2021-01-01 00:00:00.123' AS timestamp(p)) or the shorter '2021-01-01 00:00:00.123'::timestamp(p).
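
The inference approach described in this comment could be sketched roughly as follows: count the fractional-second digits in the literal and choose the unit from that count. This is only an illustration under stated assumptions. `inferred_precision` and `inferred_unit` are hypothetical names, not DataFusion or sqlparser functions, the unit is returned as a plain string to keep the sketch dependency-free, and the literal is assumed to contain at most one '.' (the one before the fractional seconds).

```rust
/// Count the fractional-second digits in a timestamp literal.
/// Stops at the first non-digit, so a trailing suffix is not counted.
fn inferred_precision(literal: &str) -> usize {
    match literal.split_once('.') {
        Some((_, frac)) => frac.chars().take_while(|c| c.is_ascii_digit()).count(),
        None => 0,
    }
}

/// Map the inferred precision to the coarsest unit name that can hold it.
/// Returns None when the literal has more than 9 fractional digits.
fn inferred_unit(literal: &str) -> Option<&'static str> {
    match inferred_precision(literal) {
        0 => Some("Second"),
        1..=3 => Some("Millisecond"),
        4..=6 => Some("Microsecond"),
        7..=9 => Some("Nanosecond"),
        _ => None,
    }
}
```

Under this sketch, `inferred_unit("2021-01-01 00:00:00.123")` gives `Some("Millisecond")`, matching the millisecond-precision reading of that literal, while a literal with no fractional part is inferred as `Some("Second")`.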