apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0
447 stars 100 forks source link

feat: Implement Spark-compatible CAST from string to timestamp types #335

Closed vaibhawvipul closed 2 weeks ago

vaibhawvipul commented 3 weeks ago

Which issue does this PR close?

Closes #328 .

Rationale for this change

Improve compatibility with Apache Spark

What changes are included in this PR?

Add custom implementation of CAST from string to timestamp rather than delegate to DataFusion

How are these changes tested?

andygrove commented 3 weeks ago

@vaibhawvipul You can run make format to fix the linting issues that are causing some CI tests to fail

vaibhawvipul commented 2 weeks ago

Update -

Working on this bug, there is 8 hours difference in comet vs spark cast. If anyone has any pointer, please let me know.

Screenshot 2024-04-29 at 10 19 56 AM
andygrove commented 2 weeks ago

Update -

Working on this bug, there is 8 hours difference in comet vs spark cast. If anyone has any pointer, please let me know.

Screenshot 2024-04-29 at 10 19 56 AM

What happens if you explicitly set the timezone to UTC in the Spark configs?

We may want to fallback to Spark for other timezones for now.

vaibhawvipul commented 2 weeks ago

Thanks @andygrove , setting timestamp as UTC worked!

Any idea about this error?

test("cast string to timestamp") {
    withSQLConf(
      (SQLConf.SESSION_LOCAL_TIMEZONE.key -> "UTC"),
      (CometConf.COMET_CAST_STRING_TO_TIMESTAMP.key -> "true")) {
      val values = Seq("2020-01-01T12:34:56.123456", "T2").toDF("a")
      castTest(values, DataTypes.TimestampType)
    }
  }
Screenshot 2024-04-29 at 10 23 06 PM
vaibhawvipul commented 2 weeks ago

seems like, test is failing in ANSI mode, debugging that.

vaibhawvipul commented 2 weeks ago

@andygrove

I have enabled a "string to timestamp" cast test case that passes. We are supporting all the timestamps mentioned in the issue.

[

Screenshot 2024-04-30 at 12 30 35 PM

](url)

Input Spark Comet
2020 2020-01-01 00:00:00.0 2020-01-01 00:00:00.0
2020-01 2020-01-01 00:00:00.0 2020-01-01 00:00:00.0
2020-01-01 2020-01-01 00:00:00.0 2020-01-01 00:00:00.0
2020-01-01T12 2020-01-01 12:00:00.0 2020-01-01 12:00:00.0
2020-01-01T12:34 2020-01-01 12:34:00.0 2020-01-01 12:34:00.0
2020-01-01T12:34:56 2020-01-01 12:34:56.0 2020-01-01 12:34:56.0
2020-01-01T12:34:56.123456 2020-01-01 12:34:56.123456 2020-01-01 12:34:56.123456
T2 2024-04-25 02:00:00.0 2024-04-25 02:00:00.0

Just a note -

andygrove commented 2 weeks ago

seems like, test is failing in ANSI mode, debugging that.

There is an improved version of the ANSI testing in https://github.com/apache/datafusion-comet/pull/351

andygrove commented 2 weeks ago

Thanks @vaibhawvipul this is looking great. I will review today.

andygrove commented 2 weeks ago
  • Spark supports some additional inputs like - "213" or "85678". i.e 3 digit and 5 digit years too! This is not being supported right now by Comet in this PR.

I would be fine with handling this in a follow on PR

andygrove commented 2 weeks ago

This is looking great @vaibhawvipul. I think this is close to being ready to merge and then have some follow on issues for remaining items.

I think the one thing I would like to see before merging is that we can run the fuzz test without causing any panics, so just replacing unwraps with error handling.

vaibhawvipul commented 2 weeks ago

This is looking great @vaibhawvipul. I think this is close to being ready to merge and then have some follow on issues for remaining items.

I think the one thing I would like to see before merging is that we can run the fuzz test without causing any panics, so just replacing unwraps with error handling.

@andygrove all unwraps are removed, I am handling errors in all of my functions.

parthchandra commented 2 weeks ago

As a follow up, I would also suggest adding/verifying support for any timezone. (See this test for instance: https://github.com/apache/datafusion-comet/blob/064cb472a7f55b6b67f44c22d6a8110320b330e7/spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala#L262)

Also see - https://docs.rs/arrow-cast/50.0.0/arrow_cast/parse/fn.string_to_datetime.html

andygrove commented 2 weeks ago

I filed https://github.com/apache/datafusion-comet/issues/376 for the follow on work