apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

feat: Implement Spark-compatible CAST from String to Date #383

Closed vidyasankarv closed 3 weeks ago

vidyasankarv commented 1 month ago

Which issue does this PR close?

Closes #327.

Rationale for this change

What changes are included in this PR?

How are these changes tested?

vidyasankarv commented 1 month ago

This PR is still in progress. I added support for String to Date32.

Hi @parthchandra @andygrove Can you please review if this is going in the right direction.

Questions:

andygrove commented 1 month ago

This PR is still in progress. I added support for String to Date32.

  • Spark supports dates in the formats YYYY and YYYY-MM, which DataFusion does not - now supported
  • Spark supports a trailing T, as in 2024-01-01T, which DataFusion does not - now supported
  • DataFusion doesn't throw an exception for invalid inputs in ANSI mode - now returns an error in ANSI mode if the date can't be parsed.

Hi @parthchandra @andygrove Can you please review if this is going in the right direction.

Questions:

@vidyasankarv Yes, I would say this is going in a good direction based on a very quick review. I will try and find more time tomorrow for a deeper look. To answer your questions:

parthchandra commented 1 month ago

You're pretty close to handling these cases (see my comment in the review)

vidyasankarv commented 1 month ago

Hi @parthchandra @andygrove, I made the changes as suggested and ported the date parsing logic from SparkDateTimeUtils. The previous version read much more simply, though it was missing a couple of features, like handling trailing characters such as extra spaces (diff).
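
For readers following along, here is a minimal sketch of the segment-based approach that SparkDateTimeUtils takes; `parse_spark_date` is a made-up name and the rules here are simplified, not the code in this PR: accept yyyy, yyyy-mm, or yyyy-mm-dd with an optional sign, ignore anything from a trailing T onwards, and trim surrounding whitespace.

```rust
use chrono::NaiveDate;

// Illustrative only: the real parser ported from SparkDateTimeUtils also
// validates segment lengths (e.g. it rejects `2020-10-010` and a bare `263`).
fn parse_spark_date(value: &str) -> Option<NaiveDate> {
    let trimmed = value.trim();
    if trimmed.is_empty() {
        return None;
    }
    // Everything from a trailing 'T' onwards is ignored, as in `2024-01-01T`.
    let date_part = trimmed.split('T').next()?;
    // An optional leading sign applies to the year.
    let (sign, digits) = match date_part.strip_prefix('-') {
        Some(rest) => (-1i32, rest),
        None => (1i32, date_part.strip_prefix('+').unwrap_or(date_part)),
    };
    let mut segments = digits.split('-');
    let year: i32 = segments.next()?.parse().ok()?;
    // Missing month/day segments default to 1, so `2024` and `2024-01` parse.
    let month: u32 = segments.next().map_or(Some(1), |s| s.parse().ok())?;
    let day: u32 = segments.next().map_or(Some(1), |s| s.parse().ok())?;
    if segments.next().is_some() {
        return None; // more than three segments is malformed
    }
    // from_ymd_opt returns None for invalid or out-of-range dates.
    NaiveDate::from_ymd_opt(sign * year, month, day)
}
```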

This PR does not enable the fuzz test in CometCastSuite for casting String to Date, as NaiveDate only supports dates in the range below; the dates generated by the fuzz test for matching against Spark fall outside this range and hence cause a mismatch between the results.

/// The minimum possible `NaiveDate` (January 1, 262145 BCE).
#[deprecated(since = "0.4.20", note = "Use NaiveDate::MIN instead")]
pub const MIN_DATE: NaiveDate = NaiveDate::MIN;
/// The maximum possible `NaiveDate` (December 31, 262143 CE).
#[deprecated(since = "0.4.20", note = "Use NaiveDate::MAX instead")]
pub const MAX_DATE: NaiveDate = NaiveDate::MAX;

In ANSI mode, any format validation failure returns an error, whereas the other modes return None. However, when all validations pass but the date is beyond the range supported by NaiveDate, all modes return None.
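
To make the mode handling concrete, here is a rough sketch of the behaviour described above; the `EvalMode` enum, the error string, and `try_parse_spark_date` are placeholders, not Comet's actual definitions.

```rust
// Rough sketch of the eval-mode behaviour described above; names are
// placeholders, not Comet's actual types.
#[derive(PartialEq)]
enum EvalMode {
    Legacy,
    Ansi,
    Try,
}

/// Outer None: the string failed format validation.
/// Inner None: the format was valid but the date is outside NaiveDate's range.
fn try_parse_spark_date(_value: &str) -> Option<Option<i32>> {
    None // placeholder; the real parser returns days since the Unix epoch
}

fn cast_string_to_date32(value: &str, eval_mode: &EvalMode) -> Result<Option<i32>, String> {
    match try_parse_spark_date(value) {
        // Out-of-range dates become null in every mode.
        Some(maybe_days) => Ok(maybe_days),
        // Only ANSI mode surfaces format-validation failures as errors.
        None if *eval_mode == EvalMode::Ansi => Err(format!("cannot cast {value} to DateType")),
        // LEGACY and TRY modes tolerate malformed input and return null.
        None => Ok(None),
    }
}
```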

@parthchandra Regarding these points

Can you please take another look at the PR? Thank you.

andygrove commented 1 month ago

Thanks @vidyasankarv. I plan on carefully reviewing this later today.

vidyasankarv commented 1 month ago

Hi @andygrove, I need some help: https://github.com/apache/datafusion-comet/pull/383/files#diff-41ecdd113d7a7afe33447e34f1ff0b5ed3033a89bfbcefa9e7e259d7a6e4daecR593

These particular test cases fail. For example, when I run with the sample input 2020-10-010T, the same test in Rust returns a CometError in ANSI mode: https://github.com/apache/datafusion-comet/pull/383/files#diff-b7339cca414a6315488506dd33654946f62c229feb8ad0d4abeda683ca75b4b5R1717

However, when the CometCastSuite test for String to Date runs, it fails for the combination of Comet with ANSI enabled and without try_cast.

I have tried to debug concurrently in CLion (https://github.com/apache/datafusion-comet/blob/main/docs/source/contributor-guide/debugging.md), but the breakpoints show as disabled and are not hit; I also tried switching to LLDB in the CLion toolchain, but that did not help. I added some additional logging locally and can see that the Rust side returns a CometError for the invalid value as expected, but the Comet side returns None when running from the Scala test suite.

Is there something else I could be missing in terms of configuration? I appreciate any help when you get some time.

Thank you

codecov-commenter commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 34.17%. Comparing base (14494d3) to head (ae575e5). Report is 11 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##              main     #383      +/-  ##
============================================
+ Coverage     34.02%   34.17%   +0.15%
+ Complexity      857      850       -7
============================================
  Files           116      116
  Lines         38565    38547      -18
  Branches       8517     8523       +6
============================================
+ Hits          13120    13174      +54
+ Misses        22691    22608      -83
- Partials       2754     2765      +11
```


andygrove commented 1 month ago

There is one test failure with JDK 8 / Spark 3.2:

- cast StringType to DateType *** FAILED *** (349 milliseconds)
  "[CAST_INVALID_INPUT] The value '0' of the type "STRING" cannot be cast to "DATE" because it is malformed. Correct the value as per the syntax, or change its target type. Use `try_cast` to tolerate malformed input and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error." did not contain "annot cast 0 to DateType." (CometCastSuite.scala:1069)

andygrove commented 1 month ago

@vidyasankarv I would suggest that we skip the test for now when running against Spark 3.2 and file a follow-on issue to fix 3.2 compatibility (this may not be a high priority since 3.2 is quite old and we should consider dropping support for it at some point).

You can add an assume call to the test to skip it under certain criteria:

  test("cast StringType to DateType") {
    assume(CometSparkSessionExtensions.isSpark33Plus)

It would be good to add a comment in here as well with a link to the follow-on issue (could you file that?)

vidyasankarv commented 1 month ago

@vidyasankarv I would suggest that we skip the test for now when running against Spark 3.2 and file a follow-on issue to fix 3.2 compatibility (this may not be a high priority since 3.2 is quite old and we should consider dropping support for it at some point).

You can add an assume call to the test to skip it under certain criteria:

  test("cast StringType to DateType") {
    assume(CometSparkSessionExtensions.isSpark33Plus)
    // ... existing test body ...
  }

It would be good to add a comment in here as well with a link to the follow-on issue (could you file that?)

@andygrove thank you for the suggestions - I filed this issue https://github.com/apache/datafusion-comet/issues/440 and linked it in the test.

vidyasankarv commented 4 weeks ago

https://github.com/apache/datafusion-comet/suites/23883332179/logs?attempt=2

@andygrove This build included the fuzz test for cast String to DateType - https://github.com/apache/datafusion-comet/pull/383/commits/19fa952832e1c9669371f291df70f13a8c6ba0d8 - as recommended here: https://github.com/apache/datafusion-comet/pull/383#discussion_r1600745021

From the logs for ubuntu-latest / java 17 / spark-3.4 / scala-2.12 - https://github.com/apache/datafusion-comet/suites/23883332179/logs?attempt=2 - the failures below can be seen.

Additionally, the same sample dates from above pass when run without the fuzz test (https://github.com/apache/datafusion-comet/pull/383/files#diff-41ecdd113d7a7afe33447e34f1ff0b5ed3033a89bfbcefa9e7e259d7a6e4daecR577-R585), yet the test report shows them as null on the Comet side in the comparison when the fuzz tests are included.

https://github.com/apache/datafusion-comet/actions/runs/9123082449/job/25132235801

  ![262142-01-01,2142-01-01]               [262142-01-01,null]
  ![262142-01-01 ,2142-01-01]              [262142-01-01 ,null]
  ![262142-01-01T ,2142-01-01]             [262142-01-01T ,null]
  ![262142-01-01T 123123123,2142-01-01]    [262142-01-01T 123123123,null]
   [263,null]                              [263,null]

If you search for 262142-01-01 in the logs, you can see it is reported as failing on lines 14035 to 14038, as above.

Similarly, if you search for the date -262143-12-31, see lines 10167 to 10171:

   [--262143-12-31,null]                   [--262143-12-31,null]
   [--262143-12-31T 1234 ,null]            [--262143-12-31T 1234 ,null]
  ![-262143-12-31,2144-12-31]              [-262143-12-31,null]
  ![-262143-12-31 ,2144-12-31]             [-262143-12-31 ,null]
  ![-262143-12-31T,2144-12-31]             [-262143-12-31T,null]
  ![-262143-12-31T ,2144-12-31]            [-262143-12-31T ,null]
  ![-262143-12-31T 123123123,2144-12-31]   [-262143-12-31T 123123123,null]

I have spent a fair amount of time trying to understand why this is happening, but am unable to identify the issue.

I have added some more samples from the fuzz dates to my current unit tests in the Rust tests and in CometCastSuite, and all of them pass locally without the fuzz test: https://github.com/apache/datafusion-comet/pull/383/commits/2b4c204df60487afe6258ddc0db5901c66427ee6

I have pushed this build with the fuzz test removed for now to see if the build passes, so I might need some help identifying the issue with the fuzz test. Apologies for taking your time on this again, and thank you for your help.

andygrove commented 3 weeks ago

I have pushed this build with the fuzz test removed for now to see if the build passes, so I might need some help identifying the issue with the fuzz test. Apologies for taking your time on this again, and thank you for your help.

Thanks @vidyasankarv. I plan on looking into this tomorrow. Overall, the PR looks good.

andygrove commented 3 weeks ago

@vidyasankarv I am also very confused... values that fail in the fuzz test work in the other test 🤔

I am debugging and will let you know when I get to the bottom of this mystery

andygrove commented 3 weeks ago

@vidyasankarv I figured out what the issue is.

I don't fully understand why, but when the fuzz test creates the DataFrame, the cast operation that gets performed is from a dictionary array, not a string array:

cast_array(from=Dictionary(Int32, Utf8), to_type=Date32)

This means that we are not even calling your native date_parser but instead falling through to this catchall logic:

_ => {
    // when we have no Spark-specific casting we delegate to DataFusion
    cast_with_options(&array, to_type, &CAST_OPTIONS)?
}

The solution is to add a specific match for casting dictionary to date:

            (
                DataType::Dictionary(key_type, value_type),
                DataType::Date32,
            ) if key_type.as_ref() == &DataType::Int32
                && (value_type.as_ref() == &DataType::Utf8
                || value_type.as_ref() == &DataType::LargeUtf8) =>
            {
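                // Unpack the dictionary to a plain string array first, then
                // delegate to the Spark-compatible string-to-date parser.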
                match value_type.as_ref() {
                    DataType::Utf8 => {
                        let unpacked_array =
                            cast_with_options(&array, &DataType::Utf8, &CAST_OPTIONS)?;
                        Self::cast_string_to_date(&unpacked_array, to_type, self.eval_mode)?
                    }
                    DataType::LargeUtf8 => {
                        let unpacked_array =
                            cast_with_options(&array, &DataType::LargeUtf8, &CAST_OPTIONS)?;
                        Self::cast_string_to_date(&unpacked_array, to_type, self.eval_mode)?
                    }
                    dt => unreachable!(
                        "{}",
                        format!("invalid value type {dt} for dictionary-encoded string array")
                    ),
                }
            },
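
For anyone following along, the standalone sketch below (using the arrow crate directly, outside of Comet, with made-up column values) shows how a dictionary-encoded string column reports its type and how the Arrow cast kernel unpacks it into a plain Utf8 array that a string-to-date parser can then consume.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, DictionaryArray, StringArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, Int32Type};

fn main() {
    // A dictionary-encoded string column, which is what the fuzz test's
    // DataFrame produced instead of a plain Utf8 array.
    let dict: DictionaryArray<Int32Type> =
        vec!["2020-01-01", "2020-01-01", "2024-01-01T"].into_iter().collect();
    let array: ArrayRef = Arc::new(dict);
    assert_eq!(
        array.data_type(),
        &DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8))
    );

    // Unpacking with the Arrow cast kernel yields a plain Utf8 array that a
    // string-to-date parser can iterate over directly.
    let unpacked = cast(&array, &DataType::Utf8).unwrap();
    let strings = unpacked.as_any().downcast_ref::<StringArray>().unwrap();
    assert_eq!(strings.value(2), "2024-01-01T");
}
```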

vidyasankarv commented 3 weeks ago

@andygrove thank you very much for looking into this. I tested the fuzz test with your suggestions and it is working now. I pushed the changes in the latest commit: https://github.com/apache/datafusion-comet/pull/383/commits/88af45c7757854cfc94e628e8338f299c2aec5d7.

vidyasankarv commented 3 weeks ago

@andygrove @parthchandra @kazuyukitanimura thank you for the reviews and support in helping me through my first open source contribution. It's been a great learning experience, and I'm still trying to grasp all the new things I learned from this seemingly simple good first issue: my first exposure to Rust, and seeing JNI in action for the interactions between Spark and Comet using Arrow. I hope to keep contributing, and I'm waiting for @andygrove's updated version of his How Query Engines Work. Thank you all.