Closed: vidyasankarv closed this pull request 3 weeks ago.
This PR is still in progress. I added support for String to Date32.
- Spark supports dates in the formats `YYYY` and `YYYY-MM`, which DataFusion does not. Supported now.
- Spark supports a trailing `T`, as in `2024-01-01T`, which DataFusion does not. Supported now.
- DataFusion doesn't throw an exception for invalid inputs in ANSI mode. This PR now returns an error in ANSI mode if the date can't be parsed.
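To make the accepted shapes concrete, here is a minimal, dependency-free sketch (not the PR's actual implementation, which ports Spark's SparkDateTimeUtils) of a parser accepting the segment layouts listed above; the function name and tuple return type are illustrative only:

```rust
// Hypothetical sketch: parse a Spark-style date string into (year, month, day).
// Accepts "YYYY", "YYYY-MM", "YYYY-MM-DD", and an optional trailing "T..." suffix.
// Returns None for anything it cannot parse.
fn parse_spark_date(s: &str) -> Option<(i32, u32, u32)> {
    let s = s.trim();
    // Everything from a trailing 'T' onward is ignored (e.g. "2024-01-01T").
    let date_part = s.split('T').next()?;
    let segments: Vec<&str> = date_part.split('-').collect();
    if segments.len() > 3 {
        return None;
    }
    let year: i32 = segments[0].parse().ok()?;
    // A missing month or day defaults to 1, so "2024" means 2024-01-01.
    let month: u32 = if segments.len() > 1 { segments[1].parse().ok()? } else { 1 };
    let day: u32 = if segments.len() > 2 { segments[2].parse().ok()? } else { 1 };
    if !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    Some((year, month, day))
}

fn main() {
    assert_eq!(parse_spark_date("2024"), Some((2024, 1, 1)));
    assert_eq!(parse_spark_date("2024-03"), Some((2024, 3, 1)));
    assert_eq!(parse_spark_date("2024-01-01T"), Some((2024, 1, 1)));
    assert_eq!(parse_spark_date("not-a-date"), None);
}
```

Note this sketch deliberately skips leading-sign handling and full per-month day validation, which the real implementation has to get right.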
Hi @parthchandra @andygrove Can you please review if this is going in the right direction.
Questions:
I need some pointers on the fuzz tests. If I understand correctly:
- The "-0973250" and "-3638-5" fuzz-test inputs in LEGACY mode should return the values mentioned in Implement Spark-compatible CAST from String to Date #327; currently LEGACY mode returns null.
- Do we have to support Date64 as well?
@vidyasankarv Yes, I would say this is going in a good direction based on a very quick review. I will try and find more time tomorrow for a deeper look. To answer your questions:
> "-0973250", "-3638-5" fuzz tests in Legacy mode should return values as mentioned in Implement Spark-compatible CAST from String to Date #327 - currently legacy mode returns null.
You're pretty close to handling these cases (see my comment in the review)
Hi @parthchandra @andygrove, I made the changes as suggested and ported the date parsing logic from SparkDateTimeUtils. The previous version read much simpler, though it was missing a couple of features, such as handling trailing characters like extra spaces.
This PR does not support the fuzz tests in CometCastSuite for the String to Date test, because NaiveDate only supports dates in the range below, and the dates generated by the fuzz test for matching with Spark fall outside this range, causing a mismatch between results.
```rust
/// The minimum possible `NaiveDate` (January 1, 262145 BCE).
#[deprecated(since = "0.4.20", note = "Use NaiveDate::MIN instead")]
pub const MIN_DATE: NaiveDate = NaiveDate::MIN;
/// The maximum possible `NaiveDate` (December 31, 262143 CE).
#[deprecated(since = "0.4.20", note = "Use NaiveDate::MAX instead")]
pub const MAX_DATE: NaiveDate = NaiveDate::MAX;
```
In ANSI mode, any format-validation failure returns an error, whereas the other modes return None. However, when all validations pass but the date is beyond the range supported by NaiveDate, all modes return None.
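That behaviour can be summarised in a small sketch; `EvalMode`, `ParseResult`, and the function names below are illustrative stand-ins, not Comet's actual API:

```rust
// Illustrative sketch of the eval-mode semantics described above.
#[derive(Debug, PartialEq)]
enum EvalMode { Ansi, Legacy, Try }

enum ParseResult { Ok(i32), OutOfRange, Invalid }

// Stub parser so the sketch is self-contained; the real PR ports SparkDateTimeUtils.
fn try_parse_days_since_epoch(s: &str) -> ParseResult {
    match s {
        "1970-01-02" => ParseResult::Ok(1),
        "262142-01-01" => ParseResult::OutOfRange, // well-formed, beyond NaiveDate
        _ => ParseResult::Invalid,
    }
}

fn cast_string_to_date(s: &str, mode: EvalMode) -> Result<Option<i32>, String> {
    match try_parse_days_since_epoch(s) {
        ParseResult::Ok(days) => Ok(Some(days)),
        // Well-formed but outside NaiveDate's range: every mode returns NULL.
        ParseResult::OutOfRange => Ok(None),
        // Malformed input: only ANSI mode raises an error.
        ParseResult::Invalid if mode == EvalMode::Ansi => {
            Err(format!("cannot cast {} to DateType", s))
        }
        ParseResult::Invalid => Ok(None),
    }
}

fn main() {
    assert_eq!(cast_string_to_date("1970-01-02", EvalMode::Legacy), Ok(Some(1)));
    assert_eq!(cast_string_to_date("262142-01-01", EvalMode::Ansi), Ok(None));
    assert!(cast_string_to_date("bogus", EvalMode::Ansi).is_err());
    assert_eq!(cast_string_to_date("bogus", EvalMode::Try), Ok(None));
}
```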
@parthchandra Regarding these points:

> I honestly don't know what a string with a 'plus/minus' at the beginning of the date even means but you might want to handle that case

Found something here.

> Also, the max number of digits allowed for the year is 7

Supporting dates with year values of up to 7 digits would be beyond NaiveDate's support, so this will be a point of difference between native Spark and Spark with Comet.

Can you please take another look at the PR. Thank you.
Thanks @vidyasankarv. I plan on carefully reviewing this later today.
Hi @andygrove, I need some help: https://github.com/apache/datafusion-comet/pull/383/files#diff-41ecdd113d7a7afe33447e34f1ff0b5ed3033a89bfbcefa9e7e259d7a6e4daecR593
These particular test cases result in failure. For example, when I run with the sample input 2020-10-010T:
- the same test in Rust returns a CometError in ANSI mode: https://github.com/apache/datafusion-comet/pull/383/files#diff-b7339cca414a6315488506dd33654946f62c229feb8ad0d4abeda683ca75b4b5R1717
However, when CometCastSuite runs the String to Date test, it fails for the combination of Comet ANSI enabled without try_cast.
I have tried to debug both sides concurrently in CLion (https://github.com/apache/datafusion-comet/blob/main/docs/source/contributor-guide/debugging.md), but the breakpoints show as disabled and aren't being hit; I tried switching to LLDB in the CLion toolchain too, but no help. I added some additional logging locally and can see that it returns a CometError for the invalid value on the Rust side as expected, but returns None on the Comet side when running from the Scala test suite.
Is there something else I could be missing in terms of configuration? I appreciate any help when you get some time.
Thank you
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 34.17%. Comparing base (14494d3) to head (ae575e5). Report is 11 commits behind head on main.
There is one test failure with JDK 8 / Spark 3.2:
```
- cast StringType to DateType *** FAILED *** (349 milliseconds)
  "[CAST_INVALID_INPUT] The value '0' of the type "STRING" cannot be cast to "DATE" because it is malformed. Correct the value as per the syntax, or change its target type. Use `try_cast` to tolerate malformed input and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error." did not contain "annot cast 0 to DateType." (CometCastSuite.scala:1069)
```
@vidyasankarv I would suggest that we skip the test for now when running against Spark 3.2 and file a follow-on issue to fix 3.2 compatibility (this may not be a high priority since 3.2 is quite old and we should consider dropping support for it at some point).
You can add an `assume` call to the test to skip it under certain criteria:

```scala
test("cast StringType to DateType") {
  assume(CometSparkSessionExtensions.isSpark33Plus)
```

It would be good to add a comment in there as well with a link to the follow-on issue (could you file that?)
@andygrove thank you for the suggestions. I filed this issue, https://github.com/apache/datafusion-comet/issues/440, and linked it in the test.
@andygrove This build included the fuzz test for cast String to DateType (https://github.com/apache/datafusion-comet/pull/383/commits/19fa952832e1c9669371f291df70f13a8c6ba0d8), as recommended here: https://github.com/apache/datafusion-comet/pull/383#discussion_r1600745021
From the logs for ubuntu-latest/java 17-spark-3.4-scala-2.12: https://github.com/apache/datafusion-comet/suites/23883332179/logs?attempt=2
Additionally, the same sample dates from above pass when the fuzz test is excluded (https://github.com/apache/datafusion-comet/pull/383/files#diff-41ecdd113d7a7afe33447e34f1ff0b5ed3033a89bfbcefa9e7e259d7a6e4daecR577-R585), yet the test report shows them as null on the Comet side in the comparison when the fuzz tests are included.
https://github.com/apache/datafusion-comet/actions/runs/9123082449/job/25132235801
```
![262142-01-01,2142-01-01] [262142-01-01,null]
![262142-01-01 ,2142-01-01] [262142-01-01 ,null]
![262142-01-01T ,2142-01-01] [262142-01-01T ,null]
![262142-01-01T 123123123,2142-01-01] [262142-01-01T 123123123,null]
[263,null] [263,null]
```

If you search for 262142-01-01 in the logs, you can see it reported as failing on lines 14035 to 14038, as above. Similarly, if you search for the dates -262143-12-31, on lines 10167 to 10171:

```
[--262143-12-31,null] [--262143-12-31,null]
[--262143-12-31T 1234 ,null] [--262143-12-31T 1234 ,null]
![-262143-12-31,2144-12-31] [-262143-12-31,null]
![-262143-12-31 ,2144-12-31] [-262143-12-31 ,null]
![-262143-12-31T,2144-12-31] [-262143-12-31T,null]
![-262143-12-31T ,2144-12-31] [-262143-12-31T ,null]
![-262143-12-31T 123123123,2144-12-31] [-262143-12-31T 123123123,null]
```
I have spent a fair amount of time trying to understand why this is happening, but am unable to identify the issue.
I have added some more samples of the fuzz dates to my current Rust unit tests and to CometCastSuite, and all of them pass without the fuzz test locally: https://github.com/apache/datafusion-comet/pull/383/commits/2b4c204df60487afe6258ddc0db5901c66427ee6
So I have pushed a build with the fuzz test removed for now to see if the build passes, and I might need some help identifying the issue with the fuzz test. Apologies for taking your time on this again, and thank you for your help.
Thanks @vidyasankarv. I plan on looking into this tomorrow. Overall, the PR looks good.
@vidyasankarv I am also very confused .. values that fail in the fuzz test work in the other test 🤔
I am debugging and will let you know when I get to the bottom of this mystery
@vidyasankarv I figured out what the issue is.
I don't fully understand why, but when the fuzz test creates the DataFrame, the cast operation that gets performed is from a dictionary array not a string array:
```
cast_array(from=Dictionary(Int32, Utf8), to_type=Date32)
```
This means that we are not even calling your native date_parser but instead falling through to this catchall logic:
```rust
_ => {
    // when we have no Spark-specific casting we delegate to DataFusion
    cast_with_options(&array, to_type, &CAST_OPTIONS)?
}
```
The solution is to add a specific match for casting dictionary to date:
(
DataType::Dictionary(key_type, value_type),
DataType::Date32,
) if key_type.as_ref() == &DataType::Int32
&& (value_type.as_ref() == &DataType::Utf8
|| value_type.as_ref() == &DataType::LargeUtf8) =>
{
match value_type.as_ref() {
DataType::Utf8 => {
let unpacked_array =
cast_with_options(&array, &DataType::Utf8, &CAST_OPTIONS)?;
Self::cast_string_to_date(&unpacked_array, to_type, self.eval_mode)?
}
DataType::LargeUtf8 => {
let unpacked_array =
cast_with_options(&array, &DataType::LargeUtf8, &CAST_OPTIONS)?;
Self::cast_string_to_date(&unpacked_array, to_type, self.eval_mode)?
}
dt => unreachable!(
"{}",
format!("invalid value type {dt} for dictionary-encoded string array")
),
}
},
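The underlying issue is easy to model: a dictionary-encoded string column stores each distinct value once plus an index per row, so a match arm expecting a plain `Utf8` array never fires for it. A toy sketch of the unpack-then-parse step (plain Rust, no Arrow; the struct and function names are illustrative only):

```rust
// Toy model of a dictionary-encoded string column: distinct values stored once,
// plus one key per row. Arrow's Dictionary(Int32, Utf8) has the same shape.
struct DictArray {
    values: Vec<String>,    // distinct strings
    keys: Vec<Option<i32>>, // one entry per row; None = null
}

// "Unpack" the dictionary back into a plain string column, mirroring the role of
// the cast to Utf8 before the per-row date parser runs.
fn unpack(dict: &DictArray) -> Vec<Option<String>> {
    dict.keys
        .iter()
        .map(|k| k.map(|i| dict.values[i as usize].clone()))
        .collect()
}

fn main() {
    let dict = DictArray {
        values: vec!["2024-01-01".to_string(), "2024-06-15".to_string()],
        keys: vec![Some(0), Some(1), Some(0), None],
    };
    let unpacked = unpack(&dict);
    assert_eq!(unpacked[2].as_deref(), Some("2024-01-01"));
    assert_eq!(unpacked[3], None);
    // Only after unpacking does every row hold a plain string (or null) that a
    // string-to-date parser can consume.
}
```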
@andygrove thank you very much for looking into this. I tested the fuzz test with your suggestions and it is working now; I pushed the changes in the latest commit: https://github.com/apache/datafusion-comet/pull/383/commits/88af45c7757854cfc94e628e8338f299c2aec5d7.
@andygrove @parthchandra @kazuyukitanimura thank you for the reviews and for supporting me through my first open source contribution. It's been a great learning experience, and I'm still trying to grasp all the new things I learnt from this seemingly simple good first issue: my first exposure to Rust, and seeing JNI in action for the interactions between Spark and Comet using Arrow. I hope to keep contributing, and I'm waiting for @andygrove's updated version of his How Query Engines Work. Thank you all.
## Which issue does this PR close?

Closes #327.

## Rationale for this change

## What changes are included in this PR?

## How are these changes tested?