apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0
823 stars 163 forks source link

feat: support array_insert #1073

Open SemyonSinchenko opened 1 week ago

SemyonSinchenko commented 1 week ago

Which issue does this PR close?

Related to #1042

array_insert: SELECT array_insert(array(1, 2, 3, 4), 5, 5)

Rationale for this change

As described in #1042

What changes are included in this PR?

How are these changes tested?

At the moment I added a simple tests for fn array_insert and a test for QueryPlanSerde.

SemyonSinchenko commented 1 week ago

As I was able to realize, array_insert does not supported in datafusion. Is the list.rs a good place to have an implementation of ArrayInsert and PhysicalExpr for it?

SemyonSinchenko commented 1 week ago

@andygrove Sorry for tagging but I have questions about the ticket (array_insert).

  1. [RESOLVED] array_insert was added in spark 3.4, so all the 3.3.x tests are obviously failed. I checked and it looks like the EoL for 3.3 is about the end of 2024. Technically I think I can try to workaround tests in 3.3.x by reflection API, my question is mostly should I do it due to soon EoL of the 3.3.x?
  2. array_insert is not supported in DataFusion. I made an implementation (and it looks like it works, except negative indices and corner cases). Is the list.rs a good place for it? Or should I move my code somewhere else?
  3. Spark does not support anything except Int32 for position argument, is it OK if I will support only int32 too? In theory, other types can be supported too, but I'm still trying to realize how to achieve it and it may become complex...

Thanks in advance! That is my first serious attempt to contribute to the project, so sorry If I'm annoying.

andygrove commented 1 week ago

Thanks, @SemyonSinchenko. I think it's fine to skip the test for Spark 3.3. I plan on reviewing this PR in more detail tomorrow, but it looks good from an initial read.

codecov-commenter commented 1 week ago

Codecov Report

Attention: Patch coverage is 59.09091% with 9 lines in your changes missing coverage. Please review.

Project coverage is 34.27%. Comparing base (845b654) to head (8b58d8d). Report is 18 commits behind head on main.

Files with missing lines Patch % Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala 59.09% 7 Missing and 2 partials :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #1073 +/- ## ============================================ - Coverage 34.46% 34.27% -0.20% Complexity 888 888 ============================================ Files 113 113 Lines 43580 43355 -225 Branches 9658 9488 -170 ============================================ - Hits 15021 14860 -161 - Misses 25507 25596 +89 + Partials 3052 2899 -153 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

SemyonSinchenko commented 5 days ago

This PR is ready for the review. The only failed check failed due to the internal GHA error:

GitHub Actions has encountered an internal error when running your job.
NoeB commented 2 days ago

I am not sure if this should be done together with this PR but it would some add "free" tests. Spark introduced with 3.5 array_prepend which it implements with array_insert and starting 4.0 it also implements array_append with array_insert. If you want you can copy the array_append tests and replace array_append with array_prepend for Spark 3.5+ and enable the array_append tests for spark 4.0. You can also ignore this comment if you do not agree or if it leads to unrelated errors.

SemyonSinchenko commented 1 day ago

Thanks for comments and suggestions!

This PR is ready for the review again.

What were changed from the last round of the review:

These tests pointed my attention to the uncovered parts of the code that I fixed and I also added couple of additional tests to the native part. I revisited how spark do array_insert and it is more tricky than I realized at the first glance. I added few additional comments to the native part of the implementation and fixed the behavior.

At the moment this PR is tested in multiple ways:

So, it looks to me, that all the possible cases are covered and the behavior is the same like in spark.

SemyonSinchenko commented 1 day ago

I am not sure if this should be done together with this PR but it would some add "free" tests. Spark introduced with 3.5 array_prepend which it implements with array_insert and starting 4.0 it also implements array_append with array_insert. If you want you can copy the array_append tests and replace array_append with array_prepend for Spark 3.5+ and enable the array_append tests for spark 4.0. You can also ignore this comment if you do not agree or if it leads to unrelated errors.

Done!