apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[Feature] [Spark] Support Spark 4.0 (preview) #3940

Open YannByron opened 2 months ago

YannByron commented 2 months ago

Search before asking

Motivation

Support Spark 4.0 (preview1)

Solution

No response

Anything else?

No response

Are you willing to submit a PR?

YannByron commented 2 months ago

@ulysses-you we can discuss this here.

ulysses-you commented 2 months ago

Thank you @YannByron for the guidance.

I looked at Spark 4.0.0-preview; the main challenge is Scala 2.13. Others, like JDK 17 and interface changes, are not big issues.

For Scala 2.13, as far as I can see, the Spark community paid a huge cost to support it and drop Scala 2.12, and even now there are some performance regressions due to Scala 2.13, so I think it affects Paimon a lot.

Personally, I prefer to copy paimon-spark-common into a new module paimon-spark-4.0, so that we do not need to touch the code for previous Spark versions. We can then focus on supporting Spark 4.0.0 and higher versions (and may create paimon-spark-4-common if necessary).

cc @JingsongLi, what do you think?

awol2005ex commented 2 months ago

You can use "com.thoughtworks.enableIf" to support multiple Scala versions from a single source tree.
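
For example, a minimal sketch of what that could look like in a shared source file (the object and method names below are purely illustrative, not actual Paimon code; note that macro annotations also need the macro-paradise compiler plugin on Scala 2.12 and `-Ymacro-annotations` on Scala 2.13):

```scala
import com.thoughtworks.enableIf

// Hypothetical compatibility helper, only to illustrate the @enableIf pattern.
object SeqCompat {

  // Kept only when the compiling Scala version is 2.13.x:
  // the default Seq there is immutable, so an explicit conversion is needed.
  @enableIf(scala.util.Properties.versionNumberString.startsWith("2.13"))
  def toScalaSeq[A](xs: scala.collection.Seq[A]): Seq[A] = xs.toSeq

  // Kept only when the compiling Scala version is 2.12.x:
  // the default Seq is scala.collection.Seq, so the value can be returned as is.
  @enableIf(scala.util.Properties.versionNumberString.startsWith("2.12"))
  def toScalaSeq[A](xs: scala.collection.Seq[A]): Seq[A] = xs
}
```

The annotation's condition is evaluated at compile time, and definitions whose condition is false are simply dropped, so version-specific code for the other Scala version never has to typecheck.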

JingsongLi commented 1 month ago

Hi @ulysses-you @YannByron , I would like to ask whether paimon-spark-4-common and paimon-spark-common can reuse most of the code. I believe Spark 3 has very long-term support, and we also need to support Spark 4. If we end up copying a lot of code in this process, it will result in maintaining two separate codebases, which can be very costly. Therefore, my concern is whether we can reuse a significant portion of the code.

YannByron commented 1 month ago

Maybe we can allow paimon-spark-common to support both the scala.version and spark.version properties (Scala 2.12 and Spark 3.5.2 by default), which would make paimon-spark-common compatible with both Spark 3.5 and 4.x. Then we could provide a profile in the top-level pom to compile paimon-spark.

This approach doesn't allow compiling for both Spark 3.x and Spark 4.x at the same time, and we would have to modify things like CI. But it avoids copying code and allows more reuse.

Meanwhile, paimon-spark3-common and paimon-spark4-common can be derived from paimon-spark-common easily if required.

@JingsongLi @ulysses-you WDYT~

ulysses-you commented 1 month ago

The main issue with reusing the module, to me, is that we need to compile the Spark module twice for the different Scala versions. But I'm +1 for @YannByron's proposal if you are fine with it.

JingsongLi commented 1 month ago

@YannByron This approach is just like Flink with two Scala versions. I am OK with it~