Closed pan3793 closed 2 years ago
may I know which hudi bundle or artifact you are using? With 0.10.1, for Spark 3, the bundle names have changed: hudi-spark3.0.3-bundle, hudi-spark3.1.2-bundle.
In the previous release it was hudi-spark3-bundle*.
@nsivabalan thanks for your reply.
may I know which hudi bundle or artifact you are using?
We use the vanilla jars instead of the bundle jar because:

1. The Hudi bundle jar name contains the exact Spark patch version, e.g. hudi-spark3.1.2-bundle*. If we choose it, what happens when we want to upgrade Spark to 3.1.3 (currently in the voting phase)? Do we need to wait for, or ask, the Hudi community to publish a hudi-spark3.1.3-bundle* jar?
2. The Hudi bundle jar contains lots of classes from transitive dependencies WITHOUT relocation, which creates a high risk of class conflicts if the user also provides the original jars, e.g. kotlin, curator.
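Since the bundled classes are not relocated, a conflict shows up as the same class file name appearing in two jars on the classpath. A minimal sketch of how one could check for that, using only the JDK and Scala standard libraries (the JarConflicts object and its method names are hypothetical, not part of Hudi):

```scala
import java.util.zip.ZipFile
import scala.jdk.CollectionConverters._

object JarConflicts {
  // Collect the .class entry names contained in one jar file.
  def classEntries(jarPath: String): Set[String] = {
    val zip = new ZipFile(jarPath)
    try zip.entries().asScala.map(_.getName).filter(_.endsWith(".class")).toSet
    finally zip.close()
  }

  // Class names present in both jars -- each one is a potential conflict.
  def conflicts(a: Set[String], b: Set[String]): Set[String] = a intersect b
}
```

Running `conflicts(classEntries(bundleJar), classEntries(curatorJar))` against an unrelocated bundle would surface duplicated entries such as `org/apache/curator/...`.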
I think Hudi has room to improve the bundle jar to reduce the dependency maintenance effort for users and downstream projects. Compared to other data lake formats: Delta restricts itself from introducing dependencies beyond Spark; delta-core has only one transitive dependency, jackson-core-asl, which is not included in the Spark runtime jars. Iceberg provides runtime jars, which are similar to the Hudi bundle jars but differ in two ways:

1. Bundled third-party classes such as curator are relocated under the org.apache.iceberg package to avoid potential class conflicts with user classes.
2. The runtime jar name tracks the Spark minor version rather than the patch version: iceberg-spark-runtime-0.13.0.jar for Spark 2.4.x, iceberg-spark3-runtime-0.13.0.jar for Spark 3.0.x, iceberg-spark-runtime-3.1_2.12-0.13.0.jar for Spark 3.1.x, and iceberg-spark-runtime-3.2_2.12-0.13.0.jar for Spark 3.2.x.

@xushiyan: Can you follow up here please? These look like good suggestions that can be taken into consideration. I will let you drive this; let me know if you need to jam.
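The minor-version-based naming scheme Iceberg uses can be sketched as a simple lookup. The artifact names below are the ones listed above; the IcebergRuntime helper itself is hypothetical, added only to illustrate why a Spark patch upgrade needs no new release:

```scala
// Hypothetical helper: map a full Spark version to the matching
// Iceberg 0.13.0 runtime artifact, keyed on the Spark minor version.
object IcebergRuntime {
  def artifactFor(sparkVersion: String): Option[String] = {
    // Keep only the "major.minor" part, e.g. "3.1.3" -> "3.1"
    val minor = sparkVersion.split('.').take(2).mkString(".")
    minor match {
      case "2.4" => Some("iceberg-spark-runtime-0.13.0.jar")
      case "3.0" => Some("iceberg-spark3-runtime-0.13.0.jar")
      case "3.1" => Some("iceberg-spark-runtime-3.1_2.12-0.13.0.jar")
      case "3.2" => Some("iceberg-spark-runtime-3.2_2.12-0.13.0.jar")
      case _     => None // no published runtime for this Spark line
    }
  }
}
```

Note that Spark 3.1.2 and 3.1.3 map to the same artifact, so users can take a Spark patch release without waiting for a new Iceberg build.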
@pan3793 : are you folks still blocked on this?
No progress yet.
@pan3793 thanks for the feedback. To look into this issue, can you post which vanilla jars you used? Since this is a Spark version conflict, it'll help us analyze.
Looping in @XuQianJin-Stars here. Maybe we can start driving an epic for this topic: improve dependency bundling, based on the feedback above.
@pan3793 did you build the vanilla jars with the Spark 3 profile? There seems to be a Spark version mismatch.
- get tables *** FAILED ***
java.sql.SQLException: Error operating EXECUTE_STATEMENT: java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.catalog.CatalogTable.copy(Lorg/apache/spark/sql/catalyst/TableIdentifier;Lorg/apache/spark/sql/catalyst/catalog/CatalogTableType;Lorg/apache/spark/sql/catalyst/catalog/CatalogStorageFormat;Lorg/apache/spark/sql/types/StructType;Lscala/Option;Lscala/collection/Seq;Lscala/Option;Ljava/lang/String;JJLjava/lang/String;Lscala/collection/immutable/Map;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/collection/Seq;ZZLscala/collection/immutable/Map;)Lorg/apache/spark/sql/catalyst/catalog/CatalogTable;
at org.apache.spark.sql.hudi.command.CreateHoodieTableCommand$.createTableInCatalog(CreateHoodieTableCommand.scala:136)
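A NoSuchMethodError like the one above typically means the calling code (here, Hudi's CreateHoodieTableCommand) was compiled against a CatalogTable.copy signature that differs from the one actually on the runtime classpath, since Scala case classes change their generated copy method whenever fields are added or removed between Spark versions. A hedged diagnostic sketch (the MethodProbe helper is hypothetical; the class name comes from the stack trace):

```scala
object MethodProbe {
  // Parameter counts of the public overloads of `method` visible on
  // `className` at runtime; an empty set means no such method is linked,
  // which is exactly the situation behind a NoSuchMethodError.
  def arities(className: String, method: String): Set[Int] =
    Class.forName(className).getMethods
      .filter(_.getName == method)
      .map(_.getParameterCount)
      .toSet
}
```

Calling `MethodProbe.arities("org.apache.spark.sql.catalyst.catalog.CatalogTable", "copy")` inside the failing environment would reveal whether the copy overload Hudi expects actually exists on the Spark version in use.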
@YannByron do you happen to know what might be mismatched here?
@xushiyan thanks for helping, and sorry I didn't notice your first reply.
can you post which vanilla jars you used?
Basically, we use the following Hudi-related jars; you can check more details in our project source code: https://github.com/apache/incubator-kyuubi
did you build the vanilla jars with spark 3 profile?
No, we use the jars published officially by Hudi. They work fine with Hudi 0.10.0, but not with Hudi 0.10.1.
s"""
| create table $table (
| id int,
| name string,
| price double,
| ts long
| ) using $format
| options (
| primaryKey = 'id',
| preCombineField = 'ts',
| hoodie.bootstrap.index.class =
| 'org.apache.hudi.common.bootstrap.index.NoOpBootstrapIndex'
| )
""".stripMargin
See, it's Hudi 0.10.1 and Spark 3.1.2 in https://github.com/apache/incubator-kyuubi/pull/1897/files. My test is OK in the same env, but I use the bundle jar. In addition to the vanilla jar, what other Hudi jars did you put in the Spark env?
And is there the same problem if you use spark-sql instead of Kyuubi? @pan3793
@YannByron
what other hudi jars you put in the spark env
is there the same problem if you use spark-sql instead of kyuubi
Basically, the component kyuubi-spark-sql-engine
is just a Spark application which starts a thrift server in the Spark driver process and exposes an interface compatible with the HiveServer2 thrift protocol. In this case, it just plays the role of forwarding the user SQL to spark.sql(xxx).
I don't think the result will be different from running SQL using spark.sql(xxx)
directly, but let me try.
@YannByron FYI, after switching to spark.sql(xxx),
the SQL failed with the same error.
test("hudi 0.10.1") {
val spark = SparkSession.builder()
.config("spark.sql.catalogImplementation", "in-memory")
.config("spark.sql.defaultCatalog", "spark_catalog")
.config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.getOrCreate()
spark.sql(
s"""
| create table hudi_tbl (
| id int,
| name string,
| price double,
| ts long
| ) using hudi
| options (
| primaryKey = 'id',
| preCombineField = 'ts',
| hoodie.bootstrap.index.class =
| 'org.apache.hudi.common.bootstrap.index.NoOpBootstrapIndex'
| )
""".stripMargin)
}
Looks like the test got fixed after upgrading to 0.11.0. https://github.com/apache/incubator-kyuubi/commit/cb5f49e3e9bf4afed100e756302b69879faf5e61 Please reopen if you still see the issue.
Describe the problem you faced
The Apache Kyuubi (Incubating) Hudi integration test broke after upgrading from 0.10.0 to 0.10.1.
https://github.com/apache/incubator-kyuubi/runs/5152924363
To Reproduce
The TEST CASE
https://github.com/apache/incubator-kyuubi/pull/1897
Expected behavior
Test case passes, the same as with Hudi 0.10.0.
Environment Description
Hudi version : 0.10.1
Spark version : 3.1.2
Hive version : 2.3.7
Hadoop version : 3.3.1
Storage (HDFS/S3/GCS..) : Local File System
Running on Docker? (yes/no) : Not sure; it failed in a GitHub Actions run.
Additional context
Stacktrace