Closed yanghua closed 10 months ago
Mixing data lake table formats seems to cause issues, i.e. the extended Catalyst rules and SQL grammars conflict with each other.
Hi Mr Blue, what do you think about putting the pre-downloaded, table-format-specific dependencies into the default Ivy directory (spark.jars.ivy)?
Then, when users want to try out the Lakehouse suite, they can just run commands like the ones below (copied from the official Hudi Spark guide):
# for spark shell:
spark-shell --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
# for spark sql:
spark-sql --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
This avoids re-downloading from the remote Maven repository and would not contaminate other components.
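A rough sketch of the idea, in case it helps: point `spark.jars.ivy` at a directory that already holds the resolved artifacts, so `--packages` is served from that cache. The path `/opt/hudi/ivy`, the `SPARK_SQL` override, and the function name are my own placeholders, not anything from the image. Whether Ivy stays fully offline depends on the cache being complete, so treat this as untested.

```shell
# Hypothetical cache location; any directory baked into the image would do.
HUDI_IVY_DIR="${HUDI_IVY_DIR:-/opt/hudi/ivy}"
HUDI_PACKAGE="org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1"
# Launcher binary; overridable so the command line can be inspected without Spark.
SPARK_SQL="${SPARK_SQL:-spark-sql}"

# Start spark-sql with the Hudi settings from the guide, resolving
# --packages against the Ivy directory named by spark.jars.ivy.
hudi_sql_with_ivy_cache() {
  "$SPARK_SQL" \
    --packages "$HUDI_PACKAGE" \
    --conf "spark.jars.ivy=$HUDI_IVY_DIR" \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
    --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
    --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar' \
    "$@"
}
```

Running `hudi_sql_with_ivy_cache` once with network access would warm the cache; later runs reuse whatever is already under that directory.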
Maybe somewhere like /opt/hudi/xxx.jar? Then the user could use spark-sql --jars /opt/hudi/xxx.jar even offline.
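Something like the following is what I have in mind, as a sketch only. It reuses the Hudi settings from the guide and swaps `--packages` for `--jars`, so nothing has to be fetched from Maven. `/opt/hudi/xxx.jar` is still just the placeholder path from above, and the `SPARK_SQL` override and function name are made up for illustration.

```shell
# Placeholder path for the pre-downloaded bundle (the real file name is not decided yet).
HUDI_JAR="${HUDI_JAR:-/opt/hudi/xxx.jar}"
# Launcher binary; overridable so the command line can be inspected without Spark.
SPARK_SQL="${SPARK_SQL:-spark-sql}"

# Start spark-sql with the local bundle on the classpath instead of --packages.
hudi_sql_offline() {
  "$SPARK_SQL" \
    --jars "$HUDI_JAR" \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
    --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
    --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar' \
    "$@"
}
```

Extra arguments are passed through, so `hudi_sql_offline -e 'SHOW TABLES'` would work the same way as calling spark-sql directly.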
sounds good~
One more thing: Hudi depends on HFile as the internal format of its metadata store (MOR table). HFile belongs to hbase-server (version 2.4.9). However, hbase-server depends on Hadoop 2.x, so when it is integrated with Hadoop 3.x, it throws a NoSuchMethodError.
I have to rebuild HBase with the hadoop-3 profile, then rebuild hudi-spark-bundle.
More information is here: https://hudi.apache.org/docs/troubleshooting#how-can-i-resolve-the-nosuchmethoderror-from-hbase-when-using-hudi-with-metadata-table-on-hdfs
So providing an out-of-the-box bundle would benefit more users.
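For reference, the rebuild looks roughly like this. The Maven flags are as I recall them from the troubleshooting page linked above, so please check them there before relying on this. The tag names, the `WORKDIR` location, and the function names are assumptions of mine.

```shell
# Hypothetical working directory for the source checkouts.
WORKDIR="${WORKDIR:-$HOME/src}"
# Assumed release tag names; adjust if the repositories use different ones.
HBASE_TAG="${HBASE_TAG:-rel/2.4.9}"
HUDI_TAG="${HUDI_TAG:-release-0.14.1}"

# Step 1: build HBase 2.4.9 against Hadoop 3 and install it into the local Maven repository.
rebuild_hbase_for_hadoop3() {
  git clone --branch "$HBASE_TAG" --depth 1 https://github.com/apache/hbase.git "$WORKDIR/hbase" &&
  (cd "$WORKDIR/hbase" &&
   mvn clean install -Denforcer.skip -DskipTests -Dhadoop.profile=3.0 -Psite-install-step)
}

# Step 2: rebuild the Hudi Spark bundle so it picks up the locally installed HBase artifacts.
rebuild_hudi_spark_bundle() {
  git clone --branch "$HUDI_TAG" --depth 1 https://github.com/apache/hudi.git "$WORKDIR/hudi" &&
  (cd "$WORKDIR/hudi" &&
   mvn clean package -DskipTests -Dspark3.4 -Dscala-2.12)
}
```

Calling `rebuild_hbase_for_hadoop3 && rebuild_hudi_spark_bundle` runs both steps in order; the second step only helps if the first one has installed the Hadoop 3 build of HBase locally.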
Since it's not included in the classpath by default, a bundle is fine.