awesome-kyuubi / hadoop-testing

Testing Sandbox for Hadoop Ecosystem Components

Add Hudi component into hadoop-testing #15

Closed. yanghua closed this issue 10 months ago.

pan3793 commented 10 months ago

Mixing data lake table formats seems to have issues, i.e. the extended Catalyst rules and SQL grammars conflict with each other.
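
For illustration, this is roughly what a mixed-format session looks like and where the conflicts come from: each format injects its own SQL grammar and Catalyst rules via spark.sql.extensions, and more than one also wants to replace spark_catalog. The Iceberg and Delta class names below are assumptions for the example, not something this repo configures:

# an assumed mixed-format session, for illustration only
spark-sql \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'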

yanghua commented 10 months ago

> Mixing data lake table formats seems to have issues, i.e. the extended Catalyst rules and SQL grammars conflict with each other.

Hi Mr Blue, what do you think about putting the pre-downloaded, table-format-specific dependencies into the default Ivy directory (spark.jars.ivy)?

So when users want to play with the Lakehouse suite, they can just run commands like the ones below (copied from the official Hudi Spark guide):

# for spark-shell:
spark-shell \
  --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'

# for spark-sql:
spark-sql \
  --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'

This avoids re-downloading from the remote Maven repository and does not contaminate other components.
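
If the cache is seeded at image-build time, a one-off warm-up run is enough. A minimal sketch, assuming spark-sql is on the PATH and the default Ivy location (or whatever spark.jars.ivy points to) is writable:

# warm-up at image-build time: resolve the bundle once so Ivy caches it locally;
# later --packages runs can then reuse the cache instead of hitting the remote repo
spark-sql \
  --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1 \
  -e 'SELECT 1'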

pan3793 commented 10 months ago

Maybe somewhere like /opt/hudi/xxx.jar? Then the user could run spark-sql --jars /opt/hudi/xxx.jar even offline.
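
Note that --jars only ships the bundle; the extension and catalog configs are still needed. A sketch of the full offline invocation, with a hypothetical jar name under /opt/hudi/:

# assumed layout: the Hudi bundle baked into the image under /opt/hudi/
# (the exact jar name below is illustrative)
spark-sql \
  --jars /opt/hudi/hudi-spark3.4-bundle_2.12-0.14.1.jar \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'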

yanghua commented 10 months ago

> Maybe somewhere like /opt/hudi/xxx.jar? Then the user could run spark-sql --jars /opt/hudi/xxx.jar even offline.

sounds good~

yanghua commented 10 months ago

One more thing: Hudi uses HFile as the internal format of its metadata table (for MOR tables). HFile comes from hbase-server (version 2.4.9), but hbase-server is built against Hadoop 2.x by default. So when it is integrated with Hadoop 3.x, a NoSuchMethodError occurs.

I have to rebuild HBase with the hadoop-3 profile, then rebuild hudi-spark-bundle against it.

More information is here: https://hudi.apache.org/docs/troubleshooting#how-can-i-resolve-the-nosuchmethoderror-from-hbase-when-using-hudi-with-metadata-table-on-hdfs
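
Roughly, the rebuild looks like the sketch below. The profile flags come from the HBase and Hudi build docs, but the exact tags and versions here are assumptions; the linked troubleshooting guide is authoritative:

# 1) rebuild HBase 2.4.9 against Hadoop 3 and install it into the local Maven repo
#    (-Dhadoop.profile=3.0 activates HBase's Hadoop 3 build profile)
git clone --branch rel/2.4.9 --depth 1 https://github.com/apache/hbase.git
cd hbase
mvn clean install -DskipTests -Dhadoop.profile=3.0

# 2) rebuild the Hudi Spark bundle so it picks up the locally built hbase-server
cd ..
git clone --branch release-0.14.1 --depth 1 https://github.com/apache/hudi.git
cd hudi
mvn clean package -DskipTests -Dspark3.4 -Dscala-2.12 \
  -pl packaging/hudi-spark-bundle -am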

So providing an out-of-the-box bundle would benefit more users.

pan3793 commented 10 months ago

Since it's not included in the classpath by default, a bundle is fine.