apache / incubator-xtable

Apache XTable (incubating) is a cross-table converter for lakehouse table formats that facilitates interoperability across data processing systems and query engines.
https://xtable.apache.org/
Apache License 2.0

Adding in demo of xtable with S3 + HMS converting hudi 0.14.1 to delta lake and iceberg #459

Open alberttwong opened 3 weeks ago

alberttwong commented 3 weeks ago

Demo for https://github.com/apache/incubator-xtable/issues/338

What is the purpose of the pull request

This PR uses the xtable docker demo as a base and modifies it to work with S3. It is an end-to-end example with a README.

Brief change log

  1. added minio container images to provide an object store
  2. changed the HMS image to the Starburst HMS image, because Starburst already has the S3 libraries built into the image
  3. built a custom spark 3.4 container image based on JDK 11 with hadoop 2.10.2 and hive 2.3.10 installed (2.3.1 can't be used because of a hive 2.3.1 bug). Available at https://hub.docker.com/r/atwong/openjdk-11-spark-3.4-hive-2.3.10-hadoop-2.10.2 if you don't want to build it.
  4. git clone hudi and build it with mvn under JDK 8 to get the hudi-hive-sync jars (you can skip this step by using hudi-hive-sync-bundle from mvnrepository.com)
  5. added the libraries that were missing for running run_sync_tool.sh: https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-s3, https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws, https://mvnrepository.com/artifact/com.esotericsoftware/kryo-shaded/4.0.2, https://mvnrepository.com/artifact/org.apache.parquet/parquet-avro, https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client
  6. modified the iceberg, hudi and delta Trino catalog configurations to support S3 bucket lookups
  7. added core-site.xml to inject parameters into xtable, and modified /etc/hadoop/core-site.xml to inject parameters into the hudi-hive-sync tool
  8. modified the pyspark demo script to include S3 configs
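
To illustrate items 7 and 8, here is a minimal sketch (not the actual files from this PR) of how one set of Hadoop S3A settings can feed both a core-site.xml and a pyspark session. The property names are standard Hadoop S3A keys, and the `spark.hadoop.` prefix is how Spark forwards a setting to Hadoop. The endpoint `http://minio:9000`, the credentials, and the helper function names are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET


def s3a_properties(endpoint, access_key, secret_key):
    """Return Hadoop S3A settings for an S3-compatible store such as minio."""
    return {
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # minio is usually addressed by path (host/bucket), not by virtual host
        "fs.s3a.path.style.access": "true",
        "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }


def to_core_site_xml(props):
    """Render the properties in the Hadoop core-site.xml layout."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


def to_spark_conf(props):
    """Prefix each key with spark.hadoop. so Spark passes it on to Hadoop."""
    return {f"spark.hadoop.{key}": value for key, value in props.items()}


# Placeholder values (assumptions), not the ones used in the demo
props = s3a_properties("http://minio:9000", "admin", "password")
core_site = to_core_site_xml(props)
spark_conf = to_spark_conf(props)
```

In a pyspark script, each entry of `spark_conf` could then be applied with `SparkSession.builder.config(key, value)` before the session is created.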

Verify this pull request

daragu commented 3 weeks ago

Hi @alberttwong, your contribution is excellent. One suggestion: could we have a demos folder as the parent folder?

demos/
  demo-local/
  demo-s3/

xtable-bot commented 3 weeks ago

CI report:

alberttwong commented 3 weeks ago

@daragu I think that's possible. I didn't change it because there are links to that demo folder from the xtable docs and other places.