GoogleCloudPlatform / data-analytics-golden-demo

An end to end demo of Google's Cloud data and analytic stack.
Apache License 2.0

How to reference #73

Status: Closed (MCaviezel closed this issue 1 year ago)

MCaviezel commented 1 year ago

Dear Adam,

Thanks a lot for sharing all your knowledge. Our team is working on using Iceberg with Dataproc and then connecting the Iceberg table to BigLake.

For us, it seems the "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" class is not being loaded properly. We also can't find a JAR that provides it. Can you tell us how you got the extensions to run?

https://github.com/GoogleCloudPlatform/data-analytics-golden-demo/blob/5c10931f58516827cfe2ee84e0cc550965e8d2a5/dataproc/convert_taxi_to_iceberg_create_tables.py#L39C9-L39C111

Thanks and best regards, Marco

AdamPaternostro commented 1 year ago

Hi Marco,

Sorry for the delayed reply. An Airflow job runs that Dataproc process, and the process needs the JAR files to be passed in:

https://github.com/GoogleCloudPlatform/data-analytics-golden-demo/blob/5c10931f58516827cfe2ee84e0cc550965e8d2a5/cloud-composer/dags/sample-iceberg-create-tables-update-data.py#L61

The JAR file is downloaded during the deployment:

https://github.com/GoogleCloudPlatform/data-analytics-golden-demo/blob/5c10931f58516827cfe2ee84e0cc550965e8d2a5/terraform-modules/deploy-files/tf-deploy-files.tf#L608

Let me know if that helps.
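
For readers landing here later, the following is a minimal sketch of the pattern described in the reply, not the code from the linked files. It builds a Dataproc PySpark job payload that attaches a JAR through `jar_file_uris`, together with Spark properties that turn on the Iceberg session extensions. The project, cluster, bucket, file, and catalog names are hypothetical placeholders, and the Hadoop-type catalog is an assumption made for illustration. The payload fields (`pyspark_job`, `main_python_file_uri`, `jar_file_uris`, `properties`) follow the Dataproc Job resource.

```python
# Sketch only: all names, URIs, and the catalog type are hypothetical placeholders.

ICEBERG_EXTENSIONS = (
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
)


def iceberg_spark_properties(catalog_name, warehouse_uri):
    """Return Spark properties enabling the Iceberg extensions and one catalog."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        "spark.sql.extensions": ICEBERG_EXTENSIONS,
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hadoop",  # assumption: a Hadoop-type catalog
        f"{prefix}.warehouse": warehouse_uri,
    }


def build_pyspark_job(project_id, cluster_name, main_py_uri, jar_uri, properties):
    """Return a Dataproc job payload with the JAR attached to the PySpark job."""
    return {
        "reference": {"project_id": project_id},
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {
            "main_python_file_uri": main_py_uri,
            # The extensions class can only load if its JAR is on the classpath.
            "jar_file_uris": [jar_uri],
            "properties": properties,
        },
    }


props = iceberg_spark_properties("local", "gs://example-bucket/iceberg-warehouse")
job = build_pyspark_job(
    project_id="example-project",
    cluster_name="example-cluster",
    main_py_uri="gs://example-bucket/pyspark-code/create_tables.py",
    jar_uri="gs://example-bucket/jars/iceberg-spark-runtime.jar",
    properties=props,
)
```

A dictionary like `job` is the kind of payload an Airflow Dataproc operator (for example `DataprocSubmitJobOperator`) accepts as its `job` argument. If the extensions class still fails to load, a reasonable first check is whether the JAR listed in `jar_file_uris` exists at that URI and matches the Spark and Scala versions of the cluster.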