Closed — alberttwong closed this 7 months ago
The example shows only a 3-row insert. I wanted to show importing a large parquet file.
For future reference:
```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import scala.collection.JavaConversions._

// Read the source parquet file from S3
val df = spark.read.parquet("s3a://huditest/user_behavior_sample_data.parquet")

val databaseName = "hudi_sample"
val tableName = "hudi_coders_hive"
val basePath = "s3a://huditest/hudi_coders"

// Write to a Hudi table and sync it to the Hive metastore
df.write.format("hudi").
  option(org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME, tableName).
  option(RECORDKEY_FIELD_OPT_KEY, "UserID").
  option(PRECOMBINE_FIELD_OPT_KEY, "UserID").
  option("hoodie.datasource.hive_sync.enable", "true").
  option("hoodie.datasource.hive_sync.mode", "hms").
  option("hoodie.datasource.hive_sync.database", databaseName).
  option("hoodie.datasource.hive_sync.table", tableName).
  option("hoodie.datasource.hive_sync.metastore.uris", "thrift://hive-metastore:9083").
  option("fs.defaultFS", "s3://huditest/").
  mode(Overwrite).
  save(basePath)
```
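For a large initial import like this, the default upsert write path can be slow because of the record-key index lookup. A minimal sketch of an alternative, assuming the same `df`, `tableName`, and `basePath` as above — `bulk_insert` and the file-size config are standard Hudi write options, but the 128 MB value here is an illustrative assumption, not a tuned recommendation:

```scala
// Sketch: same write as above, switched to bulk_insert for a large one-time load.
// bulk_insert skips the upsert index lookup, which usually matters at this scale.
df.write.format("hudi").
  option(org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME, tableName).
  option(RECORDKEY_FIELD_OPT_KEY, "UserID").
  option(PRECOMBINE_FIELD_OPT_KEY, "UserID").
  option("hoodie.datasource.write.operation", "bulk_insert").
  // Target base-file size; 128 MB is an example value, adjust for your data.
  option("hoodie.parquet.max.file.size", String.valueOf(128 * 1024 * 1024)).
  mode(Overwrite).
  save(basePath)
```

The Hive-sync options from the original snippet can be kept as-is; they are omitted here only for brevity.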
There is an issue with large parquet files; see https://github.com/apache/hudi/issues/10697
I have a better example through the Taobao tutorial.