StarRocks / demo


Hudi docker-compose: here's an example of loading a Parquet file #49

Closed · alberttwong closed this 7 months ago

alberttwong commented 7 months ago

The existing example shows only a 3-row insert; I wanted to show importing a large Parquet file.

For future reference:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import scala.collection.JavaConversions._

// Read the source Parquet file from S3-compatible storage.
val df = spark.read.parquet("s3a://huditest/user_behavior_sample_data.parquet")

val databaseName = "hudi_sample"
val tableName = "hudi_coders_hive"
val basePath = "s3a://huditest/hudi_coders"

// Write the DataFrame as a Hudi table, keyed and precombined on UserID,
// and sync the table metadata to the Hive Metastore so other engines can
// query it. fs.defaultFS uses the s3a:// scheme to match the paths above.
df.write.format("hudi").
  option(org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME, tableName).
  option(RECORDKEY_FIELD_OPT_KEY, "UserID").
  option(PRECOMBINE_FIELD_OPT_KEY, "UserID").
  option("hoodie.datasource.hive_sync.enable", "true").
  option("hoodie.datasource.hive_sync.mode", "hms").
  option("hoodie.datasource.hive_sync.database", databaseName).
  option("hoodie.datasource.hive_sync.table", tableName).
  option("hoodie.datasource.hive_sync.metastore.uris", "thrift://hive-metastore:9083").
  option("fs.defaultFS", "s3a://huditest/").
  mode(Overwrite).
  save(basePath)
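
To sanity-check the load, the table can be read back from the same base path in the same spark-shell session (a minimal sketch; hudiDf is my name for the result, not part of the example above):

// Read the Hudi table back and compare row counts with the source DataFrame.
val hudiDf = spark.read.format("hudi").load(basePath)
println(s"source rows: ${df.count()}, hudi rows: ${hudiDf.count()}")

// The Hive-synced table should also be queryable through the metastore.
spark.sql(s"SELECT COUNT(*) FROM $databaseName.$tableName").show()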
alberttwong commented 7 months ago

There's an issue with large Parquet files: https://github.com/apache/hudi/issues/10697
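
For anyone hitting this on a large one-time load, one mitigation worth trying (my suggestion, not from the linked issue) is Hudi's bulk_insert write operation, which is geared toward large initial loads rather than the default upsert path:

// Same write as above, switched to bulk_insert for a large initial load
// (assumes the imports and vals from the earlier snippet; the hive_sync
// options can be carried over unchanged).
df.write.format("hudi").
  option(org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME, tableName).
  option(RECORDKEY_FIELD_OPT_KEY, "UserID").
  option(PRECOMBINE_FIELD_OPT_KEY, "UserID").
  option(OPERATION_OPT_KEY, "bulk_insert").
  mode(Overwrite).
  save(basePath)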

alberttwong commented 7 months ago

I have a better example in the Taobao tutorial.