jeguiguren-cohere opened 1 year ago
Any guidance here on how to investigate further or what optimizations to try?
@jeguiguren-cohere What Spark configuration are you using?
@ad1happy2go The Spark configuration is:
from pyspark import SparkConf, SparkContext
from awsglue.context import GlueContext

# Glue job setup: Spark configured for Hudi writes
conf = SparkConf()
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.sql.legacy.pathOptionBehavior.enabled", "true")
conf.set("spark.sql.hive.convertMetastoreParquet", "false")
conf.set("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
spark_context = SparkContext.getOrCreate(conf)
glue_context = GlueContext(spark_context)
and the DataFrame write is:
df.write.format("hudi").options(**hudi_config).mode("append").save()
hudi_config = {
    "hoodie.table.name": TABLE,
    "hoodie.datasource.write.recordkey.field": "documentKey",
    "hoodie.datasource.write.precombine.field": "clusterTime",
    "hoodie.datasource.write.reconcile.schema": "false",
    "hoodie.schema.on.read.enable": "true",
    "hoodie.bulkinsert.sort.mode": "GLOBAL_SORT",
    "hoodie.metadata.enable": "false",
    "hoodie.datasource.hive_sync.database": DB_NAME,
    "hoodie.datasource.hive_sync.table": TABLE,
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.NonPartitionedExtractor",
    "hoodie.datasource.hive_sync.partition_value_extractor": "org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor",
    "hoodie.index.type": "GLOBAL_SIMPLE",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
}
@jeguiguren-cohere Can you try tuning the Spark configuration? https://hudi.apache.org/docs/tuning-guide/
Please let us know your findings. Thanks.
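For reference, a minimal sketch of the kind of knobs the tuning guide covers, applied to the configuration above. The parallelism and file-split values below are illustrative placeholders to be sized against the actual Glue worker count and row width, not measured recommendations:

# Hedged sketch: values are placeholders; keys are standard Hudi write options.
tuning_overrides = {
    # Shuffle width of the upsert stage ("Doing partition and writing data").
    "hoodie.upsert.shuffle.parallelism": "200",
    # How many inserted records Hudi packs into a single file group per write.
    "hoodie.copyonwrite.insert.split.size": "500000",
}
hudi_config.update(tuning_overrides)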
Describe the problem you faced
We are using Hudi on AWS Glue to continuously merge small batches of data into bronze tables, and we are noticing slow write performance (20+ minutes) when upserting into a COW table.
The target table is relatively small, approximately 6 million rows x 1,000 columns, and the incoming batches have fewer than 50,000 records (reduced during the preCombine step to fewer than 10,000 unique records). The table is not partitioned because it is small, and it is currently configured with a simple global index.
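To illustrate, the preCombine step keeps only the latest row per record key; a rough plain-Spark equivalent, using the documentKey and clusterTime fields from the config above (a sketch for illustration, not the code in the job), would be:

from pyspark.sql import functions as F, Window

# Keep the newest row per documentKey, mirroring hoodie.datasource.write.precombine.field.
latest_per_key = Window.partitionBy("documentKey").orderBy(F.col("clusterTime").desc())
deduped_df = (
    df.withColumn("_rn", F.row_number().over(latest_per_key))
      .filter(F.col("_rn") == 1)
      .drop("_rn")
)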
Expected behavior
I would expect writes of this size to take a few minutes, similar to a vanilla Spark job writing Parquet files to S3.
Environment Description
Additional context
Table config in /.hoodie/hoodie.properties:
Hudi config:
Spark stages show that the majority of the time (20+ minutes) is spent in the "Doing partition and writing data" stage.
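One experiment that may help narrow this down: with 1,000 columns, updating even a few thousand keys can force full rewrites of wide Copy-on-Write base files, so much of the "Doing partition and writing data" time may go to Parquet rewriting rather than indexing. A hedged sketch of re-running the same write with smaller base files, using standard Hudi file-sizing options and illustrative values:

# Illustrative values only: smaller base files mean each upsert rewrites less data
# per file group, at the cost of more files overall.
file_sizing_overrides = {
    "hoodie.parquet.max.file.size": str(64 * 1024 * 1024),     # ~64 MB (default is ~120 MB)
    "hoodie.parquet.small.file.limit": str(32 * 1024 * 1024),  # treat files under ~32 MB as "small"
}
hudi_config.update(file_sizing_overrides)
# Then re-run the same df.write.format("hudi").options(**hudi_config).mode("append").save() call
# and compare the stage duration in the Spark UI.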