Closed neerajpadarthi closed 3 weeks ago
@neerajpadarthi Sorry for the delay here. Did you run these in the same spark-shell session? With the metadata table disabled, it does a full listing of the files. Can you check how many file groups you have in your dataset? You can try running this - https://medium.com/@simpsons/monitoring-table-stats-22684eb70ee1
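One rough way to estimate the file group count from a pyspark session, without any extra tooling, is to count distinct values of Hudi's built-in `_hoodie_file_name` meta column; this is a sketch under the assumption of a copy-on-write table (where each distinct base file name corresponds to roughly one file group), using the table name from this issue:

```python
# Sketch: assumes a running pyspark session (`spark`) with the Hudi bundle on
# the classpath, and a COPY_ON_WRITE table. _hoodie_file_name is one of Hudi's
# standard meta columns, so no extra configuration is needed to read it.
df = spark.sql("SELECT _hoodie_file_name FROM tst_db.tst_tb_partitioned_tst")

# Each distinct base file name maps to roughly one file group on a COW table;
# a very large count here would explain slow S3 listing with metadata disabled.
file_group_estimate = df.distinct().count()
print(f"approximate file groups: {file_group_estimate}")
```

This only approximates file groups visible in the latest snapshot; the linked post covers more complete table stats.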
@neerajpadarthi Were you able to find the issue, or are you still facing this? Let us know. Thanks.
@neerajpadarthi can you try version 0.13.0 or above? The analyzer will fetch all partitions in 0.11.0, I think. https://issues.apache.org/jira/browse/HUDI-4812
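Since EMR 6.7 bundles Hudi 0.11.0 at /usr/lib/hudi/hudi-spark-bundle.jar, one way to try a later release without changing the EMR image is to resolve the 0.13.0 bundle from Maven instead of the local jar. This is a sketch: the coordinates below assume Spark 3.2 / Scala 2.12, so adjust them to the actual Spark version on the cluster.

```shell
# Sketch: same submit shape as in the issue, but pulling Hudi 0.13.0 from Maven
# via --packages instead of the EMR-bundled 0.11.0 jar.
# org.apache.hudi:hudi-spark3.2-bundle_2.12:0.13.0 assumes Spark 3.2 / Scala 2.12.
spark-submit --master yarn --deploy-mode client \
  --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.13.0 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
```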
@neerajpadarthi Let us know in case you were able to try with later Hudi versions. Thanks.
@neerajpadarthi Closing this out. Please reopen in case of any concerns. Thanks.
Describe the problem you faced
Hi team, when loading the partitioned dataset I am seeing slow read performance, even without executing any Spark action. Could you please check the following configurations/details and let us know whether this delay is expected even with the metadata table enabled during reads? Thanks.
I am using EMR 6.7 with Hudi Version 0.11.0.
Spark Submit -
spark-submit --master yarn --deploy-mode client --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false --conf spark.hadoop.fs.s3.maxRetries=50 --conf spark.shuffle.blockTransferService=nio --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar
Dataset - Contains 5,864 partitions
Without metadata:
df = spark.sql("SELECT * FROM tst_db.tst_tb_partitioned_tst") >> Time taken - 226 seconds
df.count() >> Time taken - 24 seconds

With metadata enabled:
spark.conf.set("hoodie.metadata.enable", "true")
df = spark.sql("SELECT * FROM tst_db.tst_tb_partitioned_tst") >> Time taken - 58 seconds
df.count() >> Time taken - 34 seconds
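Besides setting the session-level conf shown above, the metadata table can also be enabled per read through a DataFrame option, which avoids depending on when the conf was set in the session. A minimal sketch, assuming a running pyspark session; `s3://bucket/path/to/table` is a placeholder for the actual base path of tst_db.tst_tb_partitioned_tst:

```python
# Sketch: assumes a running pyspark session (`spark`) with the Hudi bundle loaded.
# The S3 path below is a placeholder, not the real table location.
df = (spark.read.format("hudi")
      .option("hoodie.metadata.enable", "true")  # read file listings from the metadata table
      .load("s3://bucket/path/to/table"))
df.count()
```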