Closed neerajpadarthi closed 3 weeks ago
@neerajpadarthi Sorry for the delay here. Did you run these in the same spark-shell session? With the metadata table disabled, it does a full listing of the files. Can you check how many file groups you have in your dataset? You can try running this - https://medium.com/@simpsons/monitoring-table-stats-22684eb70ee1
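One rough way to estimate the file group count from a pyspark session, without any extra tooling, is to count distinct values of Hudi's built-in `_hoodie_file_name` meta column; this is a sketch under the assumption of a copy-on-write table (where each distinct base file name corresponds to roughly one file group), using the table name from this issue:

```python
# Sketch: assumes a running pyspark session (`spark`) with the Hudi bundle on
# the classpath, and a COPY_ON_WRITE table. _hoodie_file_name is one of Hudi's
# standard meta columns, so no extra configuration is needed to read it.
df = spark.sql("SELECT _hoodie_file_name FROM tst_db.tst_tb_partitioned_tst")

# Each distinct base file name maps to roughly one file group on a COW table;
# a very large count here would explain slow S3 listing with metadata disabled.
file_group_estimate = df.distinct().count()
print(f"approximate file groups: {file_group_estimate}")
```

This only approximates file groups visible in the latest snapshot; the linked post covers more complete table stats.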
@neerajpadarthi Were you able to find the issue, or are you still facing this? Let us know. Thanks.
@neerajpadarthi can you try version 0.13.0 or above? The analyzer will fetch all partitions in 0.11.0, I think. https://issues.apache.org/jira/browse/HUDI-4812
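Since EMR 6.7 bundles Hudi 0.11.0 at /usr/lib/hudi/hudi-spark-bundle.jar, one way to try a later release without changing the EMR image is to resolve the 0.13.0 bundle from Maven instead of the local jar. This is a sketch: the coordinates below assume Spark 3.2 / Scala 2.12, so adjust them to the actual Spark version on the cluster.

```shell
# Sketch: same submit shape as in the issue, but pulling Hudi 0.13.0 from Maven
# via --packages instead of the EMR-bundled 0.11.0 jar.
# org.apache.hudi:hudi-spark3.2-bundle_2.12:0.13.0 assumes Spark 3.2 / Scala 2.12.
spark-submit --master yarn --deploy-mode client \
  --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.13.0 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
```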
@neerajpadarthi Let us know in case you were able to try with later Hudi versions. Thanks.
@neerajpadarthi Closing this out. Please reopen in case of any concerns. Thanks.
Describe the problem you faced
Hi team, when loading the partitioned dataset I am seeing slow read performance, even without executing any Spark action. Could you please check the following configurations/details and let us know whether this delay is expected even with the metadata table enabled during reads? Thanks.
I am using EMR 6.7 with Hudi Version 0.11.0.
Spark Submit -
spark-submit --master yarn --deploy-mode client --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false --conf spark.hadoop.fs.s3.maxRetries=50 --conf spark.shuffle.blockTransferService=nio --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar
Dataset - Contains 5,864 partitions
Without metadata:
df = spark.sql("SELECT * FROM tst_db.tst_tb_partitioned_tst") >> Time taken - 226 seconds
df.count() >> Time taken - 24 seconds

With metadata enabled:
spark.conf.set("hoodie.metadata.enable", "true")
df = spark.sql("SELECT * FROM tst_db.tst_tb_partitioned_tst") >> Time taken - 58 seconds
df.count() >> Time taken - 34 seconds
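Besides setting the session-level conf shown above, the metadata table can also be enabled per read through a DataFrame option, which avoids depending on when the conf was set in the session. A minimal sketch, assuming a running pyspark session; `s3://bucket/path/to/table` is a placeholder for the actual base path of tst_db.tst_tb_partitioned_tst:

```python
# Sketch: assumes a running pyspark session (`spark`) with the Hudi bundle loaded.
# The S3 path below is a placeholder, not the real table location.
df = (spark.read.format("hudi")
      .option("hoodie.metadata.enable", "true")  # read file listings from the metadata table
      .load("s3://bucket/path/to/table"))
df.count()
```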