databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

NumPartitions == num files. Can I choose partitions manually? #677

Closed hipp0gryph closed 6 months ago

hipp0gryph commented 6 months ago

Hello! Sorry to trouble you. I use this driver to read 150k very small files. I get a DataFrame with 150k partitions, and that DataFrame performs poorly across a large number of operations. df.coalesce takes a long time. Can I choose the number of partitions manually on read? Thanks in advance!

srowen commented 6 months ago

This isn't related to this library. You just need coalesce(). That itself is not an expensive operation, but you will start with a partition per (small) file in Spark.