wesley-too opened this issue 1 year ago
Looking at the code, that setting only applies if your table is partitioned, which in your example it isn't. Maybe it would make sense to add something that handles the non-partitioned case? But I could see repartition(1) also causing issues if it were the default behavior.
As mentioned above, repartitionBeforeWrite won't do anything here since the table isn't partitioned. Consider using .repartition(x) as mentioned above, and see https://github.com/delta-io/delta/issues/239. Please let me know if this helps to address your issue!
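For reference, a minimal sketch of the suggested .repartition(x) approach for a plain (non-merge) write; the DataFrame, the target path, and the value 1 are all illustrative:

```python
# Sketch: control the number of output files for a plain write.
# Pick the repartition value based on your data volume; 1 is illustrative.
(df.repartition(1)          # one shuffle partition -> one output file
   .write
   .format("delta")
   .mode("append")
   .save("s3://my-bucket/my-table"))  # hypothetical path
```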
Bug
Describe the problem
The docs at https://docs.delta.io/latest/delta-update.html#performance-tuning describe spark.databricks.delta.merge.repartitionBeforeWrite.enabled. The problem is that the repartitioning does not take effect, and file fragmentation becomes an issue.
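For context, this is how we enable the setting (the property name is taken from the docs linked above):

```python
# Enable repartitioning of merge output before it is written.
spark.conf.set(
    "spark.databricks.delta.merge.repartitionBeforeWrite.enabled", "true"
)
```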
Steps to reproduce
I'm using EMR 6.8.0, and this example can be run in a notebook.
First notebook cell (the PySpark code):
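A minimal sketch of the setup described: a non-partitioned Delta table on S3 receiving repeated small CDC-style merges. The table path, schema, and loop bounds below are hypothetical, and `spark` is the notebook's session:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

spark.conf.set(
    "spark.databricks.delta.merge.repartitionBeforeWrite.enabled", "true"
)

path = "s3://my-bucket/repro-table"  # hypothetical path

# Create a small, non-partitioned Delta table.
spark.range(0, 1000).withColumn("value", F.rand()) \
    .write.format("delta").mode("overwrite").save(path)

target = DeltaTable.forPath(spark, path)

# Simulate CDC: many small merges, each touching only a few rows.
for i in range(20):
    updates = spark.range(i * 10, i * 10 + 10).withColumn("value", F.rand())
    (target.alias("t")
        .merge(updates.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
```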
If we run multiple merges, small files keep being created and are never merged or reused.
In production, we have more than 40k small files that we are processing via Change Data Capture (CDC).
Observed results
This is resulting in high S3 API costs for the pipeline because of the per-object API calls.
We tried to use OPTIMIZE as well, but another problem showed up: we got an exception. It seems like it tried to use the S3 path as a table name in the Hive catalog instead of resolving the Delta table by path?
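The exact command isn't shown above, but for a path-based table compaction is typically invoked in one of these two ways (the path is illustrative); passing a bare S3 path as a table name would be consistent with the Hive-catalog error described:

```python
# SQL form: note the delta.`...` wrapper. Without it, Spark parses the
# argument as a catalog table identifier, which could produce the
# Hive-catalog error described above.
spark.sql("OPTIMIZE delta.`s3://my-bucket/repro-table`")

# Python API form (Delta Lake 2.0+):
from delta.tables import DeltaTable
DeltaTable.forPath(spark, "s3://my-bucket/repro-table") \
    .optimize().executeCompaction()
```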
Expected results
It is expected that merge would reuse existing files rather than keep accumulating small fragmented files. (Or am I wrong?)
Environment information
Willingness to contribute
The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?