delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0
7.22k stars 1.62k forks

How does Optimize decide the File Size (Question) #3272

Open ugurkalkavan opened 1 week ago

ugurkalkavan commented 1 week ago

Hi, I used to use my own auto-compaction method on a legacy system. It calculates the total file size for each Hive partition and consolidates the files in that partition.

Example: a partition contains 1000 files of around 1 MB each, so the total is about 1 GB. The method divides the total by 128 MB and takes the ceiling, which is 8 in our case, then repartitions the data into 8 files.
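A minimal sketch of that file-count calculation, in plain Python. The function name `target_file_count` and the 128 MB default are only illustrative; the real method would pass the result to something like a Spark `repartition` call, which is not shown here.

```python
MB = 1024 * 1024


def target_file_count(file_sizes, target_bytes=128 * MB):
    """Return ceil(total size / target size), with a minimum of one file."""
    total = sum(file_sizes)
    # Integer ceiling division avoids floating-point rounding.
    return max(1, (total + target_bytes - 1) // target_bytes)


# The example from the post: 1000 files of about 1 MB each.
partition_files = [1 * MB] * 1000
print(target_file_count(partition_files))  # 8
```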

After compaction, the new total size is much less than 1 GB; it might be 700 MB. So I needed to run the function recursively until the files reached the proper size (generally two or three times).
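The repeated runs can be sketched as a loop that stops once the computed file count no longer drops. This is a toy model, not the legacy code: `toy_rewrite` is a hypothetical stand-in for the real rewrite step, and its 30% shrink for small files is an assumption chosen only to mirror the 1 GB to 700 MB example.

```python
MB = 1024 * 1024
TARGET = 128 * MB


def toy_rewrite(sizes, n_files):
    """Hypothetical rewrite: merge everything into n_files equal files.

    Assumption: merging many small files (< 16 MB on average) shrinks the
    total to 70%; files that are already large do not shrink further.
    """
    total = sum(sizes)
    if total // len(sizes) < 16 * MB:
        total = total * 7 // 10
    return [total // n_files] * n_files


def compact_until_stable(sizes, rewrite, target_bytes=TARGET, max_passes=5):
    """Re-run compaction until the wanted file count stops decreasing."""
    passes = 0
    while passes < max_passes:
        wanted = max(1, (sum(sizes) + target_bytes - 1) // target_bytes)
        if wanted >= len(sizes):
            break  # already at (or below) the wanted number of files
        sizes = rewrite(sizes, wanted)
        passes += 1
    return sizes, passes


final_sizes, passes = compact_until_stable([1 * MB] * 1000, toy_rewrite)
print(len(final_sizes), passes)
```

Under these assumptions the first pass writes 8 files of about 87.5 MB, the second pass merges them into 6 files, and a third check finds nothing left to do.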

My question is: how does Delta's OPTIMIZE deal with this issue?

Thank you.