apache / kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
https://kyuubi.apache.org/
Apache License 2.0
2.08k stars 906 forks

[FEATURE] A new Spark SQL command to merge small files #6691

Open gabrywu opened 2 weeks ago

gabrywu commented 2 weeks ago

Code of Conduct

Search before asking

Describe the feature

A new Spark SQL command to merge small files

compact table table_name [INTO ${targetFileSize} ${targetFileSizeUnit}] [cleanup | retain | list]
-- targetFileSizeUnit can be 'b', 'k', 'm', 'g', 't' or 'p'
-- cleanup (default): delete the compact staging folders, which contain the original small files
-- retain: keep the compact staging folders, e.g. for testing, so the table can be recovered from the staging data
-- list: only report the planned merge result, without actually running the merge
recover compact table table_name
-- recovers a table if the compact table command fails
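
To make the `INTO ${targetFileSize} ${targetFileSizeUnit}` clause concrete, here is a minimal Python sketch of how a size and unit pair could be turned into a byte count. It is only an illustration: `target_size_bytes` is a hypothetical helper, not part of the proposal, and the sketch assumes the units are binary multiples (1 k = 1024 b), which the issue does not state.

```python
# Hypothetical helper, not part of Kyuubi: converts the size/unit pair of
# "INTO ${targetFileSize} ${targetFileSizeUnit}" into a number of bytes.
# Assumption: units are binary multiples (1 k = 1024 b).
UNITS = "bkmgtp"  # position in this string is the power of 1024


def target_size_bytes(size: int, unit: str) -> int:
    """Return the target file size in bytes for a (size, unit) pair."""
    unit = unit.lower()
    if len(unit) != 1 or unit not in UNITS:
        raise ValueError(f"unknown unit {unit!r}, expected one of {list(UNITS)}")
    if size <= 0:
        raise ValueError("target file size must be positive")
    return size * 1024 ** UNITS.index(unit)
```

Under that assumption, `target_size_bytes(128, "m")` corresponds to `compact table t INTO 128 m`.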

Motivation

In many cases a SQL statement generates small files, and we MUST merge them into bigger ones.

Describe the solution

This command doesn't read and rewrite all of the records of a table; it just merges files at the binary level. Take a CSV table for example: it only appends the byte array of one file to another, without reading or writing individual records.
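
The binary-level merge described above can be sketched roughly as follows. This is a plain Python illustration on local files, not the proposed Spark implementation: `merge_files_binary` is a hypothetical name, and the sketch assumes headerless CSV files that each end with a newline, so that simple concatenation yields a valid CSV file.

```python
import shutil
from pathlib import Path
from typing import Iterable


def merge_files_binary(sources: Iterable[Path], target: Path) -> int:
    """Append the raw bytes of each source file to target.

    No records are parsed; bytes are copied as-is.
    Returns the total number of bytes written.
    """
    written = 0
    with open(target, "wb") as out:
        for src in sources:
            with open(src, "rb") as inp:
                # Stream the bytes in chunks, without decoding any rows.
                shutil.copyfileobj(inp, out)
            written += Path(src).stat().st_size
    return written
```

Formats that carry per-file headers or footers would need format-specific handling and cannot be merged by plain concatenation like this.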

Additional context

referring to a blog

Are you willing to submit PR?