apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[VL] Add compression codec extension to velox written parquet file #7999

Closed liujiayi771 closed 3 days ago

liujiayi771 commented 5 days ago

Description

Currently, a Parquet file written by Gluten gets a name like Gluten_Stage_3_TID_2124_VTID_257_0_3_0946dfb5-f773-42c9-ac8e-d4e70bede02b.parquet, which comes from the default behavior in Velox's HiveDataSink.cpp:

targetFileName = fmt::format(
        "{}_{}_{}_{}",
        connectorQueryCtx_->taskId(),
        connectorQueryCtx_->driverId(),
        connectorQueryCtx_->planNodeId(),
        makeUuid());

https://github.com/facebookincubator/velox/pull/10903 adds a new targetFileName field to LocationHandle, so Gluten can specify a targetFileName that contains the compression codec suffix. This makes the name more consistent with the Parquet file names generated by vanilla Spark.

Parquet files generated by Spark are named part-uuid.codec-extension.parquet. I have defined the name of Parquet files written by Gluten as gluten-part-uuid.codec-extension.parquet, where the gluten prefix indicates that the file was generated by Gluten.
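
As a rough illustration of the naming scheme above, the file name could be assembled like this. This is only a sketch, not the actual Gluten or Velox code: the Codec enum, codecExtension, and makeTargetFileName are hypothetical names, and the extension strings (.snappy, .gz, .zstd, .lz4) are assumed to match what Spark uses for each codec.

```cpp
#include <cassert>
#include <string>

// Hypothetical codec enum for illustration only; this is not the
// Velox or Gluten compression type.
enum class Codec { kNone, kSnappy, kGzip, kZstd, kLz4 };

// Assumed mapping from codec to file-name extension. An uncompressed
// file gets no codec extension at all.
std::string codecExtension(Codec codec) {
  switch (codec) {
    case Codec::kSnappy:
      return ".snappy";
    case Codec::kGzip:
      return ".gz";
    case Codec::kZstd:
      return ".zstd";
    case Codec::kLz4:
      return ".lz4";
    case Codec::kNone:
      return "";
  }
  return "";
}

// Builds gluten-part-<uuid><codec-extension>.parquet. The caller
// supplies the uuid (for example, a freshly generated one per file).
std::string makeTargetFileName(const std::string& uuid, Codec codec) {
  assert(!uuid.empty());
  return "gluten-part-" + uuid + codecExtension(codec) + ".parquet";
}
```

The resulting string would then be passed as the targetFileName of the LocationHandle, so that Velox uses it instead of the default taskId/driverId/planNodeId/uuid name.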