facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0
3.53k stars 1.16k forks source link

misc: Add compression format suffix to the written Parquet file #11563

Closed liujiayi771 closed 3 days ago

liujiayi771 commented 1 week ago

Adding a compression format suffix to the written Parquet file, and make it easier to determine the compression format of the Parquet file.

netlify[bot] commented 1 week ago

Deploy Preview for meta-velox canceled.

Name Link
Latest commit 734c71d113e2aa0f1a466948b812565c8af0277e
Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6739789b75ef9700088bf8ae
liujiayi771 commented 1 week ago

@majetideepak Could you help to review?

majetideepak commented 1 week ago

@liujiayi771 I have mixed opinions on this. I do see the benefit but then the question becomes why not other file details. I also do not see this for files generated by other tools. Ideally, we could make this pluggable. @Yuhta, @pedroerp Do you have any thoughts on this?

liujiayi771 commented 1 week ago

@majetideepak I added this because the Parquet files written by Spark have compressed format suffixes. However, I also tested the Parquet files written by Presto, which do not even have the .parquet suffix.

majetideepak commented 6 days ago

@liujiayi771 It makes sense that Spark users of Gluten+Velox would like to see Spark filenames and similarly, Presto users would prefer Presto filenames. Maybe we can add a simple HiveConfig for this? Something like: hive.parquet.write-filename-suffix="presto". "spark" can be another option. Velox already has Spark functions. So using the application name here should be okay imo.

Yuhta commented 6 days ago

@liujiayi771 @majetideepak I think we need to make it more general than just Presto or Spark, there are many more engines inside Meta using Velox than these two. We can make this a global HiveDataSinkFileNameGenerator for now that takes bucket ID, connector query config, user specified file name (targetFileName), commit strategy, etc.

FelixYBW commented 6 days ago

@liujiayi771 Can we add it from Gluten instead before pass to velox to make Velox more general?

liujiayi771 commented 6 days ago

@liujiayi771 Can we add it from Gluten instead before pass to velox to make Velox more general?

@FelixYBW Currently, the parquet file name written by Gluten is Gluten_Stage_3_TID_2124_VTID_257_0_3_0946dfb5-f773-42c9-ac8e-d4e70bede02b.parquet which is generated by

targetFileName = fmt::format(
        "{}_{}_{}_{}",
        connectorQueryCtx_->taskId(),
        connectorQueryCtx_->driverId(),
        connectorQueryCtx_->planNodeId(),
        makeUuid());

I found that Jimmy add a new targetFileName in LocationHandle, so we can specify the targetFileName that contains compression kind suffix from Gluten side. However, we will not be able to retain information such as task ID, driver ID, etc., because this information cannot be obtained in the planning phase. We can only generate file names similar to Spark that only contain UUID, for example, Gluten-0946dfb5-f773-42c9-ac8e-d4e70bede02b.zstd.parquet. Do you think it is necessary to retain task ID, stage ID, and driver ID in the file name?

cc @zhztheplayer, @rui-mo, @JkSelf.

JkSelf commented 6 days ago

@liujiayi771 In versions prior to 3.4 of Gluten, the fileName could be guaranteed to contain task ID information. This is because this information is created when the executor actually retrieves the data to write to the file, and then it is offloaded to the Parquet writer. However, in versions after 3.4, it is difficult to obtain this information when defining the file name during the planning phase. Therefore, I believe it is not feasible to maintain consistency with Spark in this regard. Unless we modify the file name within the actual Parquet writer, which I think is not very meaningful.

liujiayi771 commented 5 days ago

@JkSelf The Parquet files written by Spark only contain UUID in the format part-uuid.parquet. What I mean is, if the files produced by Gluten also only contain UUID, do you think there is a problem with that? Is it necessary to include task ID and driver ID information like we currently do?

JkSelf commented 5 days ago

@liujiayi771 I understand, thank you for your explanation. It's fine with me to remove the task ID and driver ID.

FelixYBW commented 5 days ago

Agree, let's keep the exact the same parquet name as Spark. Customer may use it in their tools. who know.

liujiayi771 commented 3 days ago

Close this PR, as we have handled this on the Gluten side. Thanks all.