Closed liujiayi771 closed 3 days ago
Name | Link |
---|---|
Latest commit | 734c71d113e2aa0f1a466948b812565c8af0277e |
Latest deploy log | https://app.netlify.com/sites/meta-velox/deploys/6739789b75ef9700088bf8ae |
@majetideepak Could you help to review?
@liujiayi771 I have mixed opinions on this. I do see the benefit but then the question becomes why not other file details. I also do not see this for files generated by other tools. Ideally, we could make this pluggable. @Yuhta, @pedroerp Do you have any thoughts on this?
@majetideepak I added this because the Parquet files written by Spark have compressed format suffixes. However, I also tested the Parquet files written by Presto, which do not even have the .parquet suffix.
@liujiayi771 It makes sense that Spark users of Gluten+Velox would like to see Spark filenames and similarly, Presto users would prefer Presto filenames.
Maybe we can add a simple HiveConfig for this? Something like: hive.parquet.write-filename-suffix="presto"
. "spark" can be another option. Velox already has Spark functions. So using the application name here should be okay imo.
@liujiayi771 @majetideepak I think we need to make it more general than just Presto or Spark, there are many more engines inside Meta using Velox than these two. We can make this a global HiveDataSinkFileNameGenerator
for now that takes bucket ID, connector query config, user specified file name (targetFileName
), commit strategy, etc.
@liujiayi771 Can we add it from Gluten instead before pass to velox to make Velox more general?
@liujiayi771 Can we add it from Gluten instead before pass to velox to make Velox more general?
@FelixYBW Currently, the parquet file name written by Gluten is Gluten_Stage_3_TID_2124_VTID_257_0_3_0946dfb5-f773-42c9-ac8e-d4e70bede02b.parquet which is generated by
targetFileName = fmt::format(
"{}_{}_{}_{}",
connectorQueryCtx_->taskId(),
connectorQueryCtx_->driverId(),
connectorQueryCtx_->planNodeId(),
makeUuid());
I found that Jimmy add a new targetFileName in LocationHandle, so we can specify the targetFileName that contains compression kind suffix from Gluten side. However, we will not be able to retain information such as task ID, driver ID, etc., because this information cannot be obtained in the planning phase. We can only generate file names similar to Spark that only contain UUID, for example, Gluten-0946dfb5-f773-42c9-ac8e-d4e70bede02b.zstd.parquet. Do you think it is necessary to retain task ID, stage ID, and driver ID in the file name?
cc @zhztheplayer, @rui-mo, @JkSelf.
@liujiayi771 In versions prior to 3.4 of Gluten, the fileName could be guaranteed to contain task ID information. This is because this information is created when the executor actually retrieves the data to write to the file, and then it is offloaded to the Parquet writer. However, in versions after 3.4, it is difficult to obtain this information when defining the file name during the planning phase. Therefore, I believe it is not feasible to maintain consistency with Spark in this regard. Unless we modify the file name within the actual Parquet writer, which I think is not very meaningful.
@JkSelf The Parquet files written by Spark only contain UUID in the format part-uuid.parquet. What I mean is, if the files produced by Gluten also only contain UUID, do you think there is a problem with that? Is it necessary to include task ID and driver ID information like we currently do?
@liujiayi771 I understand, thank you for your explanation. It's fine with me to remove the task ID and driver ID.
Agree, let's keep the exact the same parquet name as Spark. Customer may use it in their tools. who know.
Close this PR, as we have handled this on the Gluten side. Thanks all.
Adding a compression format suffix to the written Parquet file, and make it easier to determine the compression format of the Parquet file.