facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0

Spark input_file_name design #9957

Open Yuhta opened 6 months ago

Yuhta commented 6 months ago

The Spark implementation of input_file_name uses a thread local to stash the file name and retrieve it from the function. The same method does not work in Velox because a driver can be taken off its thread, and a different driver can be scheduled on that thread, by the time the function is called. There are two ways to do it in Velox:

  1. To mimic what Spark does, we would need to store the information in DriverCtx. The challenge is hiding the file-specific details from the driver level while still being able to set them in table scan and read them back in the function.
  2. To mimic what Presto does, Gluten can change the plan to add an extra field $path to the output type of table scan; the function then just projects that special field out and does the escaping. All the data types between table scan and filter project need to be changed in the Gluten plan.
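
To make the thread-local problem concrete, here is a minimal, self-contained sketch. It does not use any Velox or Spark code: `ToyDriver`, `tlsInputFileName`, and `interleaveOnOneThread` are hypothetical names invented for illustration, and the interleaving is simulated on a single thread rather than produced by a real scheduler.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a thread-local file name holder.
thread_local std::string tlsInputFileName;

// Toy "driver": a unit of execution that can be paused and resumed, with
// other drivers possibly running on the same thread in between.
struct ToyDriver {
  std::string splitPath;

  // Table scan step: stash the current file name in the thread-local.
  void scan() const {
    assert(!splitPath.empty());
    tlsInputFileName = splitPath;
  }

  // Function evaluation step: read the file name back from the thread-local.
  std::string evalInputFileName() const {
    return tlsInputFileName;
  }
};

// Interleaves two drivers on one thread and returns what driver `a` observes.
inline std::string interleaveOnOneThread(
    const ToyDriver& a,
    const ToyDriver& b) {
  a.scan(); // `a` runs its scan, then is taken off the thread.
  b.scan(); // `b` is scheduled on the same thread and overwrites the value.
  return a.evalInputFileName(); // `a` resumes and reads `b`'s file name.
}
```

With two drivers reading different files, `interleaveOnOneThread(a, b)` returns the path that belongs to `b`, which is the wrong answer for `a`. Both options above avoid this by tying the file name to something other than the thread.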

CC: @gaoyangxiaozhu @mbasmanova

mbasmanova commented 6 months ago

CC: @FelixYBW @rui-mo

FelixYBW commented 6 months ago

@gaoyangxiaozhu Can we pass input_file_name to Velox as a literal for each split, since it is fixed for each split?

Yuhta commented 6 months ago

@FelixYBW The path is already in the split. The problem is how to carry the information from split into the function.

FelixYBW commented 6 months ago

Oh, I see. Your option 2 is what I'm thinking: add a project after table scan to append a literal column to the scan result, and hide the input_file_name() implementation in Gluten completely. This way we can add similar function implementations in Gluten directly. @rui-mo Is there any potential issue with this?

Yuhta commented 6 months ago

@FelixYBW If option 2 sounds good to you, you can follow what Presto does by adding a Hive column handle with type kSynthesized and name $path, then extract that field in the function. We don't need to change anything in the driver or the table scan operator for this path.

FelixYBW commented 6 months ago

Makes sense. I will follow up with Rui and Yangyang on this. Thank you, @Yuhta.

rui-mo commented 6 months ago

@FelixYBW Looks like the second option is feasible in Gluten. Thanks.

Yohahaha commented 5 months ago

One Spark task may read multiple files, depending on spark.sql.files.maxPartitionBytes. Which file will input_file_name return in Gluten under this design?

Yuhta commented 5 months ago

@Yohahaha The $path value is set according to the information in the split, so it does not matter how many splits the task is reading.
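
As a rough illustration of why the number of splits does not matter, the sketch below models a scan that attaches each split's path to every row it produces. The types `ToySplit` and `ToyRow` and the function `scanSplits` are hypothetical simplifications made up for this example; they are not Velox or Gluten classes.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; not Velox types.
struct ToySplit {
  std::string path;        // File this split reads.
  std::vector<int> values; // Regular column data found in the split.
};

struct ToyRow {
  int value;        // Regular column.
  std::string path; // Synthesized "$path" column.
};

// Toy table scan: every row carries the path of the split it came from,
// no matter how many splits the same task processes.
inline std::vector<ToyRow> scanSplits(const std::vector<ToySplit>& splits) {
  std::vector<ToyRow> rows;
  for (const auto& split : splits) {
    for (int v : split.values) {
      rows.push_back({v, split.path});
    }
  }
  return rows;
}
```

Scanning two splits in one call yields rows whose path field matches their own split, so a later projection of that field returns the right file for each row.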

gaoyangxiaozhu commented 5 months ago

> @FelixYBW If option 2 sounds good to you, you can follow what Presto does by adding a Hive column handle with type kSynthesized and name $path, then extract that field in the function. We don't need to change anything in the driver or the table scan operator for this path.

@rui-mo / @FelixYBW I just got back from OOF. Do you agree with option 2, using the $path synthetic column? If so, I can follow up with the code implementation.

gaoyangxiaozhu commented 5 months ago

> @FelixYBW If option 2 sounds good to you, you can follow what Presto does by adding a Hive column handle with type kSynthesized and name $path, then extract that field in the function. We don't need to change anything in the driver or the table scan operator for this path.

Hey @Yuhta, quick question: do you have an example I can reference for how to extract a specific field in a function? Or @rui-mo, you may also know?

Yuhta commented 5 months ago

@gaoyangxiaozhu I think you need to do it in the planner, rewriting input_file_name() to url_encode($path)
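
A minimal sketch of such a planner rewrite, over a toy expression tree, might look like the following. Everything here (`ToyExpr`, `call`, `field`, `rewriteInputFileName`, `toString`) is a hypothetical illustration and not the Gluten or Velox planner API; it also assumes the table scan output already exposes a `$path` column.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified expression tree; not Gluten or Velox types.
struct ToyExpr {
  enum class Kind { kCall, kField };
  Kind kind;
  std::string name;                           // Function or field name.
  std::vector<std::shared_ptr<ToyExpr>> args; // Call arguments.
};
using ToyExprPtr = std::shared_ptr<ToyExpr>;

inline ToyExprPtr call(std::string name, std::vector<ToyExprPtr> args = {}) {
  return std::make_shared<ToyExpr>(
      ToyExpr{ToyExpr::Kind::kCall, std::move(name), std::move(args)});
}

inline ToyExprPtr field(std::string name) {
  return std::make_shared<ToyExpr>(
      ToyExpr{ToyExpr::Kind::kField, std::move(name), {}});
}

// Recursively replaces input_file_name() with url_encode($path).
inline ToyExprPtr rewriteInputFileName(const ToyExprPtr& expr) {
  if (expr->kind == ToyExpr::Kind::kField) {
    return expr;
  }
  if (expr->name == "input_file_name" && expr->args.empty()) {
    return call("url_encode", {field("$path")});
  }
  std::vector<ToyExprPtr> newArgs;
  for (const auto& arg : expr->args) {
    newArgs.push_back(rewriteInputFileName(arg));
  }
  return call(expr->name, std::move(newArgs));
}

// Renders an expression as text, for inspection.
inline std::string toString(const ToyExprPtr& expr) {
  if (expr->kind == ToyExpr::Kind::kField) {
    return expr->name;
  }
  std::string out = expr->name + "(";
  for (std::size_t i = 0; i < expr->args.size(); ++i) {
    if (i > 0) {
      out += ", ";
    }
    out += toString(expr->args[i]);
  }
  return out + ")";
}
```

The rewrite builds a new tree and leaves the input untouched, so the original plan can still be inspected or reused after the call.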

gaoyangxiaozhu commented 5 months ago

> @gaoyangxiaozhu I think you need to do it in the planner, rewriting input_file_name() to url_encode($path)

I see. So it looks a little tricky: we still need to change the planner for this special case to leverage both the url_encode function and the $path kSynthesized column handle.

I can imagine we may need to apply a similar planning strategy to other, similar functions.

@rui-mo / @FelixYBW Please double-check that this is an acceptable approach before I start on the code.

FelixYBW commented 5 months ago

> @rui-mo / @FelixYBW Please double-check that this is an acceptable approach before I start on the code.

Go ahead and implement it. I just talked with Rui: a new project will be too complex, so let's add it in the future.

gaoyangxiaozhu commented 5 months ago

Got it! Thank you, @FelixYBW / @rui-mo! Let me do the follow-up.