Open · Yuhta opened 6 months ago
CC: @FelixYBW @rui-mo
@gaoyangxiaozhu can we pass the input_file_name as a literal to Velox for each split, since it's fixed for each split?
@FelixYBW The path is already in the split. The problem is how to carry the information from the split into the function.
Oh, I see. Your option 2 is what I'm thinking of: add a project after table scan to append a literal column to the scan result, and hide the `input_file_name()` implementation in Gluten completely. In this way we can add similar function implementations in Gluten directly. @rui-mo Is there any potential issue with this?
@FelixYBW If option 2 sounds good to you, you can follow what Presto does by adding a Hive column handle with type `kSynthesized` and name `$path`, then extract that field in the function. We don't need to change anything in the driver or table scan operator for this path.
Makes sense. I will follow up with Rui and Yangyang on this. Thank you @Yuhta.
@FelixYBW Looks like the second option is feasible in Gluten. Thanks.
One Spark task may read multiple files according to `spark.sql.files.maxPartitionBytes`; which file will Gluten return for `input_file_name` by design?
@Yohahaha The `$path` value is set according to the information in the split; it does not matter how many splits the task is reading.
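For context, the file path is already available on each split. A rough sketch of constructing a Hive split in Velox follows; the connector id and constructor arguments are assumptions and may differ by Velox version:

```cpp
// Sketch: a Hive connector split already carries the path of the file it
// covers, which is the information the synthesized "$path" column exposes.
#include <limits>
#include <memory>
#include <string>

#include "velox/connectors/hive/HiveConnectorSplit.h"

using namespace facebook::velox;

std::shared_ptr<connector::hive::HiveConnectorSplit> makeSplit(
    const std::string& filePath) {
  return std::make_shared<connector::hive::HiveConnectorSplit>(
      "test-hive",                            // connector id (assumed)
      filePath,                               // file this split reads
      dwio::common::FileFormat::PARQUET,      // file format
      0,                                      // start offset
      std::numeric_limits<uint64_t>::max());  // length: rest of the file
}
```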
@rui-mo / @FelixYBW just back from OOF, so do you agree with option 2, using the `$path` synthetic column? If that is the option, I can follow up with the code implementation.
Hey @Yuhta, quick question: do you have an example of how to extract a specific field in the function that can be referenced? Or @rui-mo, you may also know?
@gaoyangxiaozhu I think you need to do it in the planner, rewriting `input_file_name()` to `url_encode($path)`.
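As a rough illustration of the rewritten plan shape (not Gluten's actual planner code), assuming the scan exposes the synthesized `$path` column and a `url_encode` scalar function is registered; the `PlanBuilder` overloads, column names, and the quoted `"$path"` identifier are assumptions:

```cpp
// Sketch only: the planner rewrites input_file_name() into a projection of
// url_encode over the synthesized "$path" column produced by the scan.
#include "velox/exec/tests/utils/PlanBuilder.h"
#include "velox/type/Type.h"

using namespace facebook::velox;
using facebook::velox::exec::test::PlanBuilder;

core::PlanNodePtr makeInputFileNamePlan(
    const std::shared_ptr<connector::ConnectorTableHandle>& tableHandle,
    const std::unordered_map<
        std::string,
        std::shared_ptr<connector::ColumnHandle>>& assignments) {
  // Scan output: the regular columns plus the synthesized "$path".
  auto scanOutput = ROW({"c0", "$path"}, {BIGINT(), VARCHAR()});

  return PlanBuilder()
      .tableScan(scanOutput, tableHandle, assignments)
      // input_file_name() is replaced by the planner with this projection.
      .project({"c0", "url_encode(\"$path\") as input_file_name"})
      .planNode();
}
```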
I see, so it looks a little bit tricky that we still need to change the planner for this special case to leverage both the `url_encode` function and the `$path` `kSynthesized` column handle.
I can imagine we may need to apply a similar planning strategy to other places with similar functions. @rui-mo / @FelixYBW please double check whether this is an acceptable approach before I start the coding part.
Go ahead and implement it. Just talked with Rui. A new project would be too complex; let's add it in the future.
Got it! Thank you @FelixYBW / @rui-mo! Let me do the follow-up.
The Spark implementation of `input_file_name` uses a thread local to stash the file name and retrieve it from the function. The same method does not work in Velox because the driver can be taken off the thread and a different driver can be scheduled when the function is called. There are 2 ways to do it in Velox:

1. Keep it in `DriverCtx`. This imposes some challenge to hide the file-specific detail from the driver level, while we need to be able to set it in table scan and read it back in the function.
2. Add `$path` to the output type of table scan; then the function will just project that special field out and do the escaping. All the data types between table scan and filter project need to be changed in the Gluten plan.

CC: @gaoyangxiaozhu @mbasmanova
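For option 2, a hypothetical sketch of how the table scan could materialize the `$path` column per batch from the current split's file path. `makePathColumn` is a made-up helper and the `BaseVector::createConstant` arguments are approximate, not Velox's actual `HiveDataSource` internals:

```cpp
// Hypothetical sketch: every output batch gets a constant VARCHAR vector
// holding the path of the file the current split points at. Helper names and
// exact BaseVector::createConstant arguments may differ across Velox versions.
#include <string>

#include "velox/type/Variant.h"
#include "velox/vector/BaseVector.h"

using namespace facebook::velox;

VectorPtr makePathColumn(
    const std::string& splitFilePath,
    vector_size_t numRows,
    memory::MemoryPool* pool) {
  // One constant value per batch: the split's file path, later url-encoded by
  // the projection that implements input_file_name().
  return BaseVector::createConstant(
      VARCHAR(), variant(splitFilePath), numRows, pool);
}
```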