Open HaoYang670 opened 1 year ago
This affects both Spark 3.4 and Spark 3.3.2
We currently do not support "_metadata" columns and fall back to the CPU whenever we see one (this is on Spark 3.3.0).
scala> spark.read.parquet("./target/DF").selectExpr("*", "_metadata").show(truncate=false)
23/01/04 16:31:07 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
@Partitioning <SinglePartition$> could run on GPU
*Exec <ProjectExec> will run on GPU
*Expression <Alias> cast(a#64 as string) AS a#80 will run on GPU
*Expression <Cast> cast(a#64 as string) will run on GPU
*Expression <Alias> cast(_metadata#70 as string) AS _metadata#83 will run on GPU
*Expression <Cast> cast(_metadata#70 as string) will run on GPU
*Exec <ProjectExec> will run on GPU
*Expression <Alias> named_struct(file_path, file_path#88, file_name, file_name#89, file_size, file_size#90L, file_modification_time, file_modification_time#91) AS _metadata#70 will run on GPU
*Expression <CreateNamedStruct> named_struct(file_path, file_path#88, file_name, file_name#89, file_size, file_size#90L, file_modification_time, file_modification_time#91) will run on GPU
!Exec <FileSourceScanExec> cannot run on GPU because hidden metadata columns are not supported on GPU
I am not sure about row_index. I don't see that in any of the PRs for the spark issue. It looks like we actually do all of the work for this on the GPU already, so it might be worth not falling back to the CPU and adding in some tests to cover it, but that is a separate issue.
Thank you for the explanation @revans2.
Filed #7458 to track the idea in your comment.
Is your feature request related to a problem? Please describe.
Spark 3.4 makes all fields inside of _metadata not nullable (file_path, file_name, file_modification_time, file_size, row_index).
Related ticket: https://issues.apache.org/jira/browse/SPARK-41151
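One way to see the nullability the ticket describes on a given Spark build is to read the schema of the hidden _metadata column directly. The following is only a sketch: the object name, the helper `metadataNullability`, and the temporary Parquet directory are hypothetical and exist only for illustration. It assumes a Spark version that exposes _metadata on file sources (3.3.0 or later). The set of fields and their nullable flags depend on the Spark version, so no expected output is shown.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object MetadataNullability {
  // Hypothetical helper: returns (field name, nullable flag) for every
  // field of the hidden _metadata struct of a Parquet source.
  def metadataNullability(spark: SparkSession, path: String): Seq[(String, Boolean)] = {
    // _metadata is hidden, so it must be selected explicitly.
    val schema = spark.read.parquet(path).select("_metadata").schema
    schema("_metadata").dataType match {
      case st: StructType => st.fields.toSeq.map(f => f.name -> f.nullable)
      case other          => sys.error(s"unexpected _metadata type: $other")
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("metadata-nullability")
      .getOrCreate()
    try {
      // Write a tiny Parquet dataset to a fresh temporary location.
      val dir = Files.createTempDirectory("metadata-demo").resolve("DF").toString
      spark.range(3).toDF("a").write.parquet(dir)

      // Print one line per _metadata field; values vary by Spark version.
      metadataNullability(spark, dir).foreach { case (name, nullable) =>
        println(s"$name nullable=$nullable")
      }
    } finally {
      spark.stop()
    }
  }
}
```

Running this against Spark 3.3.x and 3.4 side by side should make the change tracked in SPARK-41151 visible as a difference in the printed nullable flags.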