NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA][SPARK-41151] Keep built-in file _metadata column nullable value consistent #7452

Open HaoYang670 opened 1 year ago

HaoYang670 commented 1 year ago

Is your feature request related to a problem? Please describe.

Spark 3.4 makes all fields inside _metadata non-nullable (file_path, file_name, file_modification_time, file_size, row_index).

Related ticket: https://issues.apache.org/jira/browse/SPARK-41151

HaoYang670 commented 1 year ago

This affects both Spark 3.4 and Spark 3.3.2.
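
For anyone who wants to check which behavior a given Spark build has, here is a minimal spark-shell sketch that prints the nullable flag of each field nested in _metadata. It assumes a session with `spark` in scope and reuses the ./target/DF Parquet path from the example later in this thread. No expected output is shown, since the result depends on the Spark version.

```scala
import org.apache.spark.sql.types.StructType

// Assumption: run inside spark-shell, so `spark` is already defined, and
// ./target/DF is an existing Parquet dataset.
val df = spark.read.parquet("./target/DF").select("_metadata")

// _metadata is a struct column; print every nested field with its nullable flag.
df.schema("_metadata").dataType match {
  case st: StructType =>
    st.fields.foreach(f => println(s"${f.name}: nullable=${f.nullable}"))
  case other =>
    println(s"Unexpected type for _metadata: $other")
}
```

Running the same snippet on the CPU and with the plugin enabled would show whether the two schemas agree on nullability.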

revans2 commented 1 year ago

We currently do not support "_metadata" columns and fall back to the CPU if we see one (the output below is from Spark 3.3.0).

scala> spark.read.parquet("./target/DF").selectExpr("*", "_metadata").show(truncate=false)
23/01/04 16:31:07 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> cast(a#64 as string) AS a#80 will run on GPU
      *Expression <Cast> cast(a#64 as string) will run on GPU
    *Expression <Alias> cast(_metadata#70 as string) AS _metadata#83 will run on GPU
      *Expression <Cast> cast(_metadata#70 as string) will run on GPU
    *Exec <ProjectExec> will run on GPU
      *Expression <Alias> named_struct(file_path, file_path#88, file_name, file_name#89, file_size, file_size#90L, file_modification_time, file_modification_time#91) AS _metadata#70 will run on GPU
        *Expression <CreateNamedStruct> named_struct(file_path, file_path#88, file_name, file_name#89, file_size, file_size#90L, file_modification_time, file_modification_time#91) will run on GPU
      !Exec <FileSourceScanExec> cannot run on GPU because hidden metadata columns are not supported on GPU

I am not sure about row_index. I don't see it in any of the PRs for the Spark issue. It looks like we already do all of the work for this on the GPU, so it might be worth not falling back to the CPU and adding some tests to cover it, but that is a separate issue.

HaoYang670 commented 1 year ago

Thank you for the explanation @revans2.

HaoYang670 commented 1 year ago

Filed #7458 to track the idea in your comment.