Description
This change adds support for reading file metadata columns in a Parquet scan when a Spark user explicitly references them, for example with a `select _metadata.file_path` statement.

In detail, Spark SQL allows users to query the metadata (file_path, file_name, file_size, etc.) of the input files; see https://issues.apache.org/jira/browse/SPARK-37273 and the corresponding Spark change https://github.com/apache/spark/commit/62cf4d41b12a3a6d94d011a5b76e66ccaa3fed2a

When Spark detects that a query explicitly selects metadata columns, as in `select _metadata.file_path`, it gets the constant metadata values and inserts them as new columns in the rows returned by FileScanRDD; see https://github.com/apache/spark/blob/18db204995b32e87a650f2f09f9bcf047ddafa90/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L204

However, when Spark is integrated with Velox, Velox has no interface that lets Spark pass or inject these constant metadata columns into the native table scan. The metadata columns are therefore missing from the Velox table scan node, so `select _metadata.file_path` always returns null.

This PR tracks and fixes the issue by extending `HiveConnectorSplit` with a new parameter, `metadataColumns`, which lets an upstream compute engine such as Spark pass the initialized constant metadata columns (if any) to the Velox connector split when it is constructed. The scan node then has those constant metadata columns available and can produce them in its output when they are requested.

We have already verified the change end to end with Gluten in a related Gluten PR. Once this PR is ready, we will merge the Gluten PR so that Gluten supports file metadata columns.
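The idea can be illustrated with a minimal, self-contained sketch. It is not the Velox implementation: `SplitSketch`, `ColumnSketch`, and `buildMetadataOutput` are hypothetical names invented for this example, and the metadata values are assumed to be plain strings keyed by column name. It only shows how a split that carries constant per-file values lets the scan fill the requested metadata columns, and why a column that was never injected comes back as null.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a connector split; NOT the real Velox class.
struct SplitSketch {
  std::string filePath;
  // Constant per-file metadata values supplied by the upstream engine,
  // keyed by column name (for example "file_path" or "file_size").
  std::unordered_map<std::string, std::string> metadataColumns;
};

// One output column: a name plus one optional value per row (nullopt = NULL).
struct ColumnSketch {
  std::string name;
  std::vector<std::optional<std::string>> values;
};

// Builds the requested metadata columns for `numRows` rows read from `split`.
// A requested column that the split does not carry is filled with NULLs,
// which mirrors the behavior described above when nothing is injected.
std::vector<ColumnSketch> buildMetadataOutput(
    const SplitSketch& split,
    const std::vector<std::string>& requested,
    std::size_t numRows) {
  std::vector<ColumnSketch> output;
  output.reserve(requested.size());
  for (const auto& name : requested) {
    ColumnSketch column{name, {}};
    auto it = split.metadataColumns.find(name);
    if (it != split.metadataColumns.end()) {
      // The value is constant for the whole file, so repeat it for every row.
      column.values.assign(numRows, it->second);
    } else {
      column.values.assign(numRows, std::nullopt);
    }
    output.push_back(std::move(column));
  }
  return output;
}
```

In this sketch, a split built with only a `file_path` entry yields a fully populated `file_path` column and an all-null `file_size` column, which is the difference between the fixed and the unfixed behavior for a query such as `select _metadata.file_path`.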