facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0

Fix file metadata columns null issue when integrating with Velox using Spark #8173

Open gaoyangxiaozhu opened 10 months ago

gaoyangxiaozhu commented 10 months ago

Description

This issue adds support for reading/selecting file metadata columns in the Parquet scan when a Spark user explicitly references them, for example with a `select _metadata.file_path` statement.

In detail, Spark SQL allows users to query the metadata (file_path, file_name, file_size, etc.) of the input files; see https://issues.apache.org/jira/browse/SPARK-37273 and this PR: https://github.com/apache/spark/commit/62cf4d41b12a3a6d94d011a5b76e66ccaa3fed2a

When Spark detects that the user's SQL explicitly selects the metadata columns, e.g. `select _metadata.file_path`, it computes the constant metadata values and appends them as new columns to the rows returned by `FileScanRDD`; see https://github.com/apache/spark/blob/18db204995b32e87a650f2f09f9bcf047ddafa90/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L204
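To make concrete what these constant columns are: for each input file the values are known up front from the file listing (they do not require reading the file), so conceptually the engine only has to build a small name-to-value map per file. A minimal C++ sketch, with a hypothetical helper name and string-encoded values purely for illustration:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Builds the per-file constant values for the metadata columns named in this
// issue (file_path, file_name, file_size, ...). Illustrative only; not Spark
// or Velox code.
std::unordered_map<std::string, std::string> makeFileMetadataValues(
    const std::string& filePath,
    uint64_t fileSize) {
  const auto slash = filePath.find_last_of('/');
  const std::string fileName =
      slash == std::string::npos ? filePath : filePath.substr(slash + 1);
  return {
      {"file_path", filePath},
      {"file_name", fileName},
      {"file_size", std::to_string(fileSize)},
  };
}
```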

However, when Spark is integrated with Velox, Velox exposes no interface that lets Spark pass/inject those constant metadata columns into the native table scan. As a result, the metadata columns are missing from the Velox TableScan node, and `select _metadata.file_path` always returns null.

This item tracks and fixes the issue by extending `HiveConnectorSplit` with a new parameter, `metadataColumns`, so an upstream compute engine such as Spark can pass the initialized constant metadata columns (if any) to the Velox connector split at construction time. The scan node then has those constant metadata columns available and can include them in its output when requested.
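A minimal sketch of the idea, assuming the new parameter is a simple name-to-value map; the struct below is an illustrative stand-in, not the actual Velox `HiveConnectorSplit` definition:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Illustrative stand-in for HiveConnectorSplit: the only point is the extra
// metadataColumns argument, which carries the constant metadata values the
// engine computed for this file.
struct HiveConnectorSplitSketch {
  std::string filePath;
  uint64_t start;
  uint64_t length;
  // Proposed addition: constant metadata column values for this split, e.g.
  // {"file_path": "/warehouse/t/part-0.parquet", "file_size": "1048576"}.
  std::unordered_map<std::string, std::string> metadataColumns;

  HiveConnectorSplitSketch(
      std::string filePath_,
      uint64_t start_,
      uint64_t length_,
      std::unordered_map<std::string, std::string> metadataColumns_ = {})
      : filePath(std::move(filePath_)),
        start(start_),
        length(length_),
        metadataColumns(std::move(metadataColumns_)) {}
};
```

With the values attached to the split, the Hive data source can produce a constant column for any requested metadata field instead of leaving it null, which is the behavior this issue describes.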

We have already verified the change end to end with Gluten; the related Gluten PR is linked here. Once this PR is ready, we will merge the Gluten PR so that Gluten supports file metadata columns.


aditi-pandit commented 5 months ago

@gaoyangxiaozhu: It seems like the Spark integration piece is the only work left here. Are you working on it?