Open yingsu00 opened 3 months ago
In Velox we use a slightly different approach. We receive the full table schema from Presto (already done), and if parquet_use_column_names is set to false, we replace the names in fileType with the names in the table schema. This way we don't need to do custom matching. Code example: https://github.com/facebookincubator/velox/blob/63ccecaa2812c70aa7b2cda7a3a4b43abe0af223/velox/dwio/dwrf/reader/DwrfReader.cpp#L795-L804
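To illustrate the renaming idea in isolation, here is a minimal sketch. It is not the linked DwrfReader code: SimpleRowType and alignFileTypeNames are hypothetical names, only top-level columns are handled, and the sketch assumes that file columns and table schema columns line up by position.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a flat row type: top-level column names only.
// A real row type also carries child types; this sketch ignores them.
struct SimpleRowType {
  std::vector<std::string> names;
};

// Returns a copy of fileType. When useColumnNames is false, the i-th file
// column takes the name of the i-th table schema column (assumption: the two
// are aligned by position). File columns past the end of the table schema
// keep their original names.
SimpleRowType alignFileTypeNames(
    const SimpleRowType& fileType,
    const SimpleRowType& tableSchema,
    bool useColumnNames) {
  if (useColumnNames) {
    return fileType;  // Name-based matching: leave the file names untouched.
  }
  SimpleRowType result = fileType;
  const std::size_t count =
      std::min(result.names.size(), tableSchema.names.size());
  for (std::size_t i = 0; i < count; ++i) {
    result.names[i] = tableSchema.names[i];
  }
  return result;
}
```

With a file column named "my_x20col" and a table schema column named "my col" in the same position, the returned type would carry "my col", so the existing name-based lookup would succeed without any custom matching.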
cc @nmahadevuni @agrawalreetika
Description
Currently, the Velox HiveDataSource matches the column names in the file (fileType) against the column names in the requested schema. These two names can differ. For example, the Presto Iceberg writer replaces a space with "_x20". To handle this, the Presto Parquet reader has a session property, "parquet_use_column_names", which defaults to false. When it is set to false, the hiveColumnIndex in HiveColumnHandle (or columnIdentity.id in IcebergColumnHandle for Iceberg) is used to map the schema column name to the actual column name in the file. Velox doesn't support this option and can only match by name, so a column whose names differ is treated as a NULL constant.
To solve this, we first need to send the hiveColumnIndex and columnIdentity.id from Presto to Velox (https://github.com/prestodb/presto/issues/23130). Velox can then use these IDs to match the columns instead of relying on names alone.
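A rough sketch of what ID-based matching could look like, assuming each file column exposes an integer ID comparable to the one sent from Presto. FileColumn, RequestedColumn and matchColumnsById are hypothetical names for illustration, not existing Velox APIs.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical: a column as found in the file, with an integer ID.
struct FileColumn {
  std::string name;
  std::int32_t id;
};

// Hypothetical: a column requested by the query, carrying the ID sent from
// Presto (hiveColumnIndex or columnIdentity.id, treated here as opaque).
struct RequestedColumn {
  std::string name;
  std::int32_t id;
};

// For each requested column, returns the index of the file column with the
// same ID, or std::nullopt when the file has no such column (the caller
// would then produce a NULL constant). Names are never compared.
std::vector<std::optional<std::size_t>> matchColumnsById(
    const std::vector<RequestedColumn>& requested,
    const std::vector<FileColumn>& fileColumns) {
  std::unordered_map<std::int32_t, std::size_t> idToIndex;
  for (std::size_t i = 0; i < fileColumns.size(); ++i) {
    idToIndex.emplace(fileColumns[i].id, i);  // First occurrence wins.
  }
  std::vector<std::optional<std::size_t>> result;
  result.reserve(requested.size());
  for (const auto& column : requested) {
    const auto it = idToIndex.find(column.id);
    if (it == idToIndex.end()) {
      result.push_back(std::nullopt);
    } else {
      result.push_back(it->second);
    }
  }
  return result;
}
```

In this sketch a requested column named "my col" with ID 1 would resolve to a file column named "my_x20col" with ID 1, because only the IDs are compared.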