facebookincubator / velox

A C++ vectorized database acceleration library aimed at optimizing query engines and data processing systems.
https://velox-lib.io/
Apache License 2.0

Support "parquet_use_column_names" = false in Velox #10388

Open yingsu00 opened 3 months ago

yingsu00 commented 3 months ago

Description

Currently, the Velox HiveDataSource matches the column name from the file (fileType) against the requested schema name. These two names can differ. For example, the Presto Iceberg writer changes a space to "_x20". To handle this, the Presto Parquet reader has a session property, "parquet_use_column_names", which defaults to false. When it is set to false, the hiveColumnIndex in HiveColumnHandle (or IcebergColumnHandle's columnIdentity.id for Iceberg) is used to map the schema column name to the actual column name in the file. Velox doesn't support this option and can only match by name, so a column whose names differ is treated as a NULL constant.

To solve this, we need to send the hiveColumnIndex and columnIdentity.id from Presto to Velox first (https://github.com/prestodb/presto/issues/23130). Then Velox can use these IDs to match the columns instead of just using names.
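
To illustrate the difference between the two matching modes, here is a minimal sketch. It is not Velox or Presto code: `RequestedColumn`, `resolveColumn`, and the `fileIndex` field are hypothetical names standing in for a column handle that carries something like hiveColumnIndex, and the "first name" / "first_x20name" pair is just an assumed example of the renaming described above.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a column handle (not Velox's
// HiveColumnHandle). fileIndex plays the role of hiveColumnIndex.
struct RequestedColumn {
  std::string name; // Column name in the table schema.
  int fileIndex;    // Position of the column in the file.
};

// Returns the position of the requested column among the file's top-level
// columns, or std::nullopt if it cannot be resolved. An unresolved column
// corresponds to the "NULL constant" case described in the issue.
std::optional<std::size_t> resolveColumn(
    const RequestedColumn& column,
    const std::vector<std::string>& fileColumnNames,
    bool useColumnNames) {
  if (useColumnNames) {
    // Name-based matching: fails when the writer has altered the name.
    for (std::size_t i = 0; i < fileColumnNames.size(); ++i) {
      if (fileColumnNames[i] == column.name) {
        return i;
      }
    }
    return std::nullopt;
  }
  // Index-based matching: ignores the name stored in the file.
  if (column.fileIndex >= 0 &&
      static_cast<std::size_t>(column.fileIndex) < fileColumnNames.size()) {
    return static_cast<std::size_t>(column.fileIndex);
  }
  return std::nullopt;
}
```

With a file whose columns are `{"id", "first_x20name"}` and a requested column `{"first name", 1}`, the name-based mode finds nothing, while the index-based mode resolves to position 1.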

Yuhta commented 3 months ago

In Velox we use a slightly different approach. We receive the full table schema from Presto (already done), and if parquet_use_column_names is set to false, we replace the names in fileType with the names from the table schema. This way we don't need custom matching. Code example: https://github.com/facebookincubator/velox/blob/63ccecaa2812c70aa7b2cda7a3a4b43abe0af223/velox/dwio/dwrf/reader/DwrfReader.cpp#L795-L804
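
A rough sketch of the renaming idea, for readers who don't want to follow the link. This is not the code at that URL: `Field`, `RowSchema`, and `renameByPosition` are hypothetical names, the schema is flattened to top-level columns only, and positional alignment between the file and the table schema is an assumption of the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, flattened schema representation (top-level columns only).
struct Field {
  std::string name;
  std::string type;
};
using RowSchema = std::vector<Field>;

// Returns a copy of fileType in which each column name is replaced by the
// name at the same position in tableSchema. Types are left untouched, and
// columns beyond the end of tableSchema keep their file names.
RowSchema renameByPosition(
    const RowSchema& fileType,
    const RowSchema& tableSchema) {
  RowSchema result = fileType;
  const std::size_t common = std::min(fileType.size(), tableSchema.size());
  for (std::size_t i = 0; i < common; ++i) {
    result[i].name = tableSchema[i].name;
  }
  return result;
}
```

After the rename, the existing name-based matching can be reused unchanged, because the file schema now carries the table's names. A real implementation would also have to handle nested types, which this sketch leaves out.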

yingsu00 commented 3 months ago

cc @nmahadevuni @agrawalreetika