Currently we take the scan schema from the plan node, serialize it back to a Parquet schema, and then parse that on the native side. This round trip is lossy, particularly for timestamps. For example:
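The schemas below are a hypothetical illustration of the kind of loss involved (the actual dumps are not reproduced here). Spark's `TimestampType` always serializes back as `TIMESTAMP(MICROS,true)`, so the original unit and UTC-adjustment flag can be lost:

```
# original Parquet footer (hypothetical)
message schema {
  optional int64 ts (TIMESTAMP(NANOS,false));
}

# after round-tripping through the Spark scan schema (hypothetical)
message schema {
  optional int64 ts (TIMESTAMP(MICROS,true));
}
```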
The former is the original Parquet footer; the latter is what we get back after the round trip through Spark. We need the original metadata to handle timestamps correctly in `ParquetExec`.
This PR extracts some code from elsewhere (`CometParquetFileFormat`, `CometNativeScanExec`) to read the footer from the Parquet file and serialize the original metadata. We also now generate the projection vector on the Spark side, because the required columns are expressed in terms of the Spark schema and so do not match the Parquet schema 1:1. On the native side, we then regenerate the required schema from the Parquet schema using the projection vector, converted to a DataFusion `ProjectionMask` (see the sketch below).
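A minimal sketch of the native-side step, assuming the serialized footer has already been decoded into a parquet-rs `SchemaDescriptor` and that the projection vector holds root-column indices into that schema; the helper name `required_schema` is illustrative, not the PR's actual API:

```rust
use arrow_schema::Schema;
use parquet::arrow::{parquet_to_arrow_schema_by_columns, ProjectionMask};
use parquet::errors::Result;
use parquet::schema::types::SchemaDescriptor;

/// Rebuild the required (projected) Arrow schema from the original Parquet
/// schema plus the projection vector computed on the Spark side.
/// (`required_schema` is a hypothetical name used for illustration.)
fn required_schema(
    parquet_schema: &SchemaDescriptor,
    projection: &[usize],
) -> Result<Schema> {
    // Turn the plain index vector into a ProjectionMask over the schema's
    // root (top-level) columns.
    let mask = ProjectionMask::roots(parquet_schema, projection.iter().copied());
    // Derive the Arrow schema for only the projected columns. Because this
    // starts from the original footer, timestamp units and the UTC-adjustment
    // flag survive intact instead of being inferred from the Spark schema.
    parquet_to_arrow_schema_by_columns(parquet_schema, mask, None)
}
```

This keeps the index space consistent: the projection vector is produced against the Parquet schema on the JVM side and consumed against that same schema natively.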