Closed by migurski 1 year ago
implemented in https://github.com/OSGeo/gdal/pull/8258, for Parquet single-file reading
(Note: the "no optimization is currently done" statement in the doc is specific to Parquet Dataset reading, not Parquet single-file reading, where there were already some optimizations.)
Exciting, thank you!
Expected behavior and actual behavior.
In the GDAL Parquet driver "no optimization is currently done" when reading. A way to retrieve individual rows more quickly could use row group metadata to reduce complete table scans. I tried retrieving a record from the midpoint of a 507MB example file with GDAL and AWS Athena (PrestoDB) and found that Athena succeeded with only 96KB of data read:
GDAL needed to scan the entire file for the sample single-row response. Truncated output below:
In total, approx. 542MB of the 507MB file was downloaded, including repeated ranges.
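To illustrate the row group idea mentioned above, here is a minimal sketch of min/max pruning in plain Python. It is not GDAL's or Athena's implementation: `RowGroupStats`, `prune_row_groups`, the identifier values, and the byte sizes are all hypothetical, made up only to show how per-row-group statistics could let a reader skip most of a file.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for the per-column statistics of one row group."""
    index: int
    min_value: str
    max_value: str
    num_bytes: int


def prune_row_groups(groups, target):
    """Keep only the row groups whose [min, max] range may contain target."""
    return [g for g in groups if g.min_value <= target <= g.max_value]


# Made-up statistics for an identifier column (not taken from the example file).
groups = [
    RowGroupStats(0, "01001", "17031", 120_000),
    RowGroupStats(1, "17033", "36061", 118_000),
    RowGroupStats(2, "36063", "56045", 121_000),
]

# Only row groups that might hold the requested value need to be fetched.
candidates = prune_row_groups(groups, "29510")
bytes_to_read = sum(g.num_bytes for g in candidates)
total_bytes = sum(g.num_bytes for g in groups)

print(f"read {bytes_to_read} of {total_bytes} bytes "
      f"from row groups {[g.index for g in candidates]}")
```

This kind of pruning only helps when the filtered column is sorted or clustered across row groups. If every row group spans the full value range, all of them remain candidates and the whole file still has to be read.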
Steps to reproduce the problem.
The example file prepared from U.S. Census data:
Reading the remote file with ogr2ogr:
An Athena table created to access the file after it’s uploaded to S3:
The SQL query used with Athena:
Operating system
Mac OSX 12.6.3
GDAL version and provenance
GDAL 3.7.1 installed via Homebrew