OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org

Parquet Driver: reading could benefit from row group optimization #8225

Closed migurski closed 1 year ago

migurski commented 1 year ago

Expected behavior and actual behavior.

The GDAL Parquet driver documentation states that "no optimization is currently done" when reading. Individual rows could be retrieved much faster by using row group metadata to avoid complete table scans (a pyarrow sketch of that idea follows the debug output below). I tried retrieving a record from the midpoint of a 507MB example file with both GDAL and AWS Athena (PrestoDB); Athena succeeded after reading only about 96KB of data:

{'DataScannedInBytes': 97710,
 'EngineExecutionTimeInMillis': 667,
 'QueryPlanningTimeInMillis': 82,
 'QueryQueueTimeInMillis': 196,
 'ResultReuseInformation': {'ReusedPreviousResult': False},
 'ServiceProcessingTimeInMillis': 30,
 'TotalExecutionTimeInMillis': 893}

GDAL needed to scan the entire file to return the same single-row response; truncated debug output below:

HTTP: libcurl/7.79.1 (SecureTransport) LibreSSL/3.3.6 zlib/1.2.11 nghttp2/1.45.1
S3: Downloading 0-99999 (https://bucketname.s3.amazonaws.com/US-Census-tl_2021_06_tabblock20.parquet)...
S3: Got response_code=206
S3: Downloading 532000000-532019343 (https://bucketname.s3.amazonaws.com/US-Census-tl_2021_06_tabblock20.parquet)...
S3: Got response_code=206
S3: Downloading 531900000-531999999 (https://bucketname.s3.amazonaws.com/US-Census-tl_2021_06_tabblock20.parquet)...
S3: Got response_code=206
PARQUET: geo = {…}
PARQUET: gdal:schema = {…}
PARQUET: Compression (of first column): snappy
GDAL: GDALOpen(/vsis3/bucketname/US-Census-tl_2021_06_tabblock20.parquet, this=0x109c064f0) succeeds as Parquet.
GDAL: GDALDriver::Create(GeoJSON,/vsistdout,0,0,0,Unknown,0x0)
S3: Downloading 100000-199999 (https://bucketname.s3.amazonaws.com/US-Census-tl_2021_06_tabblock20.parquet)...
S3: Got response_code=206
…
S3: Downloading 522100000-531999999 (https://bucketname.s3.amazonaws.com/US-Census-tl_2021_06_tabblock20.parquet)...
S3: Got response_code=206
GDALVectorTranslate: 1 features written in layer 'US-Census-tl_2021_06_tabblock20'
ARROW: Memory pool: bytes_allocated = 68136704
ARROW: Memory pool: max_memory = 218117632
GDAL: GDALClose(/vsis3/bucketname/US-Census-tl_2021_06_tabblock20.parquet, this=0x109c064f0)
GDAL: GDALClose(/vsistdout, this=0x109d11150)
GDAL: In GDALDestroy - unloading GDAL shared library.

In total approx. 542MB of the 507MB file were downloaded, including repeated ranges.
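For illustration, the kind of pruning Athena performs can be sketched with pyarrow outside of GDAL: the per-row-group min/max statistics on the FID column identify the single row group that can contain the requested FID, so only that group needs to be read. This is a sketch under the assumptions that the file is available locally and that FID statistics were written; it is not the GDAL driver code.

# Sketch only: row-group pruning with pyarrow, not the GDAL Parquet driver code.
import pyarrow.compute as pc
import pyarrow.parquet as pq

TARGET_FID = 259861  # same record as in the ogr2ogr example below

pf = pq.ParquetFile("US-Census-tl_2021_06_tabblock20.parquet")
fid_col = pf.schema_arrow.get_field_index("FID")

for rg in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(rg).column(fid_col).statistics
    # Skip any row group whose FID range cannot contain the target.
    if stats is None or not (stats.min <= TARGET_FID <= stats.max):
        continue
    # Only this row group (~10000 rows) is read, not the whole table.
    table = pf.read_row_group(rg)
    match = table.filter(pc.equal(table["FID"], TARGET_FID))
    print(match.to_pylist())
    break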

Steps to reproduce the problem.

The example file was prepared from U.S. Census data:

$ ogr2ogr -f Parquet -lco FID=FID -lco ROW_GROUP_SIZE=10000 -skipfailures -nlt MultiPolygon \
    /vsis3/bucketname/US-Census-tl_2021_06_tabblock20.parquet \
    /vsizip//vsicurl/https://www2.census.gov/geo/tiger/TIGER2021/TABBLOCK20/tl_2021_06_tabblock20.zip/tl_2021_06_tabblock20.shp
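The same conversion can also be driven from the GDAL Python bindings. A rough equivalent of the ogr2ogr command above (a sketch: the bucket name is a placeholder, and the keyword names assume the standard gdal.VectorTranslateOptions signature):

from osgeo import gdal

gdal.UseExceptions()

# Rough Python equivalent of the ogr2ogr creation command above (sketch).
gdal.VectorTranslate(
    "/vsis3/bucketname/US-Census-tl_2021_06_tabblock20.parquet",
    "/vsizip//vsicurl/https://www2.census.gov/geo/tiger/TIGER2021/TABBLOCK20/"
    "tl_2021_06_tabblock20.zip/tl_2021_06_tabblock20.shp",
    format="Parquet",
    layerCreationOptions=["FID=FID", "ROW_GROUP_SIZE=10000"],
    geometryType="MULTIPOLYGON",   # -nlt MultiPolygon
    skipFailures=True,             # -skipfailures
)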

Reading the remote file with ogr2ogr:

$ CPL_VSIL_CURL_CHUNK_SIZE=100000 CPL_DEBUG=On \
    ogr2ogr -f GeoJSON -preserve_fid -where 'FID=259861' /vsistdout \
    /vsis3/bucketname/US-Census-tl_2021_06_tabblock20.parquet
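The same single-row read can be reproduced through the GDAL Python bindings with an attribute filter, which exercises the same driver path as ogr2ogr (a sketch; the bucket name is a placeholder):

from osgeo import gdal, ogr

gdal.UseExceptions()
gdal.SetConfigOption("CPL_VSIL_CURL_CHUNK_SIZE", "100000")
gdal.SetConfigOption("CPL_DEBUG", "ON")

# Open the remote Parquet file and request a single feature by FID (sketch).
ds = ogr.Open("/vsis3/bucketname/US-Census-tl_2021_06_tabblock20.parquet")
layer = ds.GetLayer(0)
layer.SetAttributeFilter("FID = 259861")

for feature in layer:
    print(feature.GetFID(), feature.GetField("GEOID20"))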

An Athena table was created to access the file after uploading it to S3:

    CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`tl_2021_06_tabblock20_parquet` (
      `FID` bigint,
      `STATEFP20` string,
      `COUNTYFP20` string,
      `TRACTCE20` string,
      `BLOCKCE20` string,
      `GEOID20` string,
      `NAME20` string,
      `MTFCC20` string,
      `UR20` string,
      `UACE20` string,
      `UATYPE20` string,
      `FUNCSTAT20` string,
      `ALAND20` bigint,
      `AWATER20` bigint,
      `INTPTLAT20` string,
      `INTPTLON20` string,
      `geometry` binary
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION 's3://bucketname/'
    TBLPROPERTIES ('classification' = 'parquet')

The SQL query used with Athena:

SELECT * FROM "tl_2021_06_tabblock20_parquet" WHERE FID=259861
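The DataScannedInBytes figures quoted at the top come from the Athena query statistics. A sketch of how to run the query and read those statistics with boto3 (the region and result output location are placeholders):

import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")  # placeholder region

# Start the same single-row query (sketch).
qid = athena.start_query_execution(
    QueryString='SELECT * FROM "tl_2021_06_tabblock20_parquet" WHERE FID=259861',
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://bucketname/athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes.
while True:
    execution = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
    if execution["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Statistics include DataScannedInBytes, as quoted above.
print(execution["Statistics"])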

Operating system

macOS 12.6.3

GDAL version and provenance

GDAL 3.7.1 installed via Homebrew

rouault commented 1 year ago

Implemented in https://github.com/OSGeo/gdal/pull/8258 for Parquet single-file reading.

(Note: the "no optimization is currently done" remark in the documentation is specific to Parquet dataset reading, not Parquet single-file reading, where some optimizations were already in place.)

migurski commented 1 year ago

Exciting, thank you!