apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0

DRAFT: Incremental improvements to parquet metadata #248

Closed alkis closed 1 month ago

alkis commented 1 month ago

Incremental parquet metadata improvements

This is an alternative to https://github.com/apache/parquet-format/pull/242, one that can be implemented with minimal changes to Parquet readers and writers.

Wide schemata (those with a large number of columns) make FileMetaData very slow to parse. Most of the time is spent parsing Thrift list<> fields, in particular heavily nested list<StructType> fields. These are notoriously slow to decode because they are variable-sized and require extra allocations. This proposal avoids such fields as much as possible to speed up decoding. In addition, it allows the column chunk metadata to be skipped for columns that do not participate in a row group.
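
To make the decoding-cost argument concrete, here is a minimal sketch in Python. It is a toy model, not the Thrift wire format and not code from this PR: it uses ULEB128 varints (similar in spirit to what Thrift's compact protocol uses for integers) to encode a list of variable-length rows, and compares that with a flat fixed-width layout. All function names (`encode_nested`, `decode_flat`, and so on) are hypothetical and exist only for illustration.

```python
import struct


def encode_varint(n):
    """Encode a non-negative integer as a ULEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    """Decode one varint starting at buf[pos]; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_nested(rows):
    """Toy list<list<int>>: an outer count, then a count per row."""
    out = bytearray(encode_varint(len(rows)))
    for row in rows:
        out += encode_varint(len(row))
        for value in row:
            out += encode_varint(value)
    return bytes(out)


def decode_nested(buf):
    """Every element must be visited in order, because sizes are only
    known after decoding, and every row needs its own allocation."""
    count, pos = decode_varint(buf, 0)
    rows = []
    for _ in range(count):
        n, pos = decode_varint(buf, pos)
        row = []  # one allocation per nested list
        for _ in range(n):
            value, pos = decode_varint(buf, pos)
            row.append(value)
        rows.append(row)
    return rows


def encode_flat(rows):
    """Fixed-width alternative: every value is a little-endian uint64.
    Assumes all rows have the same, externally known width."""
    flat = [value for row in rows for value in row]
    return struct.pack("<%dQ" % len(flat), *flat)


def decode_flat(buf, width):
    """The element count follows from the buffer length, so the whole
    buffer is unpacked in a single call and then sliced into rows."""
    flat = struct.unpack("<%dQ" % (len(buf) // 8), buf)
    return [list(flat[i:i + width]) for i in range(0, len(flat), width)]
```

Both decoders return the same rows for the same input. The nested decoder has to walk the buffer byte by byte and allocate per row, while the flat decoder knows every offset up front. That is the trade-off the paragraph above describes, though real Thrift decoders and real Parquet footers will behave differently in the details.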

pitrou commented 1 month ago

> The majority of the time is spent in parsing thrift list<> and binary fields. These are notoriously slow to decode because they are variable sized.

Do you have actual data to support this? For list I could understand (but the problem is really the number of nested elements), but for binary this sounds rather unexpected.

alkis commented 1 month ago

> Do you have actual data to support this? For list I could understand (but the problem is really the number of nested elements), but for binary this sounds rather unexpected.

The slowness actually comes from list<>, and particularly from list<StructType> with nested lists. binary is not as large a problem.

alkis commented 1 month ago

I am retracting this PR in favor of:

https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit

and:

https://github.com/apache/parquet-format/pull/252
https://github.com/apache/parquet-format/pull/253
https://github.com/apache/parquet-format/pull/254