DRAFT: Incremental improvements to parquet metadata

alkis commented 1 month ago

Incremental parquet metadata improvements

This is an alternative to https://github.com/apache/parquet-format/pull/242 that can be implemented with minimal changes to parquet readers/writers.

Wide schemata (those with a large number of columns) make FileMetadata very slow to parse. Most of the time is spent parsing thrift list<> fields, in particular heavily nested list<StructType> fields. These are notoriously slow to decode because they are variable sized and involve extra allocations. This proposal avoids such fields as much as possible to speed up decoding. In addition, it allows columns that do not participate in a row group to have their column chunk metadata skipped.
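
To illustrate why variable-sized elements are costly to decode, here is a minimal Python sketch. It is not code from this PR or from any thrift library. It assumes ULEB128-style varints, similar to what the thrift compact protocol uses for integers (zigzag encoding is omitted), and the helper names `nth_varint` and `nth_fixed` are hypothetical. Reaching element n of a varint-encoded list requires decoding every element before it, while a fixed-width layout lets the reader jump straight to a computed offset.

```python
import struct


def encode_varint(n):
    """Encode a non-negative integer as a ULEB128 varint (7 payload bits per byte)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    """Decode one varint starting at pos; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def nth_varint(buf, n):
    """Variable-sized elements: elements 0..n-1 must be decoded to locate element n."""
    pos = 0
    for _ in range(n):
        _, pos = decode_varint(buf, pos)
    value, _ = decode_varint(buf, pos)
    return value


def nth_fixed(buf, n):
    """Fixed-width elements: element n sits at byte offset 8 * n."""
    return struct.unpack_from("<q", buf, 8 * n)[0]


values = [1, 300, 70000, 5, 2**40]
varint_buf = b"".join(encode_varint(v) for v in values)
fixed_buf = struct.pack("<5q", *values)
```

The varint buffer is smaller, but it has no random access: the position of each element depends on the encoded length of all earlier ones. A list<StructType> compounds this, since each struct is itself a variable-length run of fields and typically needs its own allocation.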

Jira

[ ] My PR addresses the following Parquet Jira issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
- https://issues.apache.org/jira/browse/PARQUET-XXX
- In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Commits

[ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Documentation

[ ] In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and classes in the PR contain Javadoc that explains what they do

pitrou commented 1 month ago

> The majority of the time is spent in parsing thrift list<> and binary fields. These are notoriously slow to decode because they are variable sized.

Do you have actual data to support this? For list I could understand (but the problem is really the number of nested elements), but for binary this sounds rather unexpected.

alkis commented 1 month ago

> Do you have actual data to support this? For list I could understand (but the problem is really the number of nested elements), but for binary this sounds rather unexpected.

The slowness actually comes from list<>, and particularly from list<StructType> with nested lists. binary is not as large a problem.

alkis commented 1 month ago

I am retracting this PR in favor of:

https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit

and:

- https://github.com/apache/parquet-format/pull/252
- https://github.com/apache/parquet-format/pull/253
- https://github.com/apache/parquet-format/pull/254
