[Closed] alkis closed this 1 month ago
> The majority of the time is spent in parsing thrift `list<>` and `binary` fields. These are notoriously slow to decode because they are variable sized.
Do you have actual data to support this? For `list` I could understand (but the problem is really the number of nested elements), but for `binary` this sounds rather unexpected.
The slowness is actually `list<>`, and particularly `list<StructType>` with nested lists. `binary` is not as large a problem.
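The cost discussed above comes from length-prefixed encoding. A minimal Python sketch (not the real thrift library; the function names are illustrative) of compact-protocol-style list decoding shows why: every element's varint length must be decoded before the next element can even be located, so parsing is inherently sequential and allocates per element.

```python
def read_varint(buf, pos):
    """Decode one LEB128-style varint; returns (value, new_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

def read_binary_list(buf, pos):
    """Decode a list of varint-length-prefixed byte strings."""
    count, pos = read_varint(buf, pos)
    items = []
    for _ in range(count):
        # Sequential dependency: each element's length gates the next read,
        # and each element costs a fresh allocation.
        n, pos = read_varint(buf, pos)
        items.append(buf[pos:pos + n])
        pos += n
    return items, pos

# Encoding of [b"ab", b"c"]: count=2, then (len=2, "ab"), (len=1, "c").
payload = bytes([2, 2]) + b"ab" + bytes([1]) + b"c"
items, _ = read_binary_list(payload, 0)
```

With `list<StructType>` the same pattern nests: each struct element is itself a variable-sized sequence of fields, so the decoder cannot skip ahead at any level.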
Incremental parquet metadata improvements
This is an alternative proposal to https://github.com/apache/parquet-format/pull/242 that can be implemented with minimal changes to parquet readers/writers.
Wide schemata (a large number of columns) make `FileMetadata` very slow to parse. The majority of the time is spent parsing thrift `list<>` and, in particular, heavily nested `list<StructType>` fields. These are notoriously slow to decode because they are variable sized and involve extra allocations. In this proposal we avoid such fields as much as possible to improve decoding. In addition, we allow columns that do not participate in a row group to have their column chunk metadata skipped.

Jira
Commits
Documentation
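To illustrate the skipping idea, here is a hypothetical reader-side sketch; `decode_row_group` and `parse_chunk` are made-up names, not the actual parquet-format Thrift API. If a column that does not participate in a row group may omit its column chunk metadata, a reader only pays decode cost for the columns it projects and that are actually present.

```python
def parse_chunk(blob):
    """Stand-in for Thrift ColumnMetaData decoding; returns a dummy value."""
    return len(blob)

def decode_row_group(chunks, projection):
    """chunks maps column name -> serialized chunk metadata, or None when
    the column does not participate in this row group and its metadata
    was skipped. Only projected, present columns are decoded."""
    decoded = {}
    for name in projection:
        blob = chunks.get(name)
        if blob is None:
            continue  # metadata skipped: nothing to parse for this column
        decoded[name] = parse_chunk(blob)
    return decoded

chunks = {"a": b"xx", "b": None, "c": b"yyy"}
result = decode_row_group(chunks, ["a", "b"])
```

The design choice here is that skipping is per row group, so sparse wide tables no longer pay metadata cost proportional to `columns x row_groups`.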