apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0

DRAFT: Parquet 3 metadata with decoupled column metadata #242

Open pitrou opened 1 month ago

pitrou commented 1 month ago

Parquet 3 metadata proposal

This is a very rough attempt at solving the problem of FileMetadata footprint and decoding cost, especially for Parquet files with many columns (think tens of thousands columns).

Context

This is in the context of the broader "Parquet v3" discussion on the mailing-list. A number of possible far-reaching changes are being collected in a document.

It is highly recommended that you read at least that document before commenting on this PR.

Specifically, some users would like to use Parquet files for data with tens of thousands of columns, and potentially hundreds or thousands of row groups. Reading the file-level metadata for such a file is prohibitively expensive given the current file structure where all column-level metadata is eagerly decoded as part of file-level metadata.
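As a rough back-of-the-envelope illustration (the per-chunk figure is an assumption, not a measurement): with 10,000 columns, 1,000 row groups and on the order of 100 bytes of Thrift per ColumnChunk, the footer carries roughly 10,000 × 1,000 × 100 B ≈ 1 GB of metadata that must be fetched and decoded before a single value can be read.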

Contents

This PR includes several changes:

  1. a new "Parquet 3" file structure with backwards compatibility with legacy readers
  2. new Thrift structures allowing file-level metadata and column metadata to be decoded separately, so that file metadata is now O(n_columns + n_row_groups) instead of O(n_columns * n_row_groups) (a rough sketch follows below)
  3. removal of outdated, redundant or undesirable fields from the new structures
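
For illustration only, here is a rough sketch of the decoupling idea in Thrift. The field names and numbering below are hypothetical; the authoritative definitions are in this PR's diff against parquet.thrift:

```thrift
/**
 * Hypothetical sketch of the decoupling idea only; the authoritative
 * structures are the ones defined in this PR's parquet.thrift diff.
 * Column chunk metadata is serialized in separate blocks that a reader
 * fetches and decodes on demand, so the footer itself stays
 * O(n_columns + n_row_groups).
 */
struct RowGroupV3 {
  /** Number of rows in this row group. */
  1: required i64 num_rows
  /** Total byte size of the column data in this row group. */
  2: required i64 total_byte_size
  /** Offset of the separately serialized column chunk metadata block. */
  3: required i64 column_chunks_offset
  /** Length in bytes of that block. */
  4: required i32 column_chunks_length
}

struct FileMetadataV3 {
  /** Schema: one entry per column, O(n_columns). */
  1: required list<SchemaElementV3> schema
  /** Total number of rows in the file. */
  2: required i64 num_rows
  /** One fixed-size entry per row group, O(n_row_groups). */
  3: required list<RowGroupV3> row_groups
}
```

A reader can then materialize column chunk metadata only for the row groups and columns a query actually touches.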


tustvold commented 1 month ago

FWIW I'd be very interested to see how far we can push the current data structures with approaches like https://github.com/apache/arrow-rs/issues/5775, before reaching for format changes.

I'd also observe that the column statistics can already be stored separately from FileMetadata, and if you do so you're really only left with a couple of integers... The schema strikes me as a bigger potential bottleneck, but also one that I can't help feeling is unavoidable...

pitrou commented 1 month ago

FWIW I'd be very interested to see how far we can push the current data structures with approaches like apache/arrow-rs#5775, before reaching for format changes.

At first sight this would be a Rust-specific optimization. Also, while such improvements are good in themselves, they don't address the fundamental issue that file metadata size is currently O(n_row_groups * n_columns).

I'd also observe that the column statistics can already be stored separately from FileMetadata, and if you do so you're really only left with a couple of integers...

The main change in this PR is that a RowGroupV3 structure is O(1), instead of O(n_columns) for a RowGroup. The rest are assorted improvements.

tustvold commented 1 month ago

they don't address the fundamental issue that file metadata size is currently O(n_row_groups * n_columns).

Is it not still - https://github.com/apache/parquet-format/pull/242/files#diff-834c5a8d91719350b20995ad99d1cb6d8d68332b9ac35694f40e375bdb2d3e7cR1337

Edit: Oh I see you lose the per row-groupness. Although there is nothing to prevent having one big row group...

pitrou commented 1 month ago

Hmm. If SchemaElementV3 is an issue, we might further decouple things I suppose. Though I'm not sure how one would look up columns by name without decoding all the schema elements.

Writing one big row group is of course possible, but it probably comes with its own problems (such as RAM consumption in the writer?).

tustvold commented 1 month ago

At first sight this would be a Rust-specific optimization

The same optimisation could be done in C++; borrows are just pointers with compiler-enforced lifetimes. But I accept it might be harder to achieve something similar in managed languages like Java without at least some cost.

pitrou commented 1 month ago

The same optimisation could be done in C++, borrows are just pointers with compiler enforced lifetimes

This assumes the Thrift C++ APIs allow this.

pitrou commented 1 month ago

I added a "PAR3 without legacy metadata" variation for the distant future.

kiszk commented 1 month ago

This is not directly related to a new structure. However, it would be a good opportunity to explicitly declare the endianness of data and meta-data.

corwinjoy commented 1 month ago

@pitrou In conjunction with this change, if we want improved random access for row groups and columns I think this would also be a good time to upgrade the OffsetIndex / ColumnIndex in two key ways:

  1. Have OffsetIndex be stored in a random access way rather than using a list so that an individual page chunk can be loaded without needing to read the entire OffsetIndex array.
  2. Have OffsetIndex explicitly include the dictionary page in addition to any data pages so that column data can be directly loaded from the OffsetIndex without needing to get all offsets from the metadata.

I think this would make the ColumnIndex a lot more powerful as it could then be used for projection pushdown in a much faster way without the large overhead it has now.
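
Purely as an illustration of the shape this could take (not part of this PR; the struct and field names below are hypothetical), one option is to keep only a fixed-width locator in Thrift and point it at a packed array of page locations, so that the entry for page i can be fetched at a known offset without decoding the whole list:

```thrift
/**
 * Hypothetical sketch, not part of this PR: an offset index whose data page
 * entries are stored as a packed, fixed-width array so a single entry can be
 * read at page_locations_offset + i * entry_width without decoding the rest.
 */
struct RandomAccessOffsetIndex {
  /** Location of the dictionary page, if any, listed explicitly. */
  1: optional PageLocation dictionary_page_location
  /** File offset of the packed array of data page locations. */
  2: required i64 page_locations_offset
  /** Number of data pages in the column chunk. */
  3: required i32 page_count
  /** Size in bytes of each fixed-width entry in the packed array. */
  4: required i32 entry_width
}
```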

emkornfield commented 1 month ago

@pitrou In conjunction with this change, if we want improved random access for row groups and columns I think this would also be a good time to upgrade the OffsetIndex / ColumnIndex in two key ways:

  1. Have OffsetIndex be stored in a random access way rather than using a list so that an individual page chunk can be loaded without needing to read the entire OffsetIndex array.
  2. Have OffsetIndex explicitly include the dictionary page in addition to any data pages so that column data can be directly loaded from the OffsetIndex without needing to get all offsets from the metadata.

I think this would make the ColumnIndex a lot more powerful as it could then be used for projection pushdown in a much faster way without the large overhead it has now.

@corwinjoy IMO these are reasonable suggestions, but I think they can be handled as a follow-up once we align on design principles here. In general, for dictionaries (and other "auxiliary" metadata) we should maybe consider this more holistically, looking at how pages can be linked effectively.

tustvold commented 1 month ago

Perhaps we could articulate the concrete use-cases we want to support with this? I understand that there is a desire to support extremely wide schemas of say 10,000 columns, but the precise nature of these columns eludes me?

The reason I ask is that if we stick with a standard page size of 1 MB, then a 10,000-column table with an even distribution across the columns is unlikely to ever need multiple row groups - it will be 10 GB with just a single row group. This seems at odds with the stated motivation of this PR to avoid scaling per row group, which makes me think I am missing something.

Perhaps the use-case involves much smaller column chunks than normal, which would imply small pages, which might require changes beyond metadata if we want to support effectively? But at the same time I struggle to see why you would want to do this?

As an aside, I did some toy benchmarking of parquet-rs and confirmed that using Thrift is perfectly fine and can perform on par with flatbuffers - https://github.com/apache/arrow-rs/issues/5770#issuecomment-2116370344. It's a toy benchmark and should therefore be taken with a big grain of salt, but it at least suggests 1 µs per column chunk is feasible.
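For scale (illustrative numbers only, extrapolating from that toy figure): at 1 µs per column chunk, a file with 10,000 columns and 100 row groups has 10^6 column chunks, i.e. roughly a second of metadata decoding, regardless of how few columns the query actually touches.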

emkornfield commented 1 month ago

Perhaps we could articulate the concrete use-cases we want to support with this? I understand that there is a desire to support extremely wide schemas of say 10,000 columns, but the precise nature of these columns eludes me?

At least in the datasets I've seen, there are a small number of rows, at least after filtering (i.e. more columns than rows).

julienledem commented 1 month ago

Thank you Antoine. On the mailing list, Micah is collecting feedback in a document: https://lists.apache.org/thread/61z98xgq2f76jxfjgn5xfq1jhxwm3jwf

Would you mind putting your feedback there? We should collect the goals before jumping to solutions. It is a bit difficult to discuss goals directly in the thrift metadata.

pitrou commented 4 weeks ago

@alkis @JFinis and others, just a quick note that you've convinced me that this proposal is suboptimal for footer access latency, and something better is required. I hope we can design something that's reasonably clear and easy to implement.

JFinis commented 3 weeks ago

Interesting, so this feature is basically used to create a file comparable to an Iceberg manifest. I see that it can be used for that.

Design-wise, I'm not the biggest fan of special-casing this through an extra field instead of just storing a Parquet file that has all the information in normal Parquet columns (like a DeltaLake checkpoint Parquet file), but the design is the way it is. I do see that this field can be used this way, so I guess there is a valid use case for it and it probably needs to be maintained for backward compatibility.

Cheers, Jan

On Thu, June 6, 2024 at 16:08, Rok Mihevc <@.***> wrote:

@.**** commented on this pull request.

In src/main/thrift/parquet.thrift https://github.com/apache/parquet-format/pull/242#discussion_r1629624783 :

@@ -885,6 +971,44 @@ struct ColumnChunk { 9: optional binary encrypted_column_metadata }

+struct ColumnChunkV3 {
+  /** File where column data is stored. */
+  1: optional string file_path

PyArrow provides write_metadata https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_metadata.html and parquet_dataset https://arrow.apache.org/docs/python/generated/pyarrow.dataset.parquet_dataset.html#pyarrow.dataset.parquet_dataset for such a use case.
