Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0

PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering #197

Closed emkornfield closed 10 months ago

wgtmac commented 1 year ago

@gszadovszky @shangxinli Do you have time to take a look?

mapleFU commented 1 year ago

Seems this patch doesn't need to consider backward-compatibility rules?

emkornfield commented 1 year ago

Seems this patch doesn't need to consider backward-compatibility rules?

Since we are storing rep/def levels directly I don't think so since those rules are applied on top of this data to make the correct inferences.
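
To make the inference point concrete, here is a minimal sketch; the helper and its name are hypothetical, not part of this PR:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bin i of a raw definition-level histogram counts values written at
// definition level i. Only entries at max_def carry an actual leaf value;
// everything below is null at some level of the path (or an empty list).
// The existing compatibility rules are applied on top of these raw counts,
// so storing the levels directly does not interact with those rules.
int64_t MissingLeafCountFromDefHistogram(const std::vector<int64_t>& def_hist) {
  int64_t missing = 0;
  for (std::size_t i = 0; i + 1 < def_hist.size(); ++i) {
    missing += def_hist[i];  // levels 0 .. max_def-1
  }
  return missing;
}
```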

emkornfield commented 1 year ago

Do we want to include these statistics at both row group (column chunk) and page level? For the latter I am not sure it is the right approach. We implemented column indexes so one would not need to read the page header to get the related statistics. We even stopped writing Statistics into page headers in parquet-mr. If we only want these for the column chunk level then I would suggest having it under ColumnMetaData directly.

@gszadovszky Is there an argument against flexibility here? I believe parquet-cpp still writes page headers. One argument for page headers is that they allow readers better incremental estimates of memory needed as they progress (although it is possible that taking an average size per cell at the column chunk level is sufficient here). In other words, these stats accomplish a few things:

1. Finer-grained pushdown for nulls (mostly valuable at the column chunk level, and possibly worth adding to the page index).
2. Estimating memory usage for in-memory reconstruction, to avoid reallocations during scans.
3. Query planning, to get better estimates for things like broadcast joins.
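
As a minimal sketch of use case 2, assuming per-chunk fields along the lines proposed in this PR (all names here are illustrative, not the final thrift spelling):

```cpp
#include <cstdint>

// Illustrative per-chunk size statistics.
struct ChunkSizeStats {
  int64_t unencoded_byte_array_data_bytes;  // variable-width payload bytes
  int64_t num_values;                       // non-null leaf values
  int64_t num_rows;
};

// Rough decoded size for a BYTE_ARRAY chunk materialized to an Arrow-style
// layout: payload bytes, plus one 4-byte offset per value (+1), plus a
// validity bitmap. A reader can pre-allocate from this before scanning.
int64_t EstimateDecodedBytes(const ChunkSizeStats& s) {
  return s.unencoded_byte_array_data_bytes +
         (s.num_values + 1) * static_cast<int64_t>(sizeof(int32_t)) +
         (s.num_rows + 7) / 8;
}
```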

gszadovszky commented 1 year ago

@emkornfield, I am not against re-adding such statistics to page headers if it has benefits. Meanwhile, I would only expect filling in values that really make sense and are not redundant at the related level (page or column chunk). If a value makes sense for filtering, then it should be added to the column index instead of the page header. From a specification point of view it would not be strictly required to describe the actual use cases, but it would be nice to do so somewhere. WDYT?

emkornfield commented 1 year ago

From a specification point of view it would not be strictly required to describe the actual use cases, but it would be nice to do so somewhere. WDYT?

This sounds reasonable. Let me refactor this a little so we can add it to the column index as well, with a bit more description of each of the use cases.

emkornfield commented 1 year ago

@gszadovszky updated the PR based on your feedback, let me know what you think.

emkornfield commented 1 year ago

What do you think about adding SizeEstimationStatistics directly to ColumnMetaData and the histograms to the ColumnIndex?

This sounds fine to me. We can always add them to page headers separately if people want them. @wgtmac any concerns with this?

wgtmac commented 1 year ago

What do you think about adding SizeEstimationStatistics directly to ColumnMetaData and the histograms to the ColumnIndex?

This sounds fine to me. We can always add them to page headers separately if people want them. @wgtmac any concerns with this?

I am good with this approach.

emkornfield commented 1 year ago

@gszadovszky do we need a vote on adding this?

gszadovszky commented 1 year ago

@gszadovszky do we need a vote on adding this?

@emkornfield, I think this PR was reviewed quite actively. Also, it adds optional statistics that an implementation might use or not. I do not feel this would require a formal vote. Maybe put up the PR on the dev list with a time limit so anyone would have the chance to comment. From my side, it is good to go.

wgtmac commented 1 year ago

BTW, do you have a plan to implement this in the parquet-cpp once merged? Just curious. @emkornfield

emkornfield commented 1 year ago

BTW, do you have a plan to implement this in the parquet-cpp once merged? Just curious. @emkornfield

@wgtmac I would like to get to it but probably won't have bandwidth at least this month; is this something you have time to take on?

wgtmac commented 1 year ago

@wgtmac I would like to get to it but probably won't have bandwidth at least this month; is this something you have time to take on?

Should we release format v2.10.0 before implementation? It has been years since the last release.

emkornfield commented 1 year ago

Should we release format v2.10.0 before implementation? It has been years since the last release.

A new release probably makes sense; I'm not clear on the general ordering here. It would be great to document general rules around format changes, releases, implementations, etc.

gszadovszky commented 1 year ago

@emkornfield, I agree format-related guidelines are not properly documented, or are mixed a bit with parquet-mr guidelines. We usually expect implementations of new format features before the release to prove out the related change. However, I am not sure the current unreleased changes need such proof if they make sense as is.

emkornfield commented 1 year ago

We usually expect implementations of new format features before the release to prove out the related change. However, I am not sure the current unreleased changes need such proof if they make sense as is.

Thanks @gszadovszky. I think maybe we should do a release before this change goes in, as it seems the most likely to run into implementation challenges. @wgtmac thoughts?

wgtmac commented 1 year ago

Thanks @gszadovszky. I think maybe we should do a release before this change goes in, as it seems the most likely to run into implementation challenges. @wgtmac thoughts?

That sounds good. Then what should we do once the PoC implementation has been accepted? Release a new format version and then merge the implementation? IMHO, the bottom line is that the release of a new format version should not come later than the release of the associated implementation; otherwise a parquet implementation release may carry an unreleased parquet.thrift file.

wgtmac commented 1 year ago

I have sent an email to discuss the next format release: https://lists.apache.org/thread/5hnr6r45rg8wpvnq0psq7gb5895nwcb2

Just want to check whether this PR should be included in the v2.10 release or not. @emkornfield @gszadovszky

mapleFU commented 1 year ago

@emkornfield I can start to draft this in the C++ repo later this week.

wgtmac commented 1 year ago

@emkornfield I can start to draft this in the C++ repo later this week.

I can proceed with the Java implementation to satisfy the two-implementation requirement.

emkornfield commented 1 year ago

Some questions below. Also, most of this file seems to wrap at ~80 columns; can you please respect that?

Will take a pass at enforcing this a little later today.

GregoryKimball commented 1 year ago

Thank you @emkornfield for suggesting this change, @pitrou for your comment, and @mapleFU, @wgtmac, @gszadovszky, @etseidl for the discussion.

In the libcudf chunked parquet reader, we would benefit greatly from having SizeStatistics added to the ColumnIndex, for example:

ColumnMetaData:
optional SizeStatistics size_estimate_statistics;

ColumnIndex:
optional list<SizeStatistics> size_estimate_statistics;

We would benefit from having page-level values for unencoded_variable_width_stored_bytes because it would help us step through the pages of a row group to yield consistently-sized table "chunks". We created the chunked reader to read row groups that explode to 10-100 GB table sizes when decompressed and decoded.

The repetition_definition_level_histograms field is also useful for estimating the row count per page and aligning pages between ColumnChunks. We don't need to track FullSizeStatistics in our use case; just the histograms and unencoded_variable_width_stored_bytes at the page level will suffice.

Thank you for your help!
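
A minimal sketch of the page-stepping described above, assuming the per-page values land in the page index in some form (struct and field names are hypothetical):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative per-page statistics.
struct PageStats {
  int64_t unencoded_variable_width_bytes;
  std::vector<int64_t> rep_level_histogram;  // bin i = count at rep level i
};

// Rows in a page equal the count of repetition level 0 entries, since
// each new record starts at repetition level 0.
int64_t RowsInPage(const PageStats& p) {
  return p.rep_level_histogram.empty() ? 0 : p.rep_level_histogram[0];
}

// Greedily split a row group's pages into "chunks" whose accumulated
// unencoded variable-width size stays under a decompression budget.
std::vector<std::size_t> PlanChunks(const std::vector<PageStats>& pages,
                                    int64_t budget_bytes) {
  std::vector<std::size_t> chunk_starts;
  int64_t acc = 0;
  for (std::size_t i = 0; i < pages.size(); ++i) {
    const int64_t sz = pages[i].unencoded_variable_width_bytes;
    if (chunk_starts.empty() || acc + sz > budget_bytes) {
      chunk_starts.push_back(i);  // start a new chunk at this page
      acc = 0;
    }
    acc += sz;
  }
  return chunk_starts;
}
```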

m29498 commented 1 year ago

Thanks @GregoryKimball and @etseidl. We would also find this change very useful. As @GregoryKimball mentioned, we can use the extra size statistics in the page footer to more accurately predict the memory usage of decompressed pages in files.

We have a use case based on rapidsai/cudf that would greatly benefit from the chunked Parquet reader working in the manner described above. Currently, we have to do a lot of work in our GPU-based Parquet reader to ensure that we don't try to read more of a Parquet file than we have room to decompress in GPU memory. With this change, that information would be available in the file and no size prediction would be necessary.

We would really like to see this implemented!

emkornfield commented 1 year ago

Just wanted to tag that https://github.com/apache/parquet-format/pull/197#discussion_r1301338683 (tagging here because it is a comment on an outdated version) seems to be the last remaining question to work out before we can prototype. @wgtmac @mapleFU thank you for volunteering to prototype this, since you two will be doing the work I'll defer to your preference here. CC @etseidl @gszadovszky @GregoryKimball @m29498

etseidl commented 1 year ago

Question about the two-implementation rule: is there a set of preferred implementations? Since I'll likely be the one implementing this change in libcudf, I'd be happy to prototype this as well, both with and without the SizeStatistics in the ColumnIndex, to compare impacts on metadata sizes.

etseidl commented 1 year ago

Hi all, just wanted to share some preliminary results with the new statistics. I implemented this PR using both the RepetitionDefinitionLevelHistogram and the full SizeStatistics struct in the ColumnIndex. I used four files I use frequently for testing: two large files with a flat schema and varying mixes of integer and string data, and two smaller files that are deeply nested. The table below shows the impact on the size of the ColumnIndex, as well as the impact on total file size, for each of the test files.

Column index sizes below are in bytes:

| file     | file size (MiB) | no size stats | histograms | full size stats |
|----------|-----------------|---------------|------------|-----------------|
| flat 1   | 1883.1          | 1730740       | 2229005    | 2498311         |
| flat 2   | 1695.4          | 2322339       | 2884139    | 3265139         |
| nested 1 | 12.1            | 3085          | 4287       | 4683            |
| nested 2 | 282.2           | 22704         | 34852      | 38267           |

For the files with a flat schema, the histograms resulted in a 24-29% increase in the index size. Adding in the unencoded size bumped that to a 41-44% increase. The relatively large impact of the added size info is due to (a) the lack of a repetition level histogram and (b) the small definition level histogram (2 bins). For the nested files, the histograms added 40-54% to the ColumnIndex size, now that the repetition level histograms are populated and the max definition level is as high as 9. For these files, the addition of the size info had a less dramatic effect, with the full stats adding 52-69% to the index.

The overall impact on file size was negligible, however, with the largest increase being an additional 0.053%.

So the good news here is no dramatic increase in file sizes, but the bad news is a pretty significant hit to ColumnIndex sizes. If the latter is a concern, perhaps it is a better idea to move the per-page size statistics into their own structure separate from the page indexes. Then the page histogram data could be skipped altogether if the filtering predicate doesn't include any null logic.

emkornfield commented 1 year ago

@etseidl Thanks for prototyping. One more question: approximately how many columns were in each file? (I'm trying to understand the average size increase per column.) Without actually benchmarking, I'd guess this growth probably does not add too much overhead (if anything, it is likely extra thrift parsing rather than I/O or memory).

@gszadovszky @wgtmac thoughts?

etseidl commented 1 year ago

One more question: approximately how many columns were in each file?

@emkornfield "flat 1" has 39 columns, "flat 2" has 8 columns, "nested 1" has 47 leaf columns, and "nested 2" has 24 leaf columns.

As far as performance goes, writing the indexes took hundreds of microseconds vs. total write times in the seconds :smile: Actually generating the histograms had a larger impact than writing them.

wgtmac commented 1 year ago

As far as performance goes, writing the indexes took hundreds of microseconds vs. total write times in the seconds 😄 Actually generating the histograms had a larger impact than writing them.

Do you have numbers on the time spent collecting the histograms? And what about the average number of records per page and the total number of records in the file? The reason I ask is that the number of pages can significantly affect the page index size. @etseidl

From the above results, I am not so worried about the growth in the column index size. IMHO, though the initial design goal of the page index was mainly page filtering, the OffsetIndex can be used on its own for better I/O planning of pages instead of blindly reading them in sequence. Therefore I do not object to adding SizeStatistics to the ColumnIndex. The downside is that people who do not need this info have to pay for I/O and thrift deserialization of the SizeStatistics.

etseidl commented 1 year ago

Do you have numbers on the time spent collecting the histograms? And what about the average number of records per page and the total number of records in the file? The reason I ask is that the number of pages can significantly affect the page index size.

@wgtmac I'll have to get back to you on that (the data is on my work computer 😅). The number of rows per page should be around 20000 (but can be a little lower due to max_page_size constraints), but the records per page can vary wildly in the nested files. I'll get some exact times tomorrow, but IIRC for the "flat 1" file, the histogram collection was under 30ms once I figured out how to do that part in parallel (it had been over 60ms with a serial implementation).

Edit: @wgtmac I measured the encode times for "flat 1" and "nested 1" both with and without the histograms. For "flat 1", calculating the histograms added 2ms to a total time of 989ms. Encoding the histograms and size info added about 40us to the total time. For "nested 1" I couldn't measure an impact for generating the histograms, and again got about 40us extra out of 350ms for the entire write. So the time impact for my data is negligible.

Trying to get average page sizes is kind of difficult...instead I'm going to try adding the output of "dump -d" from the old parquet-tools for each file. I'm just including the first row group for each, but note at the top how many row groups there were. flat1.txt flat2.txt nested1.txt nested2.txt

mapleFU commented 1 year ago

After your data in https://github.com/apache/parquet-format/pull/197#issuecomment-1699773196, I'm now positive about having a size in the OffsetIndex.

As an implementation detail, can we omit the rep-def histogram when max-rep <= 1 and max-def <= 1, since we already have the page ordinal in the OffsetIndex and the null count in the ColumnIndex? This might take less space but makes things a bit tricky. @etseidl @emkornfield

The second point is that I think the size fits better in the OffsetIndex rather than the ColumnIndex.

etseidl commented 1 year ago

As an implementation detail, can we omit the rep-def histogram when max-rep <= 1 and max-def <= 1, since we already have the page ordinal in the OffsetIndex and the null count in the ColumnIndex? This might take less space but makes things a bit tricky. @etseidl @emkornfield

I think that would be ok. My current implementation only writes the histograms when max_level > 0, but could easily be changed to > 1. On the read side, the logic is a little harder, but not unmanageable, especially since we already have to deal with the max_level == 0 case. Once we settle on where everything goes, I'll modify my code to make use of the new structures and see if there are any problems. @emkornfield does this work for you?

The second point is that I think the size fits better in the OffsetIndex rather than the ColumnIndex.

I'm fine with this. Kind of in the weeds, but by splitting it up this way we do save a little bit of space and processing by not having to encode the SizeStatistics wrapper.
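
On the read side, a minimal sketch of reconstructing the omitted two-bin definition-level histogram in the max_def == 1 case (hypothetical helper; inputs come from the existing per-page null count in the ColumnIndex and the page's value count):

```cpp
#include <array>
#include <cstdint>

// If a writer omits the definition-level histogram when max_def == 1,
// a reader can rebuild the two bins from per-page metadata it already
// has: bin 0 is the null count, bin 1 is the remaining defined values.
std::array<int64_t, 2> ReconstructDefHistogram(int64_t page_num_values,
                                               int64_t null_count) {
  return {null_count, page_num_values - null_count};
}
```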

etseidl commented 1 year ago

Since we all seem to be in agreement now, it's probably good to list the options available and then make a decision on which to use. My (probably incomplete) list would be:

  1. Simply add SizeStatistics to ColumnIndex. This is the simplest solution, keeps the new data together, and mirrors what is being added to ColumnMetaData. The downside is extra storage and work for clients that may not use this new information.
  2. Add RepetitionDefinitionLevelHistogram to ColumnIndex and unencoded_variable_width_stored_bytes to OffsetIndex (either by adding it as an optional field in the PageLocation, or as an optional list<i64> in OffsetIndex). This is the next simplest to implement, and has modest savings over option 1. It suffers the same drawback that clients are forced to read this extra information. (A rough sketch of this shape appears below.)
  3. Add a size/location pair to ColumnMetaData and a new struct containing list<SizeStatistics>, mirroring how OffsetIndex is written. This allows clients that have no need for this information to ignore it, and allows clients that don't need the full column/offset indexes access to just the size information, but adds complexity and requires reading a third structure for those clients that will use all three.

I think 3 is maybe the most flexible, but since I'd almost always be using all three structures anyway, I'd likely vote for 1 or 2. If forced to pick, I'd probably take 1 right now since I already have it implemented :) I do have the cycles to try out 2 and 3 and can report back if that would be helpful.
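
To make option 2 concrete, here is a rough C++ mirror of the thrift shape it implies; the struct and field names are hypothetical, not the final schema:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Option 2: per-page histograms ride with the ColumnIndex, per-page
// unencoded variable-width byte counts ride with the OffsetIndex.
struct ColumnIndexSizeStats {
  // One flattened histogram per page: (max_rep + 1) + (max_def + 1) bins.
  std::optional<std::vector<int64_t>> rep_def_level_histograms;
};

struct OffsetIndexSizeStats {
  // Parallel to page_locations; one entry per page.
  std::optional<std::vector<int64_t>> unencoded_variable_width_bytes;
};
```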

etseidl commented 1 year ago

I've implemented option 2 now. As expected, the size impact is somewhat smaller due to less nesting in the thrift output. Here are some comparison numbers (apologies, ~~it seems my earlier option 1 implementation left out some bytes somewhere, so the sizes of the indexes have increased somewhat~~ there was a bug in my parallel histogram generator, now fixed, so the sizes now match the earlier results; I've updated the numbers below). Option 3 should have the same size impact as option 1, but with those extra bytes moved to a new structure.

nested 1
                 column  offset   delta
  no stats         3085     883
  full stats 1     4683     883    +1598
  full stats 2     4287    1139    +1458

nested 2
                 column  offset   delta
  no stats        22704    6802
  full stats 1    38267    6802   +15563
  full stats 2    34852    9207   +14553

flat 1
                 column  offset   delta
  no stats      1730740  854682
  full stats 1  2498311  854682  +767571
  full stats 2  2229005 1000333  +643916

flat 2
                 column  offset   delta
  no stats      2322339 1027144
  full stats 1  3265139 1027144  +942800
  full stats 2  2885139 1267144  +802800

I also did a quick test using @mapleFU's suggestion to only write the histograms if max_level > 1. As you'd expect, for the files with a flat schema no histogram data was written at all. For the nested files the histogram size was reduced, but not by much (only 300 bytes for "nested 2"). @emkornfield @wgtmac @mapleFU @gszadovszky

etseidl commented 1 year ago

I forgot to mention that for option 2 I added unencoded_variable_width_stored_bytes to the PageLocation struct.

Now I think I'm leaning towards option 2. For some of my use cases, I think I can get away with just reading the offset index.

wgtmac commented 1 year ago

Thanks for the quick PoC! It seems that option 2 is the best at the moment. But option 1 has more flexibility if we intend to add more fields to SizeStatistics.

emkornfield commented 1 year ago

As an implementation detail, can we omit the rep-def histogram when max-rep <= 1 and max-def <= 1, since we already have the page ordinal in the OffsetIndex and the null count in the ColumnIndex? This might take less space but makes things a bit tricky. @etseidl @emkornfield

I don't see any downside for the max-def level. For the max-rep level, this would lose the ability to filter on queries like list_length(col) > 1.

Regarding where to place the size: I agree option 2 is the best; I think having a separate list makes the most sense. I'll update the PR to reflect this.
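
A sketch of that rep-level pruning (hypothetical helper, assuming a per-page repetition-level histogram):

```cpp
#include <cstdint>
#include <vector>

// For a single-level list column, every element after the first in a
// list is recorded at repetition level 1, so a zero count in bin 1
// proves no list in the page has length > 1 and the page can be
// skipped for a predicate like list_length(col) > 1.
bool PageMayContainListLongerThanOne(
    const std::vector<int64_t>& rep_level_histogram) {
  return rep_level_histogram.size() > 1 && rep_level_histogram[1] > 0;
}
```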

emkornfield commented 1 year ago

OK, pushed updates. @etseidl @mapleFU @wgtmac @pitrou @gszadovszky hopefully we can say this is a good version to prototype implementation on?

etseidl commented 1 year ago

hopefully we can say this is a good version to prototype implementation on?

Looks good to me. I'll get started now.

mapleFU commented 1 year ago

Also cc @tustvold as arrow-rs parquet maintainer

emkornfield commented 1 year ago

Based on https://github.com/apache/parquet-format/pull/197#discussion_r1316059558 I think we can now move to the simpler option of just putting SizeStatistics on the ColumnIndex to consolidate everything? I would guess this would also make implementations simpler. CC @etseidl @wgtmac @mapleFU @pitrou @gszadovszky

etseidl commented 1 year ago

I think we can now move to the simpler option of just putting SizeStatistics on the ColumnIndex to consolidate everything? I would guess this would also make implementations simpler.

I have both ways implemented on the writer side, so I do not hold a strong opinion in that regard. With regard to clients, if the unencoded byte sizes were to remain in the OffsetIndex, I can think of only one case where I would use the OffsetIndex alone and not also read the ColumnIndex. I do think using SizeStatistics on both the ColumnIndex and ColumnMetaData is more consistent. However, I'm happy to yield to whoever has the strongest opinion.

wgtmac commented 1 year ago

I do not have a strong opinion on choosing option 1 or 2. Option 1 is better when predicates are unavailable and memory budget is not a concern, in which case only the OffsetIndex needs to be read for I/O planning.

wgtmac commented 11 months ago

Java PoC implementation: https://github.com/apache/parquet-mr/pull/1177

etseidl commented 11 months ago

C++ PoC as well https://github.com/rapidsai/cudf/pull/14000. Do we still need arrow-cpp to proceed? If so, @mapleFU have you started on https://github.com/apache/arrow/issues/37869 yet? If not, I have cycles to try a first pass at an implementation if that would be helpful.

wgtmac commented 11 months ago

C++ PoC as well rapidsai/cudf#14000. Do we still need arrow-cpp to proceed? If so, @mapleFU have you started on apache/arrow#37869 yet? If not, I have cycles to try a first pass at an implementation if that would be helpful.

IMHO, two PoC impls do not imply that they have to be parquet-mr and arrow-cpp. So should we proceed to a formal vote on this PR? @emkornfield @gszadovszky

Once this PR has been merged, I can proceed with the release process of parquet-format v2.10. After that, all pending PoCs can be formally reviewed and merged.

emkornfield commented 11 months ago

I think it is reasonable for two different implementations that can be demonstrated to interop with each other. We should really document the exact policy, though. @gszadovszky @pitrou

etseidl commented 11 months ago

I think it is reasonable for two different implementations that can be demonstrated to interop with each other. We should really document the exact policy, though.

I have @wgtmac's branch compiled locally, and can test the interoperability between the two existing PoC implementations. FWIW I've been using a modified parquet-cli in parquet-mr to print out the statistics I generate, so I don't foresee any big problems there.

Update: I've verified that Spark with Gang's changes can read my files, and my implementation can read the Spark-generated files. Also verified the histograms and unencoded sizes match at the chunk level (pages are problematic because we choose different cutoff strategies, so even if the limits are set to the same values we get different page sizes).

Update 2: I set rows per page to 100 and then got congruent files. Can now confirm both PoCs produce the same histograms and unencoded sizes at the page level as well. This is for a file with a wide variety of nesting with and without null values. Let me know if there are any specific files from parquet-testing you'd like to see run through both PoCs.

wgtmac commented 11 months ago

Thanks @etseidl for verifying two implementations!

As they are still in the PoC state, I think the manual verification is sufficient, and I'd prefer to delay the interoperability work of adding parquet files with SizeStatistics to the parquet-testing repo. We can add testing files after the implementations are formally reviewed and merged.

WDYT? @emkornfield