emkornfield closed this pull request 10 months ago.
Seems this patch doesn't need to consider backward-compatibility rules?
Since we are storing rep/def levels directly I don't think so since those rules are applied on top of this data to make the correct inferences.
Do we want to include these statistics at both row group (column chunk) and page level? For the latter I am not sure it is the right approach. We implemented column indexes so one would not need to read the page header to get the related statistics. We even stopped writing Statistics into page headers in parquet-mr. If we only want these for the column chunk level then I would suggest having it under ColumnMetaData directly.
@gszadovszky Is there an argument against flexibility here? I believe parquet-cpp still writes page headers. One argument for page headers is that they allow readers better incremental estimates of memory needed as they progress (although it is possible that taking an average size per cell at the column chunk level is sufficient here). I.e., these stats accomplish a few things: 1. finer-grained pushdown for nulls (mostly valuable at the column chunk level, and possibly worth adding to the page index). 2. estimating memory usage for in-memory reconstruction, to avoid allocations during scans. 3. query planning, to get better estimates for things like broadcast joins.
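To illustrate use case 2, here is a hedged sketch (the helper name and layout assumptions are mine, not parquet-cpp's API) of sizing decode buffers up front from chunk-level size statistics for a variable-width column:

```python
# Hypothetical sketch of use case 2: estimating the decoded in-memory size of
# a variable-width (e.g. BYTE_ARRAY) column chunk from chunk-level size
# statistics, assuming an Arrow-style layout with one data buffer plus a
# 32-bit offset per value (+1 terminal offset). Allows preallocating buffers
# before a scan instead of growing them incrementally.

def estimate_decoded_bytes(num_values, unencoded_bytes, offset_width=4):
    return unencoded_bytes + offset_width * (num_values + 1)

# 1000 strings totalling 50 KiB of raw (unencoded) data
print(estimate_decoded_bytes(1000, 50 * 1024))  # → 55204
```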
@emkornfield, I am not against re-adding such statistics to page headers if it has benefits. Meanwhile, I would only expect filling in values that really make sense and are not redundant at the related level (page or column chunk). If a value makes sense for filtering then it should be added to the column index instead of the page header. However, from a specification point of view it would not be strictly required to describe the actual use case; it would be nice to do so at some place. WDYT?
This sounds reasonable. Let me refactor this a little bit so we can add it to the column index as well, with a little more description of each of the use cases.
@gszadovszky updated the PR based on your feedback, let me know what you think.
What do you think about adding SizeEstimationStatistics directly to ColumnMetaData and the histograms to the ColumnIndex?
this sounds fine to me. We can always add separately to page headers if people want them. @wgtmac concerns with this?
I am good with this approach.
@gszadovszky do we need a vote on adding this?
@emkornfield, I think this PR was reviewed quite actively. Also, it adds optional statistics that an implementation might use or not. I do not feel this would require a formal vote. Maybe put up the PR on the dev list with a time limit so anyone would have the chance to comment. From my side, it is good to go.
BTW, do you have a plan to implement this in the parquet-cpp once merged? Just curious. @emkornfield
@wgtmac I would like to get to it but probably won't have bandwidth at least this month, is this something you have time to take on?
Should we release format v2.10.0 before implementation? It has been years since the last release.
A new release probably makes sense; I'm not clear on the general ordering here. It would be great to document general rules around format changes, releases, implementations, etc.
@emkornfield, I agree format-related guidelines are not properly documented, or are mixed a bit with parquet-mr guidelines. We usually expect implementations of new features in the format before the release to prove the related change. However, I am not sure the current unreleased changes need proof if they make sense as is.
Thanks @gszadovszky I think maybe we should do a release before this change goes in, as it seems the most likely to run into implementation challenges? @wgtmac thoughts?
That sounds good. Then what should we do once the POC implementation has been accepted? Release a new format version and then merge the implementation? IMHO, the bottom line is that the release of the new format should not be later than the release of the associated implementation. Otherwise the release of a parquet implementation may carry an unreleased parquet.thrift file.
I have sent an email to discuss the next format release: https://lists.apache.org/thread/5hnr6r45rg8wpvnq0psq7gb5895nwcb2
Just want to check whether this PR should be included in the v2.10 release or not. @emkornfield @gszadovszky
@emkornfield I can start to draft this in C++ repo later this week
I can proceed with the Java implementation to satisfy the two implementations requirement.
Some questions below. Also, most of this file seems to wrap around ~80 columns, can you please respect that?
Will take a pass at enforcing this a little later today.
Thank you @emkornfield for suggesting this change, @pitrou for your comment and @mapleFU, @wgtmac, @gszadovszky, @etseidl for the discussion.
In the libcudf chunked parquet reader, we would benefit greatly from having SizeStatistics added to ColumnIndex, such as:

ColumnMetaData:
  optional SizeStatistics size_estimate_statistics;

ColumnIndex:
  optional list<SizeStatistics> size_estimate_statistics;
We would benefit from having page-level values for unencoded_variable_width_stored_bytes because it would help us step through the pages of a row group to yield consistently-sized table "chunks". We created the chunked reader to read row groups that explode to 10-100 GB table sizes when decompressed and decoded.
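The stepping logic described above might look roughly like the following (a sketch under my own assumptions, not libcudf's actual chunked-reader code), given a list of per-page unencoded sizes:

```python
# Sketch of the chunking strategy described above, assuming the new statistics
# expose a per-page unencoded size (here `page_sizes`, in bytes). Pages are
# greedily grouped so each chunk's decoded size stays under a memory budget;
# a single page larger than the budget still forms its own chunk.

def plan_chunks(page_sizes, budget):
    chunks, current, current_bytes = [], [], 0
    for i, size in enumerate(page_sizes):
        if current and current_bytes + size > budget:
            chunks.append(current)           # close the current chunk
            current, current_bytes = [], 0
        current.append(i)                    # page index joins the chunk
        current_bytes += size
    if current:
        chunks.append(current)
    return chunks

print(plan_chunks([400, 300, 500, 900, 100], 1000))  # → [[0, 1], [2], [3, 4]]
```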
The repetition_definition_level_histograms are also useful for estimating row count per page and aligning the pages between ColumnChunks. We don't need to track FullSizeStatistics in our use case; just the histograms and unencoded_variable_width_stored_bytes at the page level will suffice.
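A quick sketch of the row-count estimate mentioned above (hypothetical helper names, not from any Parquet library), under the assumption that rep_histogram[i] holds the number of values in a page with repetition level i:

```python
# Deriving per-page row counts from a repetition-level histogram, where
# rep_histogram[i] = number of values in the page with repetition level i.

def rows_in_page(rep_histogram):
    # Every value with repetition level 0 starts a new record.
    return rep_histogram[0]

def avg_values_per_row(rep_histogram):
    # Useful when aligning pages across ColumnChunks with different nesting.
    return sum(rep_histogram) / rep_histogram[0]

# e.g. a page holding 10 records containing 75 leaf values in total
print(rows_in_page([10, 25, 40]))        # → 10
print(avg_values_per_row([10, 25, 40]))  # → 7.5
```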
Thank you for your help!
Thanks @GregoryKimball and @etseidl. We would also find this change very useful. As @GregoryKimball mentioned, we can use the extra size statistics in the page footer to more accurately predict the memory usage of decompressed pages in files.
We have a use case based on rapidsai/cudf that would greatly benefit from the chunked Parquet reader working in the manner described above. Currently, we have to go to a lot of work in our GPU-based Parquet reader to ensure that we don't try to read more of a Parquet file than we have room to decompress in GPU memory. With this change, that information would be available in the file and no prediction of sizes would be necessary.
We would really like to see this implemented!
Just wanted to tag that https://github.com/apache/parquet-format/pull/197#discussion_r1301338683 (tagging here because it is a comment on an outdated version) seems to be the last remaining question to work out before we can prototype. @wgtmac @mapleFU thank you for volunteering to prototype this, since you two will be doing the work I'll defer to your preference here. CC @etseidl @gszadovszky @GregoryKimball @m29498
Question about the two-implementation rule... is there a set of preferred implementations? Since I'll likely be the one implementing this change in libcudf, I'd be happy to prototype this as well, both with and without the SizeStatistics in the ColumnIndex, to compare impacts on metadata sizes.
Hi all, just wanted to share some preliminary results with the new statistics. I implemented this PR using both the RepetitionDefinitionLevelHistogram and the full SizeStatistics struct in the ColumnIndex. I used four files I use frequently for testing: two large files with a flat schema and varying mixes of integer and string data, and two smaller files that are deeply nested. The table below shows the impact on the size of the ColumnIndex, as well as the impact on total file size, for each of the test files.
| file     | file size (MiB) | column index, no size stats | column index, histograms | column index, full size stats |
|----------|-----------------|-----------------------------|--------------------------|-------------------------------|
| flat 1   | 1883.1          | 1730740                     | 2229005                  | 2498311                       |
| flat 2   | 1695.4          | 2322339                     | 2884139                  | 3265139                       |
| nested 1 | 12.1            | 3085                        | 4287                     | 4683                          |
| nested 2 | 282.2           | 22704                       | 34852                    | 38267                         |

(column index sizes are in bytes)
For the files with a flat schema, the histograms resulted in a 24-29% increase in the index size. Adding in the unencoded size bumped that to a 41-44% increase. The relatively large impact of the added size info is due to a) the lack of a repetition level histogram, and b) the small definition level histogram (2 bins). For the nested files, the histograms added between 40-54% to the ColumnIndex size, now that the repetition level histograms are populated and the max definition level is as high as 9. For these files, the addition of the size info had a less dramatic effect, with the full stats adding between 52-69% to the index.
The overall impact on file size was negligible, however, with the largest increase being an additional 0.053%.
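For intuition, the index growth above can be roughly modeled from the histogram shape. This is my own back-of-the-envelope framing, not from the PR, and it ignores thrift varint encoding of the counts:

```python
# Rough model of histogram overhead per column: each page adds a
# definition-level histogram with (max_def + 1) buckets, plus a
# repetition-level histogram with (max_rep + 1) buckets when the column is
# repeated. Real thrift sizes differ because counts are varint-encoded.

def histogram_entries(num_pages, max_def, max_rep):
    per_page = (max_def + 1) + ((max_rep + 1) if max_rep > 0 else 0)
    return num_pages * per_page

print(histogram_entries(100, 1, 0))  # flat optional column: 2 counters/page
print(histogram_entries(100, 9, 3))  # deeply nested column: 14 counters/page
```

This matches the pattern in the numbers above: flat columns pay only for a tiny 2-bin definition histogram per page, while nested columns pay for both histograms.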
So the good news here is no dramatic increase in file sizes, but the bad news is a pretty significant hit to ColumnIndex sizes. If the latter is a concern, perhaps it is a better idea to move the per-page size statistics to their own structure, separate from the page indexes. Then the page histogram data could be skipped altogether if the filtering predicate doesn't include any null logic.
@etseidl Thanks for prototyping. One more question: approximately how many columns were in each file? (I'm trying to understand the average size increase per column.) Without actually doing benchmarking, I'd guess overall this growth probably does not add too much overhead (if anything it is likely extra thrift parsing rather than I/O or memory).
@gszadovszky @wgtmac thoughts?
One more question: approximately how many columns were in each file?
@emkornfield "flat 1" has 39 columns, "flat 2" has 8 columns, "nested 1" has 47 leaf columns, and "nested 2" has 24 leaf columns.
As far as performance goes, writing the indexes took 100s of microseconds vs total write times in the seconds 😄 Actually generating the histograms was a larger impact than writing them.
Do you have the time spent on collecting the histograms? And what about the average number of records per page and the total number of records in the file? The reason I ask is that the number of pages can significantly affect the page index size. @etseidl
From the above result, I am not so worried about the growth in the column index size. IMHO, though the initial design goal of the page index was mainly page filtering, the OffsetIndex can be used individually for better I/O planning of pages instead of blindly reading them in sequence. Therefore I do not object to adding SizeStatistics to the ColumnIndex. The downside is that people who do not need this info have to pay for I/O and thrift deserialization of the SizeStatistics.
Do you have the time spent on collecting the histograms? And what about the average number of records per page and total number of records in the file? The reason I ask for this is that number of pages can significantly affect the page index size.
@wgtmac I'll have to get back to you on that (the data is on my work computer 😅). The number of rows per page should be around 20000 (but can be a little lower due to max_page_size constraints), but the records per page can vary wildly in the nested files. I'll get some exact times tomorrow, but IIRC for the "flat 1" file, the histogram collection was under 30ms once I figured out how to do that part in parallel (it had been over 60ms with a serial implementation).
Edit: @wgtmac I measured the encode times for "flat 1" and "nested 1" both with and without the histograms. For "flat 1", calculating the histograms added 2ms to a total time of 989ms. Encoding the histograms and size info added about 40us to the total time. For "nested 1" I couldn't measure an impact for generating the histograms, and again got about 40us extra of 350ms for the entire write. So the time impact for my data is negligible.
Trying to get average page sizes is kind of difficult...instead I'm going to try adding the output of "dump -d" from the old parquet-tools for each file. I'm just including the first row group for each, but note at the top how many row groups there were. flat1.txt flat2.txt nested1.txt nested2.txt
After your data in https://github.com/apache/parquet-format/pull/197#issuecomment-1699773196, now I'm positive on having a size in OffsetIndex.
As an implementation detail, can we ignore the rep-def histogram when max-rep <= 1, max-def <= 1, since we already have the page ordinal in OffsetIndex and the null count in ColumnIndex? This might take less space but makes it a bit tricky. @etseidl @emkornfield
The second point is that I think the size fits better in OffsetIndex rather than ColumnIndex.
As an implementation detail, can we ignore the rep-def histogram when max-rep <= 1, max-def <= 1, since we already have the page ordinal in OffsetIndex and the null count in ColumnIndex? This might take less space but makes it a bit tricky. @etseidl @emkornfield
I think that would be ok. My current implementation only writes the histograms when max_level > 0, but could easily be changed to > 1. On the read side, the logic is a little harder, but not unmanageable, especially since we already have to deal with the max_level == 0 case. Once we settle on where everything goes, I'll modify my code to make use of the new structures and see if there are any problems. @emkornfield does this work for you?
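The read-side fallback for omitted histograms could look roughly like this (a hypothetical sketch; the helper name and signature are mine, not the parquet-mr or cudf API):

```python
# When the writer omits the definition-level histogram for a column with
# max_def <= 1, the page's histogram can be reconstructed from values that
# are already available elsewhere (null count from the ColumnIndex, value
# counts derivable from the OffsetIndex page locations).

def def_histogram(max_def, num_values, null_count, stored_histogram=None):
    if stored_histogram is not None:
        return stored_histogram
    if max_def == 0:   # required, non-nested column: every value is defined
        return [num_values]
    if max_def == 1:   # optional, non-nested column: split on null count
        return [null_count, num_values - null_count]
    raise ValueError("stored histogram required when max_def > 1")

print(def_histogram(1, 1000, 37))  # → [37, 963]
```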
The second point is that I think the size fits better in OffsetIndex rather than ColumnIndex.
I'm fine with this. Kind of in the weeds, but by splitting it up this way we do save a little bit of space and processing by not having to encode the SizeStatistics wrapper.
Since we all seem to be in agreement now, it's probably good to list the options available and then make a decision on which to use. My (probably incomplete) list would be:

1. Add SizeStatistics to ColumnIndex. This is the simplest solution, keeps the new data together, and mirrors what is being added to ColumnMetaData. The downside is extra storage and work for clients that may not use this new information.
2. Add RepetitionDefinitionLevelHistogram to ColumnIndex and unencoded_variable_width_stored_bytes to OffsetIndex (either by adding it as an optional field in the PageLocation, or as an optional list<i64> in OffsetIndex). This is the next simplest to implement, and has modest savings over option 1. It suffers the same drawback that clients are forced to read this extra information.
3. Add SizeStatistics to ColumnMetaData and a new struct containing list<SizeStatistics>, mirroring how OffsetIndex is written. This allows clients that have no need for this information to ignore it, and allows clients that don't need the full column/offset indexes access to just the size information, but adds complexity and requires reading a third structure for those clients that will use all three.

I think 3 is maybe the most flexible, but since I'd almost always be using all three structures anyway, I'd likely vote for 1 or 2. If forced to pick, I'd probably take 1 right now since I already have it implemented :) I do have the cycles to try out 2 and 3 and can report back if that would be helpful.
I've implemented option 2 now. As expected, the size impact is somewhat less due to less nesting in the thrift output. Here are some comparison numbers (apologies, ~~it seems my earlier option 1 implementation left out some bytes somewhere, so the sizes of the indexes have increased somewhat~~ there was a bug in my parallel histogram generator, now fixed, so now the sizes match the earlier results. I've updated the numbers below). Option 3 should have the same size impact as option 1, but with those extra bytes moved to a new structure.
nested 1        column    offset    delta
no stats          3085       883
full stats 1      4683       883    +1598
full stats 2      4287      1139    +1458

nested 2        column    offset    delta
no stats         22704      6802
full stats 1     38267      6802    +15563
full stats 2     34852      9207    +14553

flat 1          column    offset    delta
no stats       1730740    854682
full stats 1   2498311    854682    +767571
full stats 2   2229005   1000333    +643916

flat 2          column    offset    delta
no stats       2322339   1027144
full stats 1   3265139   1027144    +942800
full stats 2   2885139   1267144    +802800
I also did a quick test using @mapleFU's suggestion to only write the histograms if max_level > 1. As you'd expect, for the files with a flat schema no histogram data was written at all. For the nested files the histogram size was reduced, but not by much (only 300 bytes for "nested 2").
@emkornfield @wgtmac @mapleFU @gszadovszky
I forgot to mention that for option 2 I added unencoded_variable_width_stored_bytes to the PageLocation struct.
Now I think I'm leaning towards option 2. For some of my use cases, I think I can get away with just reading the offset index.
Thanks for the quick PoC! It seems that option 2 is the best at the moment. But option 1 has more flexibility if we intend to add more fields to SizeStatistics.
As an implementation detail, can we ignore the rep-def histogram when max-rep <= 1, max-def <= 1? Since we already have page-ordinal in OffsetIndex and null-count in ColumnIndex? This might take less space but make it a bit tricky. @etseidl @emkornfield
I don't see any downside for the max-def level. For the max-rep level this would lose the ability to filter on queries like list_length(col) > 1.
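That pruning opportunity can be sketched as follows (my own hypothetical helper, not from any implementation), for a single-level list column where the histogram is indexed by repetition level:

```python
# For a column with max_rep == 1, rep_histogram[1] counts values that
# continue an already-started list. If that bucket is zero, every list in the
# page has length <= 1, so a predicate like list_length(col) > 1 can skip
# the page without decoding it.

def can_skip_list_length_gt_1(rep_histogram):
    return len(rep_histogram) > 1 and rep_histogram[1] == 0

print(can_skip_list_length_gt_1([100, 0]))    # → True: all lists have <= 1 element
print(can_skip_list_length_gt_1([100, 250]))  # → False: some lists are longer
```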
Regarding where to place the size: I agree option 2 is the best; I think having a separate list makes the most sense. I'll update the PR to reflect this.
OK, pushed updates. @etseidl @mapleFU @wgtmac @pitrou @gszadovszky hopefully we can say this is a good version to prototype implementation on?
hopefully we can say this is a good version to prototype implementation on?
Looks good to me. I'll get started now.
Also cc @tustvold as arrow-rs parquet maintainer
Based on https://github.com/apache/parquet-format/pull/197#discussion_r1316059558 I think we can now move to the simpler option of just putting SizeStatistics on Column Index to consolidate everything? I would guess this would also make implementations simpler. CC @etseidl @wgtmac @mapleFU @pitrou @gszadovszky
I think we can now move to the simpler option of just putting SizeStatistics on Column Index to consolidate everything? I would guess this would also make implementations simpler.
I have both ways implemented on the writer side, so I do not hold a strong opinion in that regard. With regard to clients, if the unencoded byte sizes were to remain in the OffsetIndex, I can think of only one case where I would use the OffsetIndex alone and not also read the ColumnIndex. I do think using SizeStatistics in both the ColumnIndex and ColumnMetaData is more consistent. However, I'm happy to yield to whoever has the strongest opinion.
I do not have a strong opinion on choosing option 1 or 2. Option 1 is better when predicates are unavailable and memory budget is not a concern, in which case only the OffsetIndex needs to be read for I/O planning.
Java PoC implementation: https://github.com/apache/parquet-mr/pull/1177
C++ PoC as well https://github.com/rapidsai/cudf/pull/14000. Do we still need arrow-cpp to proceed? If so, @mapleFU have you started on https://github.com/apache/arrow/issues/37869 yet? If not, I have cycles to try a first pass at an implementation if that would be helpful.
IMHO, two PoC impls do not imply that they have to be parquet-mr and arrow-cpp. So should we proceed to a formal vote on this PR? @emkornfield @gszadovszky
Once this PR has been merged, I can proceed with the release process of parquet format v2.10. After that, all pending PoCs can be formally reviewed and merged.
I think it is reasonable to require two different implementations that can be demonstrated to interop with each other. We should really document the exact policy though. @gszadovszky @pitrou
I think it is reasonable to require two different implementations that can be demonstrated to interop with each other. We should really document the exact policy though.
I have @wgtmac's branch compiled locally, and can test the interoperability between the two existing PoC implementations. FWIW I've been using a modified parquet-cli in parquet-mr to print out the statistics I generate, so I don't foresee any big problems there.
Update: I've verified that spark with Gang's changes can read my files, and my implementation can read the spark generated files. Also verified the histograms and unencoded sizes match at the chunk level (pages are problematic because we choose different cutoff strategies, so even if the limits are set to the same values we get different page sizes).
Update 2: I set rows per page to 100 and then got congruent files. Can now confirm both PoCs produce the same histograms and unencoded sizes at the page level as well. This is for a file with a wide variety of nesting with and without null values. Let me know if there are any specific files from parquet-testing you'd like to see run through both PoCs.
Thanks @etseidl for verifying two implementations!
As they are still in the PoC state, I think the manual verification is sufficient, and I'd prefer to delay the interoperability work of adding parquet files with SizeStatistics to the parquet-testing repo. We can add testing files after the implementations are formally reviewed and merged.
WDYT? @emkornfield
@gszadovszky @shangxinli Do you have time to take a look?