apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0

Deprecate RowGroup.file_offset #394

Open asfimport opened 3 years ago

asfimport commented 3 years ago

Due to PARQUET-2078, RowGroup.file_offset is not reliable.

This field is also calculated incorrectly in the open-source C++ Parquet implementation (PARQUET-2089).

Reporter: Gabor Szadovszky / @gszadovszky Assignee: Gidon Gershinsky / @ggershinsky

Note: This issue was originally created as PARQUET-2080. Please see the migration documentation for further details.

asfimport commented 3 years ago

Gabor Szadovszky / @gszadovszky: @ggershinsky, although the original topic of this Jira is invalid, we still need to add proper comments to RowGroup.file_offset that describe the situation in PARQUET-2078 and help implementations handle a potentially wrong value. Would you like to handle this?
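
For illustration only (not from the thread): one way a reader might cope with a possibly wrong RowGroup.file_offset is to ignore it and derive the row group start from the first column chunk's page offsets. The sketch below assumes the first column chunk in the metadata is also the first one written in the row group, and that a dictionary page, when present, precedes the data pages. `ColumnChunkMeta`, `RowGroupMeta`, `column_chunk_start`, and `row_group_start` are hypothetical names for this example; only `data_page_offset`, `dictionary_page_offset`, and `file_offset` correspond to fields in the format's metadata.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnChunkMeta:
    # Hypothetical stand-in for the column chunk metadata a reader has parsed.
    data_page_offset: int
    dictionary_page_offset: Optional[int] = None


@dataclass
class RowGroupMeta:
    # Hypothetical stand-in for a parsed row group.
    columns: List[ColumnChunkMeta]
    file_offset: Optional[int] = None  # possibly wrong, see PARQUET-2078


def column_chunk_start(col: ColumnChunkMeta) -> int:
    """Return the offset of the first page of a column chunk."""
    dict_off = col.dictionary_page_offset
    # Use the dictionary page offset only if it is set, positive,
    # and located before the first data page.
    if dict_off is not None and 0 < dict_off < col.data_page_offset:
        return dict_off
    return col.data_page_offset


def row_group_start(rg: RowGroupMeta) -> int:
    """Derive the row group start without consulting rg.file_offset."""
    if not rg.columns:
        raise ValueError("row group has no column chunks")
    # Assumption: the first column chunk is the first one written.
    return column_chunk_start(rg.columns[0])
```

With this approach, a stored file_offset that disagrees with the derived value is simply never read, so the reader does not need to detect whether the writer was affected.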

asfimport commented 3 years ago

Gidon Gershinsky / @ggershinsky: @gszadovszky, yes, I'll take it. There might be a different solution (also format-related) that bypasses the need to calculate such a parameter in any implementation, so that it can be fully deprecated. I'll get back with the details and we'll discuss the trade-offs.

asfimport commented 3 years ago

Gidon Gershinsky / @ggershinsky: Hi @gszadovszky, I've prepared a short writeup on this alternative solution, with a discussion of the trade-offs. After writing it, my feeling is that the trade-off is not in favor of this alternative option, but here it goes, just to cover all bases. I'd appreciate your opinion on this.

asfimport commented 3 years ago

Gabor Szadovszky / @gszadovszky: @ggershinsky, could you make the doc available for comments?

asfimport commented 3 years ago

Gidon Gershinsky / @ggershinsky: Oh, sorry, done.