Open etseidl opened 4 weeks ago
I don't remember all the context, but if this is completely wrong, I'd rather deprecate the field and document it should not be used rather than setting the value to zero. Setting to zero has a few issues:
What do other implementations put in this field? (if no other implementation sets this, then this might be a different story)
I'd rather deprecate the field and document it should not be used rather than setting the value to zero.
I agree with deprecating, but I'm less sanguine about leaving an incorrect value in parquet-java, especially given the fact that arrow-cpp (and arrow-rs I believe) populate this field correctly. Having such a big difference between major implementations is IMO more confusing than stating the field should be set to 0 (or -1) if there is no second copy of the ColumnMetaData
.
it might break implementations that have been using this to find the first page.
Implementations that do this will break anyway if they try to read a file produced by arrow, so I don't know how big of a concern this is.
That said, if the consensus is to just leave this be, that's fine too...we'd just have to make note of differing interpretations in the format documents.
Fixes the referenced issue wherein the
file_offset
field of theColumnChunk
object is improperly set to the offset of the first page in the column chunk. Because parquet-java does not write a copy ofColumnMetaData
after the column chunk, this PR simply sets the value offile_offset
to 0.Make sure you have checked all steps below.
Jira
Tests
file_offset
field, so unit testing is difficult. Manual inspection of generated files confirms the desired output.Commits
Style
mvn spotless:apply -Pvector-plugins
Documentation