apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0
2.48k stars 1.37k forks source link

PARQUET-2139: fix file_offset field in ColumnChunk metadata #1369

Open etseidl opened 4 weeks ago

etseidl commented 4 weeks ago

Fixes the referenced issue wherein the file_offset field of the ColumnChunk object is improperly set to the offset of the first page in the column chunk. Because parquet-java does not write a copy of ColumnMetaData after the column chunk, this PR simply sets the value of file_offset to 0.

Make sure you have checked all steps below.

Jira

Tests

Commits

Style

Documentation

julienledem commented 3 weeks ago

I don't remember all the context, but if this is completely wrong, I'd rather deprecate the field and document it should not be used rather than setting the value to zero. Setting to zero has a few issues:

What do other implementations put in this field? (if no other implementation sets this, then this might be a different story)

etseidl commented 3 weeks ago

I'd rather deprecate the field and document it should not be used rather than setting the value to zero.

I agree with deprecating, but I'm less sanguine about leaving an incorrect value in parquet-java, especially given the fact that arrow-cpp (and arrow-rs I believe) populate this field correctly. Having such a big difference between major implementations is IMO more confusing than stating the field should be set to 0 (or -1) if there is no second copy of the ColumnMetaData.

it might break implementations that have been using this to find the first page.

Implementations that do this will break anyway if they try to read a file produced by arrow, so I don't know how big of a concern this is.

That said, if the consensus is to just leave this be, that's fine too...we'd just have to make note of differing interpretations in the format documents.