apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k #3041

Closed by pan3793 2 weeks ago

pan3793 commented 2 weeks ago

Rationale for this change

Fixes the data loss issue reported in #3040.

What changes are included in this PR?

Ensure that StreamBytesInput#writeInto(ByteBuffer buffer) copies the data fully, even if the underlying InputStream's available() does not report the number of remaining bytes correctly.
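
For readers unfamiliar with this failure mode, here is a minimal, self-contained sketch of the general idea. It is not the actual patch. The class FullCopySketch, the helper LyingStream, and both copy methods are hypothetical names made up for illustration, and the 8192-byte cap is only an assumed example of a stream that under-reports available(). It contrasts a copy that trusts available() with one that loops on read() until the requested byte count has been transferred.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

public class FullCopySketch {

    /** Hypothetical stream whose available() under-reports (capped at 8192 here). */
    static class LyingStream extends ByteArrayInputStream {
        LyingStream(byte[] data) {
            super(data);
        }

        @Override
        public synchronized int available() {
            return Math.min(super.available(), 8192);
        }
    }

    /** Fragile: copies at most what available() claims, so data can be silently truncated. */
    static void copyTrustingAvailable(InputStream in, ByteBuffer target, int byteCount)
            throws IOException {
        int n = Math.min(in.available(), byteCount);
        byte[] tmp = new byte[n];
        int read = in.read(tmp, 0, n);
        if (read > 0) {
            target.put(tmp, 0, read);
        }
    }

    /** Robust: keeps reading until byteCount bytes are copied, or fails loudly on early EOF. */
    static void copyFully(InputStream in, ByteBuffer target, int byteCount)
            throws IOException {
        byte[] chunk = new byte[4096];
        int remaining = byteCount;
        while (remaining > 0) {
            int n = in.read(chunk, 0, Math.min(chunk.length, remaining));
            if (n < 0) {
                throw new EOFException(
                        "Stream ended with " + remaining + " of " + byteCount + " bytes unread");
            }
            target.put(chunk, 0, n);
            remaining -= n;
        }
    }

    public static void main(String[] args) throws IOException {
        int size = 20_000; // larger than the assumed 8192 cap

        ByteBuffer fragile = ByteBuffer.allocate(size);
        copyTrustingAvailable(new LyingStream(new byte[size]), fragile, size);
        System.out.println("trusting available(): " + fragile.position() + " bytes copied");

        ByteBuffer robust = ByteBuffer.allocate(size);
        copyFully(new LyingStream(new byte[size]), robust, size);
        System.out.println("looping on read():    " + robust.position() + " bytes copied");
    }
}
```

The design point is that InputStream.available() is only an estimate of what can be read without blocking, so the loop treats a return value of -1 from read() as the only end-of-stream signal.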

Are these changes tested?

Unit tests are added. I also tested the fix against an internal production data loss case.

Are there any user-facing changes?

Yes, this fixes some data loss cases. I am aware that the bug affects Spark 4.0.0 preview2, which ships Parquet 1.14.2.

Closes #3040

pan3793 commented 2 weeks ago

cc @gszadovszky @wgtmac @Fokko

Fokko commented 2 weeks ago

Thanks @pan3793 for finding and fixing this, and thanks @wgtmac @ConeyLiu and @gszadovszky for the review 🙌