Closed by pan3793 2 weeks ago
cc @gszadovszky @wgtmac @Fokko
Thanks @pan3793 for finding and fixing this, and thanks @wgtmac @ConeyLiu and @gszadovszky for the review 🙌
Rationale for this change
Fixes the data loss issue reported in #3040.
What changes are included in this PR?
Ensure that `StreamBytesInput#writeInto(ByteBuffer buffer)` copies data fully, even if the underlying `InputStream` does not report `available()` correctly.
Are these changes tested?
Unit tests are added; I also tested the fix against an internal production data loss case.
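For readers who want to see the failure mode in isolation, here is a minimal sketch. It is not the code in this PR: `TrickleStream` and `copyFully` are hypothetical names made up for illustration. The assumption is that the stream returns only a few bytes per `read` call and always reports `available() == 0`, so a copy that trusts `available()` or issues a single `read` would come up short, while a loop that keeps reading until the requested byte count is reached does not.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

final class CopyFullySketch {

    /** Hypothetical stream: hands out at most 3 bytes per read and misreports available(). */
    static final class TrickleStream extends InputStream {
        private final InputStream in;

        TrickleStream(byte[] data) {
            this.in = new ByteArrayInputStream(data);
        }

        @Override
        public int read() throws IOException {
            return in.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            // Short reads: never return more than 3 bytes at a time.
            return in.read(b, off, Math.min(len, 3));
        }

        @Override
        public int available() {
            // Claims nothing is available even when data remains.
            return 0;
        }
    }

    /**
     * Hypothetical helper: copies exactly byteCount bytes into target.
     * It loops on read() and never consults available().
     */
    static void copyFully(InputStream in, int byteCount, ByteBuffer target) throws IOException {
        byte[] chunk = new byte[8192];
        int remaining = byteCount;
        while (remaining > 0) {
            int n = in.read(chunk, 0, Math.min(chunk.length, remaining));
            if (n < 0) {
                throw new EOFException("Stream ended with " + remaining + " bytes still expected");
            }
            target.put(chunk, 0, n);
            remaining -= n;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10];
        ByteBuffer target = ByteBuffer.allocate(data.length);
        copyFully(new TrickleStream(data), data.length, target);
        System.out.println("copied " + target.position() + " bytes");
    }
}
```

Failing with `EOFException` on a premature end of stream is a design choice of this sketch: a truncated copy becomes a visible error rather than silent data loss.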
Are there any user-facing changes?
Yes, this fixes some data loss cases. Note that the bug affects Spark 4.0.0 preview2, which ships Parquet 1.14.2.
Closes #3040