Describe the bug, including details regarding any error messages, version, and platform.
Background
I got some data loss reports after upgrading our internal Spark's Parquet from 1.13.1 to 1.14.3. After some experiments, I believe this is a bug on the Parquet side, and it can be worked around by disabling spark.sql.parquet.filterPushdown.
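For reference, a minimal sketch of that workaround as a standalone Java Spark application; the app name, master, input path, and filter below are placeholders and not taken from the affected jobs.

```java
import org.apache.spark.sql.SparkSession;

public class DisablePushdownWorkaround {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("parquet-pushdown-workaround") // placeholder name
        .master("local[*]")
        .getOrCreate();

    // Workaround: disable Parquet filter pushdown so predicates are not pushed
    // to the Parquet reader and DictionaryFilter.canDrop is never exercised.
    spark.conf().set("spark.sql.parquet.filterPushdown", "false");

    // Placeholder path; with pushdown disabled the filter is applied by Spark
    // after the rows are read, at the cost of scanning more data.
    spark.read().parquet("/path/to/table").filter("id = 1").show();

    spark.stop();
  }
}
```

Passing the same key via --conf spark.sql.parquet.filterPushdown=false on spark-submit has the same effect.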
Analysis
With some debugging, I think the issue was introduced by PARQUET-2432 (https://github.com/apache/parquet-java/pull/1278).
The issue is that during the evaluation of DictionaryFilter.canDrop (this happens when reading a column that has PLAIN_DICTIONARY encoding with pushed-down predicates), when the dictionary size exceeds 8 KB, only the first 8 KB is copied:
https://github.com/apache/parquet-java/blob/274dc51bc9e5cc880ba3c77c3db826d2a4943965/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/DictionaryPageReader.java#L113
(screenshot: the correct data)
(screenshot: the copied data)
The root cause is the following line, which may not read fully if the underlying InputStream's available method always returns 0:
https://github.com/apache/parquet-java/blob/274dc51bc9e5cc880ba3c77c3db826d2a4943965/parquet-common/src/main/java/org/apache/parquet/bytes/BytesInput.java#L379
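To make the failure mode concrete, here is a small self-contained Java sketch. It is not the Parquet code, and the class and method names are made up for illustration; it only contrasts a copy loop that trusts InputStream.available() with one that reads until end-of-stream, using a wrapper stream whose available() always returns 0.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class AvailableTruncationDemo {

  // Wrapper whose available() always reports 0, as some buffering or
  // decompressing streams legitimately do; available() is only an estimate
  // per the JDK contract.
  static class ZeroAvailableStream extends FilterInputStream {
    ZeroAvailableStream(InputStream in) {
      super(in);
    }

    @Override
    public int available() {
      return 0;
    }
  }

  // Broken copy: stops as soon as available() claims nothing is left.
  // With an 8 KB buffer, only the first 8 KB of a larger page survives.
  static byte[] copyTrustingAvailable(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[8192];
    int n;
    do {
      n = in.read(buf);
      if (n > 0) {
        out.write(buf, 0, n);
      }
    } while (n > 0 && in.available() > 0);
    return out.toByteArray();
  }

  // Correct copy: reads until end-of-stream and ignores available() entirely.
  static byte[] copyUntilEof(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[8192];
    int n;
    while ((n = in.read(buf)) > 0) {
      out.write(buf, 0, n);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] page = new byte[32 * 1024]; // stand-in for a dictionary page larger than 8 KB

    byte[] truncated = copyTrustingAvailable(new ZeroAvailableStream(new ByteArrayInputStream(page)));
    byte[] complete = copyUntilEof(new ZeroAvailableStream(new ByteArrayInputStream(page)));

    System.out.println("available()-based copy: " + truncated.length); // 8192
    System.out.println("read-until-EOF copy:    " + complete.length); // 32768
  }
}
```

Running it prints 8192 bytes for the available()-driven copy and 32768 bytes for the read-until-EOF copy, which matches the symptom above where only the first 8 KB of a larger dictionary page survives.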
Component(s)

Core