Cheng Lian / @liancheng:
The current write path ensures that it never writes a page larger than 2GB, but the read path may read one or more column chunks (each consisting of multiple pages) into a single byte array (or ByteBuffer), which can be no larger than 2GB.
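For context, a single Java byte array or ByteBuffer is indexed by a signed 32-bit int, so it caps out just under 2GB. Here is a minimal standalone sketch (illustrative only, not parquet-mr code) of what happens when a chunk length above that limit is forced into one buffer:

```java
public class SingleBufferLimitDemo {
    public static void main(String[] args) {
        // java.nio.ByteBuffer.allocate(int) and new byte[int] are both limited to
        // Integer.MAX_VALUE (~2GB), so a larger chunk length has to be narrowed first.
        long chunkLength = 2_400_000_000L;   // ~2.4GB, above 2^31 - 1
        int narrowed = (int) chunkLength;    // overflows to a negative value
        System.out.println("narrowed length = " + narrowed);
        try {
            byte[] buffer = new byte[narrowed];
            System.out.println("allocated " + buffer.length + " bytes");
        } catch (NegativeArraySizeException e) {
            System.out.println("cannot hold this chunk in a single buffer: " + e);
        }
    }
}
```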
We hit this issue in production because the data distribution happened to be similar to the situation mentioned in the JIRA description and produced a skewed row group containing a column chunk larger than 2GB.
I think there are two separate issues to fix:

1. The `ConsecutiveChunkList.readAll()` method should support reading data larger than 2GB, probably by using multiple buffers (see the sketch after this list).
2. Another option is to ensure that no row group larger than 2GB can ever be written.

Thoughts?
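A rough sketch of the multiple-buffer idea using plain NIO (the helper name, chunking threshold, and use of FileChannel are assumptions for illustration, not the actual `ConsecutiveChunkList` code):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class LargeRegionReader {
    // Stay safely below Integer.MAX_VALUE, since some JVMs cap array sizes slightly lower.
    private static final int MAX_BUFFER_SIZE = Integer.MAX_VALUE - 8;

    /**
     * Reads [offset, offset + length) from the file into one or more ByteBuffers,
     * so that regions larger than 2GB never require a single oversized allocation.
     */
    public static List<ByteBuffer> readLargeRegion(Path file, long offset, long length)
            throws IOException {
        List<ByteBuffer> buffers = new ArrayList<>();
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long position = offset;
            long remaining = length;
            while (remaining > 0) {
                int toRead = (int) Math.min(remaining, MAX_BUFFER_SIZE);
                ByteBuffer buffer = ByteBuffer.allocate(toRead);
                while (buffer.hasRemaining()) {
                    int read = channel.read(buffer, position + buffer.position());
                    if (read < 0) {
                        throw new IOException(
                            "Unexpected end of file at " + (position + buffer.position()));
                    }
                }
                buffer.flip();
                buffers.add(buffer);
                position += toRead;
                remaining -= toRead;
            }
        }
        return buffers;
    }
}
```

The main cost of this option is that downstream code can no longer assume one contiguous array and has to iterate over the list of buffers.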
BTW, the parquet-python library can successfully read this kind of malformed Parquet file with this patch. We used it to recover our data from the malformed Parquet file.
I don't think this was fixed in 1.9.x; we're seeing the same issue on 1.14.1.
Parquet MR 1.8.2 does not support reading row groups which are larger than 2 GB. See: https://github.com/apache/parquet-mr/blob/parquet-1.8.x/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1064
We are seeing this when writing skewed records from Spark. The skew throws off the estimation of the memory check interval in the InternalParquetRecordWriter, so a single row group can grow far past the configured block size between two size checks; reading the resulting file back then fails with an exception in the read path.
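To make the mechanism concrete, here is a simplified standalone model of an adaptive "check the buffered size every N records" heuristic (a sketch only; the constants, names, and interval formula are assumptions, not the actual InternalParquetRecordWriter code):

```java
public class CheckIntervalModel {
    public static void main(String[] args) {
        final long blockSize = 128L * 1024 * 1024; // target row group size (128MB)
        long buffered = 0;                         // bytes buffered in the current row group
        long recordsInGroup = 0;
        long recordsSinceCheck = 0;
        long nextCheckInterval = 100;              // re-check after this many records

        // 1,000,000 small records (100 bytes) followed by 100 large ones (16MB each).
        // The interval is tuned while records are small, so the large records all
        // arrive between two checks and push the group far past the target size.
        for (long i = 0; i < 1_000_100L; i++) {
            long recordSize = (i < 1_000_000L) ? 100 : 16L * 1024 * 1024;
            buffered += recordSize;
            recordsInGroup++;
            recordsSinceCheck++;

            if (recordsSinceCheck >= nextCheckInterval) {
                recordsSinceCheck = 0;
                long avgRecordSize = buffered / recordsInGroup;
                if (buffered >= blockSize) {
                    System.out.printf("flushing row group of %,d bytes (target %,d)%n",
                            buffered, blockSize);
                    buffered = 0;
                    recordsInGroup = 0;
                }
                // Estimate how many more average-sized records fit before the target,
                // and only look again after half that many records.
                long remaining = Math.max(blockSize - buffered, 0L);
                nextCheckInterval = Math.max(remaining / Math.max(avgRecordSize, 1L) / 2, 100L);
            }
        }
        System.out.printf("still buffered at the end: %,d bytes (%.1fx the target)%n",
                buffered, (double) buffered / blockSize);
    }
}
```

In a run like this the buffered data ends up at more than ten times the target size before any check fires, which is how a skewed write can produce a column chunk that the 1.8.x read path then cannot load into a single array.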
~~This seems to be fixed by commit https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8 in parquet 1.9.x. Is there any chance that we can fix this in 1.8.x?~~
Reporter: Herman van Hövell
Note: This issue was originally created as PARQUET-980. Please see the migration documentation for further details.