Closed: asfimport closed this issue 7 years ago
Wes McKinney / @wesm: @mrocklin I tracked down the source of this bug.
There's a bug in parquet-mr 1.2.8 and lower in which the column chunk metadata in the Parquet file is incorrect. Impala inserted an explicit workaround for this (see https://github.com/apache/incubator-impala/blob/88448d1d4ab31eaaf82f764b36dc7d11d4c63c32/be/src/exec/hdfs-parquet-scanner.cc#L1227). You didn't hit this bug in the fastparquet Python implementation because fastparquet doesn't use the total_compressed_size field to read the entire column chunk into memory before beginning decoding.
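To illustrate why the two readers behave differently, here is a hedged toy sketch (none of this is the real parquet-cpp or fastparquet code, and the page format is invented for the example): an eager reader that trusts a chunk-level total size fails when that total is undercounted, while a streaming reader that only trusts per-page sizes never consults the bogus total.

```python
import io

# Fake column chunk: two "pages", each a 1-byte length prefix plus payload.
pages = [b"abc", b"defgh"]
chunk = b"".join(bytes([len(p)]) + p for p in pages)  # 10 bytes on disk

def read_eagerly(raw, total_compressed_size):
    # parquet-cpp-style approach: slurp total_compressed_size bytes up
    # front, then decode pages out of that buffer. If the recorded total
    # is short, the last page is truncated and decoding fails.
    buf = raw[:total_compressed_size]
    out, pos = [], 0
    while pos < len(buf):
        n = buf[pos]
        payload = buf[pos + 1:pos + 1 + n]
        if len(payload) < n:
            raise ValueError("column chunk truncated: metadata undercounts its size")
        out.append(payload)
        pos += 1 + n
    return out

def read_streaming(raw):
    # fastparquet-style approach: walk page headers one at a time and read
    # exactly what each header declares; the chunk-level total is ignored.
    stream = io.BytesIO(raw)
    out = []
    while (hdr := stream.read(1)):
        out.append(stream.read(hdr[0]))
    return out

print(read_streaming(chunk))  # [b'abc', b'defgh'] -- unaffected by the bad total
try:
    read_eagerly(chunk, len(chunk) - 4)  # simulate an undercounted total
except ValueError as e:
    print("eager read failed:", e)
```

With a correct total, `read_eagerly(chunk, len(chunk))` decodes both pages fine; only the undercounted metadata breaks it, which mirrors the behavior described above.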
In this particular file, the dictionary page header is 15 bytes, and the entire column chunk is:
15 (dict page header) + 277 (dictionary) + 17 (data page header) + 28 (data page) bytes, making 337 bytes.
But the metadata says the column chunk is only 322 bytes: the 15-byte dictionary page header was dropped from the accounting.
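For concreteness, the accounting above can be checked in a few lines (the numbers are taken from this comment; nothing here is parquet-cpp code):

```python
# Per-section sizes of the column chunk in the attached file, in bytes.
dict_page_header = 15
dictionary = 277
data_page_header = 17
data_page = 28

actual_chunk_size = dict_page_header + dictionary + data_page_header + data_page
reported_size = 322  # what the buggy parquet-mr writer recorded in the metadata

print(actual_chunk_size)                  # 337
print(actual_chunk_size - reported_size)  # 15, exactly the dict page header
```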
Matthew Rocklin / @mrocklin: All I can say is that I'm glad I didn't have to track that one down :)
Wes McKinney / @wesm: Issue resolved by pull request 209 https://github.com/apache/parquet-cpp/pull/209
See attached. This throws an exception when read:
However, I checked that I can read this file with Impala:
Reporter: Wes McKinney / @wesm
Assignee: Wes McKinney / @wesm
Note: This issue was originally created as PARQUET-816. Please see the migration documentation for further details.