apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++][Parquet] Failure decoding sample dict-encoded file from parquet-compatibility project #42298

Closed asfimport closed 7 years ago

asfimport commented 7 years ago

See attached. This throws an exception when read:

$ debug/parquet_reader nation.dict.parquet 
File statistics:
Version: 1
Created By: parquet-mr
Total rows: 25
Number of RowGroups: 1
Number of Real Columns: 4
Number of Columns: 4
Number of Selected Columns: 4
Column 0: nation_key (INT32)
Column 1: name (BYTE_ARRAY)
Column 2: region_key (INT32)
Column 3: comment_col (BYTE_ARRAY)
--- Row Group 0 ---
--- Total Bytes 0 ---
  rows: 25---
Column 0
, values: 25  Statistics Not Set
  compression: UNCOMPRESSED, encodings: 
  uncompressed size: 125, compressed size: 125
Column 1
, values: 25  Statistics Not Set
  compression: UNCOMPRESSED, encodings: 
  uncompressed size: 322, compressed size: 322
Column 2
, values: 25  Statistics Not Set
  compression: UNCOMPRESSED, encodings: 
  uncompressed size: 125, compressed size: 125
Column 3
, values: 25  Statistics Not Set
  compression: UNCOMPRESSED, encodings: 
  uncompressed size: 2002, compressed size: 2002
nation_key              name                    region_key              comment_col             
0                       Parquet error: Unexpected end of stream.

However, I checked that I can read this file with Impala:

In [13]: hdfs.put('/tmp/nation-dict-test/test.parq', 'nation.dict.parquet')
Out[13]: '/tmp/nation-dict-test/test.parq'

In [14]: pf = con.parquet_file('/tmp/nation-dict-test')

In [15]: pf.execute()
Out[15]: 
    nation_key            name  region_key  \
0            0         ALGERIA           0   
1            1       ARGENTINA           1   
2            2          BRAZIL           1   
3            3          CANADA           1   
4            4           EGYPT           4   
5            5        ETHIOPIA           0   
6            6          FRANCE           3   
7            7         GERMANY           3   
8            8           INDIA           2   
9            9       INDONESIA           2   
10          10            IRAN           4   
11          11            IRAQ           4   
12          12           JAPAN           2   
13          13          JORDAN           4   
14          14           KENYA           0   
15          15         MOROCCO           0   
16          16      MOZAMBIQUE           0   
17          17            PERU           1   
18          18           CHINA           2   
19          19         ROMANIA           3   
20          20    SAUDI ARABIA           4   
21          21         VIETNAM           2   
22          22          RUSSIA           3   
23          23  UNITED KINGDOM           3   
24          24   UNITED STATES           1   

                                          comment_col  
0    haggle. carefully final deposits detect slyly...  
1   al foxes promise slyly according to the regula...  
2   y alongside of the pending deposits. carefully...  
3   eas hang ironic, silent packages. slyly regula...  
4   y above the carefully unusual theodolites. fin...  
5                     ven packages wake quickly. regu  
6              refully final requests. regular, ironi  
7   l platelets. regular accounts x-ray: unusual, ...  
8   ss excuses cajole slyly across the packages. d...  
9    slyly express asymptotes. regular deposits ha...  
10  efully alongside of the slyly final dependenci...  
11  nic deposits boost atop the quickly final requ...  
12               ously. final, express gifts cajole a  
13  ic deposits are blithely about the carefully r...  
14   pending excuses haggle furiously deposits. pe...  
15  rns. blithely bold courts among the closely re...  
16      s. ironic, unusual asymptotes wake blithely r  
17  platelets. blithely pending dependencies use f...  
18  c dependencies. furiously express notornis sle...  
19  ular asymptotes are about the furious multipli...  
20  ts. silent requests haggle. closely express pa...  
21     hely enticingly express accounts. even, final   
22   requests against the platelets use never acco...  
23  eans boost carefully special requests. account...  
24  y final packages. slow foxes cajole quickly. q...  

Reporter: Wes McKinney / @wesm Assignee: Wes McKinney / @wesm

Note: This issue was originally created as PARQUET-816. Please see the migration documentation for further details.

asfimport commented 7 years ago

Wes McKinney / @wesm: @mrocklin I tracked down the source of this bug.

There's a bug in parquet-mr 1.2.8 and lower in which the column chunk metadata written to the Parquet file is incorrect. Impala inserted an explicit workaround for this (see https://github.com/apache/incubator-impala/blob/88448d1d4ab31eaaf82f764b36dc7d11d4c63c32/be/src/exec/hdfs-parquet-scanner.cc#L1227). You didn't hit this bug in the fastparquet Python implementation because you aren't using the total_compressed_size field to read the entire column chunk into memory before beginning decoding.

In this particular file, the dictionary page header is 15 bytes, and the entire column chunk is:

15 (dict page header) + 277 (dictionary) + 17 (data page header) + 28 (data page) bytes, making 337 bytes.

But the metadata says the column chunk is only 322 bytes – the dict page header size got dropped from the accounting.
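
To make the accounting concrete, here is a small Python sketch of the arithmetic above, followed by one possible way a reader could compensate. The byte counts are the ones quoted in this comment. The helper `padded_read_length` and its `max_header_pad` value are hypothetical, chosen only for illustration, and are not the actual code in parquet-cpp or Impala.

```python
# Byte counts quoted above for the "name" column chunk.
DICT_PAGE_HEADER = 15
DICT_PAGE_BODY = 277
DATA_PAGE_HEADER = 17
DATA_PAGE_BODY = 28

# Value stored in the column chunk metadata (total_compressed_size).
REPORTED_TOTAL_COMPRESSED_SIZE = 322

actual_size = (
    DICT_PAGE_HEADER + DICT_PAGE_BODY + DATA_PAGE_HEADER + DATA_PAGE_BODY
)
shortfall = actual_size - REPORTED_TOTAL_COMPRESSED_SIZE

print(f"actual column chunk size: {actual_size} bytes")
print(f"reported in metadata:     {REPORTED_TOTAL_COMPRESSED_SIZE} bytes")
print(f"missing from accounting:  {shortfall} bytes")


def padded_read_length(reported_size, chunk_start, file_length,
                       max_header_pad=100):
    """Hypothetical helper: how many bytes to read for a column chunk.

    Assumes the reported size may be too small by at most one dictionary
    page header. max_header_pad is an arbitrary illustrative upper bound,
    not a value taken from any real reader. The pad is clamped so the
    read never runs past the end of the file.
    """
    bytes_remaining = file_length - (chunk_start + reported_size)
    pad = max(0, min(max_header_pad, bytes_remaining))
    return reported_size + pad
```

A reader that pads the read this way would only want to do so for files whose "Created By" string identifies an affected parquet-mr version. For correctly written files the extra bytes are simply never consumed by the page decoder.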

asfimport commented 7 years ago

Matthew Rocklin / @mrocklin: All I can say is that I'm glad I didn't have to track that one down :)

asfimport commented 7 years ago

Wes McKinney / @wesm: PR: https://github.com/apache/parquet-cpp/pull/209

asfimport commented 7 years ago

Wes McKinney / @wesm: Issue resolved by pull request 209 https://github.com/apache/parquet-cpp/pull/209