apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

Segfaults and encoding issues in Python Parquet reads #16081

Closed asfimport closed 7 years ago

asfimport commented 7 years ago

I've conda-installed pyarrow and am trying to read data from the parquet-compatibility project. I haven't explicitly built parquet-cpp or anything and may or may not have old versions lying around, so please take this issue with a grain of salt:

In [1]: import pyarrow.parquet

In [2]: t = pyarrow.parquet.read_table('nation.plain.parquet')
---------------------------------------------------------------------------
ArrowException                            Traceback (most recent call last)
<ipython-input-2-5d966681a384> in <module>()
----> 1 t = pyarrow.parquet.read_table('nation.plain.parquet')

/home/mrocklin/Software/anaconda/lib/python3.5/site-packages/pyarrow/parquet.pyx in pyarrow.parquet.read_table (/feedstock_root/build_artefacts/work/arrow-79344b335849c2eb43954b0751018051814019d6/python/build/temp.linux-x86_64-3.5/parquet.cxx:2783)()

/home/mrocklin/Software/anaconda/lib/python3.5/site-packages/pyarrow/parquet.pyx in pyarrow.parquet.ParquetReader.read_all (/feedstock_root/build_artefacts/work/arrow-79344b335849c2eb43954b0751018051814019d6/python/build/temp.linux-x86_64-3.5/parquet.cxx:2200)()

/home/mrocklin/Software/anaconda/lib/python3.5/site-packages/pyarrow/error.pyx in pyarrow.error.check_status (/feedstock_root/build_artefacts/work/arrow-79344b335849c2eb43954b0751018051814019d6/python/build/temp.linux-x86_64-3.5/error.cxx:1185)()

ArrowException: NotImplemented: list<: uint8>

Additionally I tried to read data from a Python file-like object pointing to data on S3. Let me know if you'd prefer a separate issue.

In [1]: import s3fs

In [2]: fs = s3fs.S3FileSystem()

In [3]: f = fs.open('dask-data/nyc-taxi/2015/parquet/part.0.parquet')

In [4]: f.read(100)
Out[4]: b'PAR1\x15\x00\x15\x90\xc4\xa2\x12\x15\x90\xc4\xa2\x12,\x15\xc2\xa8\xa4\x02\x15\x00\x15\x06\x15\x08\x00\x00\x00\x80\xbf\xe7\x8b\x0b\x05\x00\x00\x80\xbf\xe7\x8b\x0b\x05\x00\x00\x80\xbf\xe7\x8b\x0b\x05\x00@\xc2\xce\xe7\x8b\x0b\x05\x00\xc0F\xed\xe7\x8b\x0b\x05\x00\xc0F\xed\xe7\x8b\x0b\x05\x00\x00\x89\xfc\xe7\x8b\x0b\x05\x00@\xcb\x0b\xe8\x8b\x0b\x05\x00\x80\r\x1b\xe8\x8b\x0b'

In [5]: import pyarrow.parquet

In [6]: t = pyarrow.parquet.read_table(f)
Segmentation fault (core dumped)

Here is a more reproducible version:

In [1]: with open('nation.plain.parquet', 'rb') as f:
   ...:     data = f.read()
   ...:     

In [2]: from io import BytesIO

In [3]: f = BytesIO(data)

In [4]: f.seek(0)
Out[4]: 0

In [5]: import pyarrow.parquet

In [6]: t = pyarrow.parquet.read_table(f)
Segmentation fault (core dumped)
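
As an aside, the b'PAR1' prefix visible in the S3 read above is the Parquet magic marker; the format places the same four bytes at both the head and the foot of a file. A caller can cheaply sanity-check a file-like object before handing it to a reader — a minimal sketch (the helper name is mine, not part of pyarrow):

```python
import io

def looks_like_parquet(f):
    """Heuristic: Parquet files begin and end with the 4-byte magic b'PAR1'."""
    pos = f.tell()
    try:
        f.seek(0)
        head = f.read(4)
        f.seek(-4, io.SEEK_END)
        tail = f.read(4)
    finally:
        f.seek(pos)  # restore the caller's position
    return head == b"PAR1" and tail == b"PAR1"

buf = io.BytesIO(b"PAR1" + b"\x00" * 32 + b"PAR1")
print(looks_like_parquet(buf))                          # True
print(looks_like_parquet(io.BytesIO(b"not a parquet file")))  # False
```

This won't catch a truncated footer mid-file, but it rules out the obviously wrong inputs.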

I was, however, pleased with the round-trip functionality within this project, which worked very smoothly.

Environment: Ubuntu, Python 3.5, pyarrow installed from conda-forge
Reporter: Matthew Rocklin / @mrocklin
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-434. Please see the migration documentation for further details.

asfimport commented 7 years ago

Wes McKinney / @wesm: Can you give me access to the data in the S3 bucket?

asfimport commented 7 years ago

Wes McKinney / @wesm: The last two errors are probably the same bug, so I should be able to figure it out without the S3 data. Will report back

asfimport commented 7 years ago

Wes McKinney / @wesm: Found the problem causing the segfault, patch forthcoming. @xhochy is going to look into PARQUET-812

asfimport commented 7 years ago

Wes McKinney / @wesm: PR: https://github.com/apache/arrow/pull/247

When PARQUET-812 is in, I'll update the conda-forge artifacts so you can verify the use case on your environment

asfimport commented 7 years ago

Wes McKinney / @wesm: Issue resolved by pull request 247 https://github.com/apache/arrow/pull/247

asfimport commented 7 years ago

Wes McKinney / @wesm: We hadn't yet dealt with binary (or non-UTF8 string) data, so there were a couple of things to fix there. ARROW-374 (https://github.com/apache/arrow/pull/249) and PARQUET-812 (https://github.com/apache/parquet-cpp/pull/206) are in code review, so it will take a day or so for updated packages to hit conda-forge. In any case, I have:

In [5]: data = open('/home/wesm/Downloads/nation.impala.parquet', 'rb').read()

In [6]: import io

In [7]: buf = io.BytesIO(data)

In [8]: import pyarrow.parquet as pq

In [9]: table = pq.read_table(buf)

In [10]: table.schema
Out[10]: 
n_nationkey: int32
n_name: binary
n_regionkey: int32
n_comment: binary

In [11]: table.to_pandas()
Out[11]: 
    n_nationkey          n_name  n_regionkey  \
0             0         ALGERIA            0   
1             1       ARGENTINA            1   
2             2          BRAZIL            1   
3             3          CANADA            1   
4             4           EGYPT            4   
5             5        ETHIOPIA            0   
6             6          FRANCE            3   
7             7         GERMANY            3   
8             8           INDIA            2   
9             9       INDONESIA            2   
10           10            IRAN            4   
11           11            IRAQ            4   
12           12           JAPAN            2   
13           13          JORDAN            4   
14           14           KENYA            0   
15           15         MOROCCO            0   
16           16      MOZAMBIQUE            0   
17           17            PERU            1   
18           18           CHINA            2   
19           19         ROMANIA            3   
20           20    SAUDI ARABIA            4   
21           21         VIETNAM            2   
22           22          RUSSIA            3   
23           23  UNITED KINGDOM            3   
24           24   UNITED STATES            1   

                                            n_comment  
0    haggle. carefully final deposits detect slyly...  
1   al foxes promise slyly according to the regula...  
2   y alongside of the pending deposits. carefully...  
3   eas hang ironic, silent packages. slyly regula...  
4   y above the carefully unusual theodolites. fin...  
5                     ven packages wake quickly. regu  
6              refully final requests. regular, ironi  
7   l platelets. regular accounts x-ray: unusual, ...  
8   ss excuses cajole slyly across the packages. d...  
9    slyly express asymptotes. regular deposits ha...  
10  efully alongside of the slyly final dependenci...  
11  nic deposits boost atop the quickly final requ...  
12               ously. final, express gifts cajole a  
13  ic deposits are blithely about the carefully r...  
14   pending excuses haggle furiously deposits. pe...  
15  rns. blithely bold courts among the closely re...  
16      s. ironic, unusual asymptotes wake blithely r  
17  platelets. blithely pending dependencies use f...  
18  c dependencies. furiously express notornis sle...  
19  ular asymptotes are about the furious multipli...  
20  ts. silent requests haggle. closely express pa...  
21     hely enticingly express accounts. even, final   
22   requests against the platelets use never acco...  
23  eans boost carefully special requests. account...  
24  y final packages. slow foxes cajole quickly. q... 
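
Since n_name and n_comment come back as binary rather than string, their values surface as Python bytes objects after to_pandas(); when the payload is known to be UTF-8, the caller can decode afterwards. A standalone sketch, with a hypothetical frame standing in for table.to_pandas() above:

```python
import pandas as pd

# Stand-in for table.to_pandas(): binary Arrow columns arrive as bytes.
df = pd.DataFrame({"n_name": [b"ALGERIA", b"ARGENTINA"]})

# Decode the bytes column to str, assuming UTF-8 payloads.
df["n_name"] = df["n_name"].str.decode("utf-8")
print(df["n_name"].tolist())  # ['ALGERIA', 'ARGENTINA']
```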

asfimport commented 7 years ago

Wes McKinney / @wesm: artifacts are updated in conda-forge

asfimport commented 7 years ago

Matthew Rocklin / @mrocklin: Cool, verified that it works on my end. The taxi data on s3fs is still failing with an encoding error. I've been having difficulty managing permissions on S3 to make this publicly available (just ignorance on my part). In the meantime, here's the status of the files in the parquet-compatibility project:

In [1]: import pyarrow.parquet

In [2]: from glob import glob

In [3]: filenames = sorted(glob('*.parquet'))

In [4]: filenames
Out[4]: 
['customer.impala.parquet',
 'foo.parquet',
 'gzip-nation.impala.parquet',
 'nation.dict.parquet',
 'nation.impala.parquet',
 'nation.plain.parquet',
 'snappy-nation.impala.parquet',
 'test-converted-type-null.parquet',
 'test-null-dictionary.parquet',
 'test-null.parquet',
 'test.parquet']

In [5]: for fn in filenames:
   ...:     try:
   ...:         t = pyarrow.parquet.read_table(fn)
   ...:     except Exception as e:
   ...:         print('Failed on', fn, e)
   ...:     else:
   ...:         print("Succeeded on", fn)
   ...:         
   ...:     
Succeeded on customer.impala.parquet
Succeeded on foo.parquet
Succeeded on gzip-nation.impala.parquet
Failed on nation.dict.parquet IOError: Unexpected end of stream.
Succeeded on nation.impala.parquet
Succeeded on nation.plain.parquet
Succeeded on snappy-nation.impala.parquet
Succeeded on test-converted-type-null.parquet
Succeeded on test-null-dictionary.parquet
Succeeded on test-null.parquet
Succeeded on test.parquet

In [6]: pyarrow.parquet.read_table('nation.dict.parquet')
---------------------------------------------------------------------------
ArrowException                            Traceback (most recent call last)
<ipython-input-6-5c2e833b21a9> in <module>()
----> 1 pyarrow.parquet.read_table('nation.dict.parquet')

/home/mrocklin/Software/anaconda/envs/arrow-test/lib/python3.5/site-packages/pyarrow/parquet.pyx in pyarrow.parquet.read_table (/feedstock_root/build_artefacts/work/arrow-268ffbeffb1cd0617e52d381d500a2d10f61124c/python/build/temp.linux-x86_64-3.5/parquet.cxx:2907)()

/home/mrocklin/Software/anaconda/envs/arrow-test/lib/python3.5/site-packages/pyarrow/parquet.pyx in pyarrow.parquet.ParquetReader.read_all (/feedstock_root/build_artefacts/work/arrow-268ffbeffb1cd0617e52d381d500a2d10f61124c/python/build/temp.linux-x86_64-3.5/parquet.cxx:2275)()

/home/mrocklin/Software/anaconda/envs/arrow-test/lib/python3.5/site-packages/pyarrow/error.pyx in pyarrow.error.check_status (/feedstock_root/build_artefacts/work/arrow-268ffbeffb1cd0617e52d381d500a2d10f61124c/python/build/temp.linux-x86_64-3.5/error.cxx:1197)()

ArrowException: IOError: Unexpected end of stream.

asfimport commented 7 years ago

Wes McKinney / @wesm: I will look into the taxi data issue if you can get me access to the file (Dropbox/Google Drive is fine too to share).

Where did "nation.dict.parquet" come from originally? I see it in jcrobak's github repo, but I don't see it in github.com/parquet/parquet-compatibility.

asfimport commented 7 years ago

Wes McKinney / @wesm: I reported the decoding issue in PARQUET-816