Closed by asfimport.
Wes McKinney / @wesm: Can you give me access to the data in the S3 bucket?
Wes McKinney / @wesm: The last two errors are probably the same bug, so I should be able to figure it out without the S3 data. Will report back
Wes McKinney / @wesm: Found the problem causing the segfault, patch forthcoming. @xhochy is going to look into PARQUET-812
Wes McKinney / @wesm: PR: https://github.com/apache/arrow/pull/247
When PARQUET-812 is in, I'll update the conda-forge artifacts so you can verify the use case on your environment
Wes McKinney / @wesm: Issue resolved by pull request 247 https://github.com/apache/arrow/pull/247
Wes McKinney / @wesm: We hadn't yet dealt with binary (or non-UTF-8 string) data, so there were a couple of things to do there. ARROW-374 (https://github.com/apache/arrow/pull/249) and PARQUET-812 (https://github.com/apache/parquet-cpp/pull/206) are in code review, so it will take a day or so for updated packages to hit conda-forge, but in any case I have:
In [5]: data = open('/home/wesm/Downloads/nation.impala.parquet', 'rb').read()
In [6]: import io
In [7]: buf = io.BytesIO(data)
In [8]: import pyarrow.parquet as pq
In [9]: table = pq.read_table(buf)
In [10]: table.schema
Out[10]:
n_nationkey: int32
n_name: binary
n_regionkey: int32
n_comment: binary
In [11]: table.to_pandas()
Out[11]:
    n_nationkey          n_name  n_regionkey  \
0             0         ALGERIA            0
1             1       ARGENTINA            1
2             2          BRAZIL            1
3             3          CANADA            1
4             4           EGYPT            4
5             5        ETHIOPIA            0
6             6          FRANCE            3
7             7         GERMANY            3
8             8           INDIA            2
9             9       INDONESIA            2
10           10            IRAN            4
11           11            IRAQ            4
12           12           JAPAN            2
13           13          JORDAN            4
14           14           KENYA            0
15           15         MOROCCO            0
16           16      MOZAMBIQUE            0
17           17            PERU            1
18           18           CHINA            2
19           19         ROMANIA            3
20           20    SAUDI ARABIA            4
21           21         VIETNAM            2
22           22          RUSSIA            3
23           23  UNITED KINGDOM            3
24           24   UNITED STATES            1

                                             n_comment
0     haggle. carefully final deposits detect slyly...
1    al foxes promise slyly according to the regula...
2    y alongside of the pending deposits. carefully...
3    eas hang ironic, silent packages. slyly regula...
4    y above the carefully unusual theodolites. fin...
5                      ven packages wake quickly. regu
6               refully final requests. regular, ironi
7    l platelets. regular accounts x-ray: unusual, ...
8    ss excuses cajole slyly across the packages. d...
9     slyly express asymptotes. regular deposits ha...
10   efully alongside of the slyly final dependenci...
11   nic deposits boost atop the quickly final requ...
12                 ously. final, express gifts cajole a
13   ic deposits are blithely about the carefully r...
14    pending excuses haggle furiously deposits. pe...
15   rns. blithely bold courts among the closely re...
16       s. ironic, unusual asymptotes wake blithely r
17   platelets. blithely pending dependencies use f...
18   c dependencies. furiously express notornis sle...
19   ular asymptotes are about the furious multipli...
20   ts. silent requests haggle. closely express pa...
21       hely enticingly express accounts. even, final
22    requests against the platelets use never acco...
23   eans boost carefully special requests. account...
24   y final packages. slow foxes cajole quickly. q...
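(Editor's note: since the schema above reports n_name and n_comment as binary, to_pandas() returns those columns as Python bytes rather than str. A minimal, pandas-only sketch of decoding such a column, using illustrative values rather than the actual file contents:)

```python
import pandas as pd

# Stand-in for a binary column returned by Table.to_pandas();
# Parquet `binary` columns come back as bytes objects.
df = pd.DataFrame({'n_name': [b'ALGERIA', b'ARGENTINA', b'BRAZIL']})

# Decode the bytes column to str explicitly.
df['n_name'] = df['n_name'].str.decode('utf-8')
```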
Wes McKinney / @wesm: artifacts are updated in conda-forge
Matthew Rocklin / @mrocklin: Cool, verified that it works on my end. The taxi data on s3fs is still failing with an encoding error. I've been having difficulty managing permissions on S3 to make this publicly available (just ignorance on my part). In the meantime, here's the status of the files in the parquet-compatibility project:
In [1]: import pyarrow.parquet
In [2]: from glob import glob
In [3]: filenames = sorted(glob('*.parquet'))
In [4]: filenames
Out[4]:
['customer.impala.parquet',
 'foo.parquet',
 'gzip-nation.impala.parquet',
 'nation.dict.parquet',
 'nation.impala.parquet',
 'nation.plain.parquet',
 'snappy-nation.impala.parquet',
 'test-converted-type-null.parquet',
 'test-null-dictionary.parquet',
 'test-null.parquet',
 'test.parquet']
In [5]: for fn in filenames:
   ...:     try:
   ...:         t = pyarrow.parquet.read_table(fn)
   ...:     except Exception as e:
   ...:         print('Failed on', fn, e)
   ...:     else:
   ...:         print("Succeeded on", fn)
   ...:
Succeeded on customer.impala.parquet
Succeeded on foo.parquet
Succeeded on gzip-nation.impala.parquet
Failed on nation.dict.parquet IOError: Unexpected end of stream.
Succeeded on nation.impala.parquet
Succeeded on nation.plain.parquet
Succeeded on snappy-nation.impala.parquet
Succeeded on test-converted-type-null.parquet
Succeeded on test-null-dictionary.parquet
Succeeded on test-null.parquet
Succeeded on test.parquet
In [6]: pyarrow.parquet.read_table('nation.dict.parquet')
---------------------------------------------------------------------------
ArrowException Traceback (most recent call last)
<ipython-input-6-5c2e833b21a9> in <module>()
----> 1 pyarrow.parquet.read_table('nation.dict.parquet')
/home/mrocklin/Software/anaconda/envs/arrow-test/lib/python3.5/site-packages/pyarrow/parquet.pyx in pyarrow.parquet.read_table (/feedstock_root/build_artefacts/work/arrow-268ffbeffb1cd0617e52d381d500a2d10f61124c/python/build/temp.linux-x86_64-3.5/parquet.cxx:2907)()
/home/mrocklin/Software/anaconda/envs/arrow-test/lib/python3.5/site-packages/pyarrow/parquet.pyx in pyarrow.parquet.ParquetReader.read_all (/feedstock_root/build_artefacts/work/arrow-268ffbeffb1cd0617e52d381d500a2d10f61124c/python/build/temp.linux-x86_64-3.5/parquet.cxx:2275)()
/home/mrocklin/Software/anaconda/envs/arrow-test/lib/python3.5/site-packages/pyarrow/error.pyx in pyarrow.error.check_status (/feedstock_root/build_artefacts/work/arrow-268ffbeffb1cd0617e52d381d500a2d10f61124c/python/build/temp.linux-x86_64-3.5/error.cxx:1197)()
ArrowException: IOError: Unexpected end of stream.
Wes McKinney / @wesm: I will look into the taxi data issue if you can get me access to the file (Dropbox/Google Drive is fine too to share).
Where did "nation.dict.parquet" come from originally? I see it in jcrobak's github repo, but I don't see it in github.com/parquet/parquet-compatibility.
Wes McKinney / @wesm: I reported the decoding issue in PARQUET-816
I've conda-installed pyarrow and am trying to read data from the parquet-compatibility project. I haven't explicitly built parquet-cpp or anything and may or may not have old versions lying around, so please take this issue with a grain of salt:
Additionally I tried to read data from a Python file-like object pointing to data on S3. Let me know if you'd prefer a separate issue.
Here is a more reproducible version:
I was, however, pleased with the round-trip functionality within this project, which worked very well.
Environment: Ubuntu, Python 3.5, pyarrow installed from conda-forge
Reporter: Matthew Rocklin / @mrocklin
Assignee: Wes McKinney / @wesm
Related issues:
Note: This issue was originally created as ARROW-434. Please see the migration documentation for further details.