apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

[Python] [Rust] Parquet read file fails with batch size 1_000_000 and 41 row groups #2491

Open asfimport opened 4 years ago

asfimport commented 4 years ago

Here is the error I got:

Pyarrow:


>>> df = pd.read_parquet("test.parquet", engine="pyarrow")
 Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 296, in read_parquet
 return impl.read(path, columns=columns, **kwargs)
 File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 125, in read
 path, columns=columns, **kwargs
 File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 1281, in read_table
 use_pandas_metadata=use_pandas_metadata)
 File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 1137, in read
 use_pandas_metadata=use_pandas_metadata)
 File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 605, in read
 table = reader.read(**options)
 File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 253, in read
 use_threads=use_threads)
 File "pyarrow/_parquet.pyx", line 1136, in pyarrow._parquet.ParquetReader.read_all
 File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
 OSError: Unexpected end of stream

fastparquet:

 >>> df = pd.read_parquet("test.parquet", engine="fastparquet")
 /home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/encoding.py:222: NumbaDeprecationWarning: The 'numba.jitclass' decorator has moved to 'numba.experimental.jitclass' to better reflect the experimental nature of the functionality. Please update your imports to accommodate this change and see <http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#change-of-jitclass-location> for the time frame.
 Numpy8 = numba.jitclass(spec8)(NumpyIO)
 /home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/encoding.py:224: NumbaDeprecationWarning: The 'numba.jitclass' decorator has moved to 'numba.experimental.jitclass' to better reflect the experimental nature of the functionality. Please update your imports to accommodate this change and see <http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#change-of-jitclass-location> for the time frame.
 Numpy32 = numba.jitclass(spec32)(NumpyIO)
 Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 296, in read_parquet
 return impl.read(path, columns=columns, **kwargs)
 File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 201, in read
 return parquet_file.to_pandas(columns=columns, **kwargs)
 File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/api.py", line 399, in to_pandas
 index=index, assign=parts)
 File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/api.py", line 228, in read_row_group
 scheme=self.file_scheme)
 File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/core.py", line 354, in read_row_group
 cats, selfmade, assign=assign)
 File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/core.py", line 331, in read_row_group_arrays
 catdef=out.get(name+'-catdef', None))
 File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/core.py", line 245, in read_col
 skip_nulls, selfmade=selfmade)
 File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/core.py", line 99, in read_data_page
 raw_bytes = _read_page(f, header, metadata)
 File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/core.py", line 31, in _read_page
 page_header.uncompressed_page_size)
 AssertionError: found 120016208 raw bytes (expected None)

The corresponding Rust code is:


use parquet::{
    column::writer::ColumnWriter::BoolColumnWriter,
    column::writer::ColumnWriter::Int32ColumnWriter,
    file::{
        properties::WriterProperties,
        writer::{FileWriter, SerializedFileWriter},
    },
    schema::parser::parse_message_type,
};
use std::{fs, rc::Rc};

fn main() {
    let schema = "
        message schema {
            REQUIRED INT32 a;
            REQUIRED BOOLEAN b;
        }
    ";

    let schema = Rc::new(parse_message_type(schema).unwrap());
    let props = Rc::new(
        WriterProperties::builder()
            .set_statistics_enabled(false)
            .set_dictionary_enabled(false)
            .build(),
    );
    let file = fs::File::create("test.parquet").unwrap();
    let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();
    let batch_size = 1_000_000;
    let mut data = vec![];
    let mut data_bool = vec![];
    for i in 0..batch_size {
        data.push(i);
        data_bool.push(true);
    }

    let mut j = 0;
    loop {
        let mut row_group_writer = writer.next_row_group().unwrap();

        let mut col_writer = row_group_writer.next_column().unwrap().unwrap();
        if let Int32ColumnWriter(ref mut typed_writer) = col_writer {
            typed_writer.write_batch(&data, None, None).unwrap();
        } else {
            panic!();
        }
        row_group_writer.close_column(col_writer).unwrap();

        let mut col_writer = row_group_writer.next_column().unwrap().unwrap();
        if let BoolColumnWriter(ref mut typed_writer) = col_writer {
            typed_writer.write_batch(&data_bool, None, None).unwrap();
        } else {
            panic!();
        }
        row_group_writer.close_column(col_writer).unwrap();

        writer.close_row_group(row_group_writer).unwrap();

        j += 1;
        if j * batch_size > 40_000_000 {
            break;
        }
    }
    writer.close().unwrap();
}
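For reference, the loop above writes one row group of batch_size rows per iteration and breaks once j * batch_size exceeds 40_000_000, which works out to the 41 row groups mentioned in the issue title. A quick check of that arithmetic (mine, not part of the original report):

```python
# Mirror of the Rust loop's termination condition: count how many
# row groups of batch_size rows are written before the loop breaks.
batch_size = 1_000_000
j = 0
groups = 0
while True:
    groups += 1  # one row group written per iteration
    j += 1
    if j * batch_size > 40_000_000:
        break
print(groups)  # 41 -- matching the "41 row groups" in the title
```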


Reporter: Novice

Related issues:

Note: This issue was originally created as PARQUET-1858. Please see the migration documentation for further details.

asfimport commented 4 years ago

Wes McKinney / @wesm: The PLAIN encoding for the boolean type is possibly malformed. I opened PARQUET-1859 about providing better error messages, but here is the actual failure:


$ python test.py 
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    pq.read_table(path)
  File "/home/wesm/code/arrow/python/pyarrow/parquet.py", line 1539, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/home/wesm/code/arrow/python/pyarrow/parquet.py", line 1264, in read
    use_pandas_metadata=use_pandas_metadata)
  File "/home/wesm/code/arrow/python/pyarrow/parquet.py", line 707, in read
    table = reader.read(**options)
  File "/home/wesm/code/arrow/python/pyarrow/parquet.py", line 337, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1130, in pyarrow._parquet.ParquetReader.read_all
    check_status(self.reader.get()
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
    raise IOError(message)
OSError: Unexpected end of stream: Failed to decode 1000000 bits for boolean PLAIN encoding only decoded 2048
In ../src/parquet/arrow/reader.cc, line 844, code: final_status

Can this file be read by the Java library?

asfimport commented 4 years ago

Novice: Did you mean Rust? :)

I haven't tried, my workflow is write using Rust and read from Python.

asfimport commented 4 years ago

Wes McKinney / @wesm: Yes, it looks like the file written by Rust is malformed. The fact that two independent implementations fail on it is good evidence of that.

asfimport commented 3 years ago

ii: This is blocking me pretty hard right now, especially since I can't work around it by setting my boolean columns to use RLE because pyarrow doesn't seem to support that encoding.
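As context for why RLE would have been attractive here: in Parquet's RLE/bit-packed hybrid encoding, a run of identical boolean values collapses to a ULEB128 run-length header plus a single value byte. A rough sketch of just the run-encoding path, following the layout in the Parquet format spec (a hypothetical helper for illustration, not fastparquet or pyarrow API):

```python
def encode_rle_run(count, value):
    """Encode one RLE run of a boolean value, per the Parquet RLE/bit-packed
    hybrid layout: header = count << 1 (LSB 0 marks an RLE run), written as
    a ULEB128 varint, followed by the repeated value in a single byte."""
    header = count << 1
    out = bytearray()
    while True:
        byte = header & 0x7F
        header >>= 7
        if header:
            out.append(byte | 0x80)  # continuation bit set, more header bytes follow
        else:
            out.append(byte)
            break
    out.append(1 if value else 0)
    return bytes(out)

# A row group of 1,000,000 identical booleans collapses to just a few bytes:
print(len(encode_rle_run(1_000_000, True)))  # 4
```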

Is there anything I can do to help? I've tried dumping the parquet file generated by my Rust code using parquet-tools cat -j and it seems to work fine, including all the boolean values.