apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Rust] Reading parquet file is slow #23112

Closed. asfimport closed this issue 3 years ago.

asfimport commented 4 years ago

Using the example at https://github.com/apache/arrow/tree/master/rust/parquet is slow.

The following snippet

use std::time::Instant;

use parquet::file::reader::{FileReader, SerializedFileReader};

// `file` is an already opened std::fs::File pointing at the parquet file.
let reader = SerializedFileReader::new(file).unwrap();
let mut iter = reader.get_row_iter(None).unwrap();
let start = Instant::now();
while let Some(_record) = iter.next() {}
let duration = start.elapsed();
println!("{:?}", duration);

runs for 17 seconds on a ~160 MB parquet file.

If there is a more efficient way to load a parquet file, it would be nice to add it to the readme.

P.S.: My goal is to construct an ndarray from the data; I'd be happy to get any tips.

Reporter: Adam Lippai / @alippai

Note: This issue was originally created as ARROW-6774. Please see the migration documentation for further details.

asfimport commented 4 years ago

Wes McKinney / @wesm: Row-by-row iteration is going to be slow compared with vectorized / column-by-column reads. This unfinished PR was related to this (I think?), but there are Arrow-based readers available that don't require it:

https://github.com/apache/arrow/pull/3461
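
For illustration, here is a minimal sketch of such a column-by-column read using the crate's low-level ColumnReader API. The column index, batch size, and the assumption of a required (non-nullable) Float64 column are placeholders, not from this issue, and exact signatures vary between parquet crate versions:

use std::fs::File;

use parquet::column::reader::ColumnReader;
use parquet::file::reader::{FileReader, RowGroupReader, SerializedFileReader};

// Scan a single Float64 column in large batches instead of materializing rows.
// Assumes the column is required (non-nullable); nullable columns also need
// definition-level buffers passed to read_batch.
fn read_double_column(file: File, column_index: usize) {
    let reader = SerializedFileReader::new(file).unwrap();
    for rg in 0..reader.metadata().num_row_groups() {
        let row_group = reader.get_row_group(rg).unwrap();
        if let ColumnReader::DoubleColumnReader(mut col) =
            row_group.get_column_reader(column_index).unwrap()
        {
            let mut values = vec![0f64; 8192];
            loop {
                // read_batch decodes up to 8192 values directly into `values`.
                let (values_read, _levels_read) =
                    col.read_batch(8192, None, None, &mut values).unwrap();
                if values_read == 0 {
                    break;
                }
            }
        }
    }
}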

asfimport commented 4 years ago

Adam Lippai / @alippai: I've seen some nice work in https://github.com/apache/arrow/blob/master/rust/parquet/src/column/reader.rs and https://github.com/apache/arrow/blob/master/rust/parquet/src/arrow/array_reader.rs, but I couldn't figure out how to use it. @liurenjie1024, can you perhaps help me?

asfimport commented 4 years ago

Renjie Liu / @liurenjie1024: This is part of a reader for reading parquet files into arrow arrays. It's almost complete, and we still have one PR (https://github.com/apache/arrow/pull/5523) waiting for review, which contains documentation and examples.
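
As a rough usage sketch, reading a file into Arrow record batches with that reader looks roughly like this (the names follow the ParquetFileArrowReader / get_record_reader API of the crate at that time; the batch size is arbitrary and import paths may differ in later releases):

use std::fs::File;
use std::rc::Rc;

use arrow::record_batch::RecordBatchReader;
use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

// Decode the file column-by-column into Arrow RecordBatches of 50k rows each.
fn read_into_record_batches(file: File) {
    let file_reader = SerializedFileReader::new(file).unwrap();
    let mut arrow_reader = ParquetFileArrowReader::new(Rc::new(file_reader));
    let mut batches = arrow_reader.get_record_reader(50_000).unwrap();
    while let Some(batch) = batches.next_batch().unwrap() {
        // Each batch holds one Arrow array per column.
        println!("read a batch with {} rows", batch.num_rows());
    }
}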

asfimport commented 4 years ago

Adam Lippai / @alippai: While it doesn't support reading the Utf8 type yet, dropping that column and then reading the same file takes less than 3 seconds! Thank you for the contribution. (Data shape: 10k rows * 3k Float64 columns.)

asfimport commented 4 years ago

Neville Dipale / @nevi-me: Hi @alippai, UTF8 types are now supported. Is the performance still a concern, or can we close this?

asfimport commented 3 years ago

Sietse Brouwer: I'm not sure what test data @alippai used, so I used a test data set with 500k rows and two columns:

In all three cases running the snippet took almost exactly 150 seconds, give or take one second.

Does that help you decide whether to close the question, @nevi-me? Or perhaps your comment, Adam, from 2019-10-07 used some other version to get that speed improvement? Should I change the test to use the ParquetFileArrowReader example in https://github.com/apache/arrow/blob/3fae71b10c42/rust/parquet/src/arrow/mod.rs#L25-L50, and then this issue can be closed if that one is faster?

asfimport commented 3 years ago

Adam Lippai / @alippai: [~sietsebb] I used the Arrow reader method; at that time it didn't support all the types I needed, but that was added later. It's definitely faster, though I don't remember the exact benchmark numbers.

asfimport commented 3 years ago

Sietse Brouwer: @alippai, I can't get parquet::arrow::ParquetFileArrowReader to be faster than parquet::file::reader::SerializedFileReader under commit 3fae71b10c42. Timings are below, the code is below that, and conclusions are at the bottom.

 

| n_rows | include utf8 column | reader | iteration unit (loop does not iterate over rows within batches) | time taken |
| --- | --- | --- | --- | --- |
| 50_000 | yes | ParquetFileArrowReader | 1 batch of 50k rows | 14.9s |
| 50_000 | yes | ParquetFileArrowReader | 10 batches of 5k rows | 14.8s |
| 50_000 | yes | ParquetFileArrowReader | 50k batches of 1 row | 24.0s |
| 50_000 | yes | SerializedFileReader | get_row_iter | 14.5s |
| 50_000 | no | ParquetFileArrowReader | 1 batch of 50k rows | 143ms |
| 50_000 | no | ParquetFileArrowReader | 10 batches of 5k rows | 154ms |
| 50_000 | no | ParquetFileArrowReader | 50k batches of 1 row | 6.5s |
| 50_000 | no | SerializedFileReader | get_row_iter | 211ms |



 

Here is the code I used to load the dataset with ParquetFileArrowReader (see also this version of main.rs):

 
fn read_with_arrow(file: File) -> () {
    let file_reader = SerializedFileReader::new(file).unwrap();
    let mut arrow_reader = ParquetFileArrowReader::new(Rc::new(file_reader));
    println!("Arrow schema is: {}", arrow_reader.get_schema().unwrap());
    let mut record_batch_reader = arrow_reader
        .get_record_reader(/* batch size */ 50000)
        .unwrap();

    let start = Instant::now();
    while let Some(_record) = record_batch_reader.next_batch().unwrap() {
        // no-op
    }
    let duration = start.elapsed();

    println!("{:?}", duration);
}
 

Main observations:
- We can't tell whether the slow loading when we include the UTF8 column is because UTF8 is slow to process, or because the column is very big (100 random Russian words per cell).
- When the big UTF-8 column is included, iterating over every row with SerializedFileReader is as fast as iterating over a few batches with ParquetFileArrowReader, even when you skip the rows within the batches! (A sketch of what touching those rows would look like follows this list.)
- Should I try this again with the earlier shape (10k rows * 3k Float64 columns) plus one small UTF-8 column?
- I'm not even sure what result I'm trying to reproduce or falsify here: whether adding a small UTF-8 column causes a disproportionate slowdown, or whether switching between SerializedFileReader and ParquetFileArrowReader causes the slowdown. Right now I feel like everything and nothing is in scope of the issue, and I wouldn't mind if somebody made it narrower and clearer.
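
To make the "rows within the batches" point concrete, here is a minimal sketch of what touching every value inside a batch could look like (it assumes column 0 is a Float64 column; the column index and the summation are purely illustrative):

use arrow::array::{Array, Float64Array};
use arrow::record_batch::RecordBatch;

// Visit every value of the first column of one RecordBatch (assumed Float64).
fn sum_first_column(batch: &RecordBatch) -> f64 {
    let col = batch
        .column(0)
        .as_any()
        .downcast_ref::<Float64Array>()
        .expect("column 0 is not Float64");
    let mut total = 0.0;
    for i in 0..col.len() {
        if !col.is_null(i) {
            total += col.value(i);
        }
    }
    total
}

Values read this way could also be collected into a Vec<f64> and handed to the ndarray crate, which was the original goal in this issue.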

asfimport commented 3 years ago

Andrew Lamb / @alamb: Migrated to github: https://github.com/apache/arrow-rs/issues/55