apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Rust] Reading parquet file is slow #23112

Closed. asfimport closed this issue 3 years ago.

asfimport commented 4 years ago

Using the example at https://github.com/apache/arrow/tree/master/rust/parquet is slow.

The following snippet

use std::time::Instant;

use parquet::file::reader::{FileReader, SerializedFileReader};

// `file` is an already opened std::fs::File pointing at the parquet file.
let reader = SerializedFileReader::new(file).unwrap();
let mut iter = reader.get_row_iter(None).unwrap();
let start = Instant::now();
while let Some(_record) = iter.next() {}
let duration = start.elapsed();
println!("{:?}", duration);

runs for 17 seconds on a ~160 MB parquet file.

If there is a more efficient way to load a parquet file, it would be nice to add it to the readme.

P.S.: My goal is to construct an ndarray from the data; I'd be happy to get any tips.

Reporter: Adam Lippai / @alippai

Note: This issue was originally created as ARROW-6774. Please see the migration documentation for further details.

asfimport commented 4 years ago

Wes McKinney / @wesm: Row-by-row iteration is going to be slow compared with vectorized / column-by-column reads. This unfinished PR was related to this (I think?), but there are Arrow-based readers available that don't require it:

https://github.com/apache/arrow/pull/3461
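
For illustration, here is a minimal sketch of such a column-by-column read using the crate's low-level ColumnReader API. The column index, batch size, and the assumption of a required (non-nullable) Float64 column are placeholders, not from this issue, and exact signatures vary between parquet crate versions:

use std::fs::File;

use parquet::column::reader::ColumnReader;
use parquet::file::reader::{FileReader, RowGroupReader, SerializedFileReader};

// Scan a single Float64 column in large batches instead of materializing rows.
// Assumes the column is required (non-nullable); nullable columns also need
// definition-level buffers passed to read_batch.
fn read_double_column(file: File, column_index: usize) {
    let reader = SerializedFileReader::new(file).unwrap();
    for rg in 0..reader.metadata().num_row_groups() {
        let row_group = reader.get_row_group(rg).unwrap();
        if let ColumnReader::DoubleColumnReader(mut col) =
            row_group.get_column_reader(column_index).unwrap()
        {
            let mut values = vec![0f64; 8192];
            loop {
                // read_batch decodes up to 8192 values directly into `values`.
                let (values_read, _levels_read) =
                    col.read_batch(8192, None, None, &mut values).unwrap();
                if values_read == 0 {
                    break;
                }
            }
        }
    }
}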

asfimport commented 4 years ago

Adam Lippai / @alippai: I've seen some nice work in https://github.com/apache/arrow/blob/master/rust/parquet/src/column/reader.rs and https://github.com/apache/arrow/blob/master/rust/parquet/src/arrow/array_reader.rs, but I couldn't figure out how to use it. @liurenjie1024, can you perhaps help me?

asfimport commented 4 years ago

Renjie Liu / @liurenjie1024: This is part of a reader for reading parquet files into arrow arrays. It's almost complete, and we still have one PR (https://github.com/apache/arrow/pull/5523) waiting for review, which contains documentation and examples.
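
As a rough usage sketch, reading a file into Arrow record batches with that reader looks roughly like this (the names follow the ParquetFileArrowReader / get_record_reader API of the crate at that time; the batch size is arbitrary and import paths may differ in later releases):

use std::fs::File;
use std::rc::Rc;

use arrow::record_batch::RecordBatchReader;
use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

// Decode the file column-by-column into Arrow RecordBatches of 50k rows each.
fn read_into_record_batches(file: File) {
    let file_reader = SerializedFileReader::new(file).unwrap();
    let mut arrow_reader = ParquetFileArrowReader::new(Rc::new(file_reader));
    let mut batches = arrow_reader.get_record_reader(50_000).unwrap();
    while let Some(batch) = batches.next_batch().unwrap() {
        // Each batch holds one Arrow array per column.
        println!("read a batch with {} rows", batch.num_rows());
    }
}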

asfimport commented 4 years ago

Adam Lippai / @alippai: While it doesn't support reading the Utf8 type yet, dropping that column and then reading the same file takes less than 3 seconds! Thank you for the contribution. (Data shape: 10k rows * 3k Float64 columns.)

asfimport commented 4 years ago

Neville Dipale / @nevi-me: Hi @alippai, UTF8 types are now supported. Is the performance still a concern, or can we close this?

asfimport commented 3 years ago

Sietse Brouwer: I'm not sure what test data @alippai used, so I used a test data set with 500k rows and two columns:

In all three cases running the snippet took almost exactly 150 seconds, give or take one second.

Does that help you decide whether to close the question, @nevi-me? Or perhaps your comment, Adam, from 2019-10-07 used some other version to get that speed improvement? Should I change the test to use the ParquetFileArrowReader example in https://github.com/apache/arrow/blob/3fae71b10c42/rust/parquet/src/arrow/mod.rs#L25-L50, and then this issue can be closed if that one is faster?

asfimport commented 3 years ago

Adam Lippai / @alippai: [~sietsebb] I used the Arrow reader method; at that time it didn't support all the types I needed, but that was added later. It's definitely faster, though I don't remember the exact benchmark numbers.

asfimport commented 3 years ago

Sietse Brouwer: @alippai, I can't get parquet::arrow::ParquetFileArrowReader to be faster than parquet::file::reader::SerializedFileReader under commit 3fae71b10c42. Timings are below, the code is below that, and conclusions are at the bottom.

 

| n_rows | include utf8 column | reader | iteration unit (loop does not iterate over rows within batches) | time taken |
| --- | --- | --- | --- | --- |
| 50_000 | yes | ParquetFileArrowReader | 1 batch of 50k rows | 14.9s |
| 50_000 | yes | ParquetFileArrowReader | 10 batches of 5k rows | 14.8s |
| 50_000 | yes | ParquetFileArrowReader | 50k batches of 1 row | 24.0s |
| 50_000 | yes | SerializedFileReader | get_row_iter | 14.5s |
| 50_000 | no | ParquetFileArrowReader | 1 batch of 50k rows | 143ms |
| 50_000 | no | ParquetFileArrowReader | 10 batches of 5k rows | 154ms |
| 50_000 | no | ParquetFileArrowReader | 50k batches of 1 row | 6.5s |
| 50_000 | no | SerializedFileReader | get_row_iter | 211ms |



 

Here is the code I used to load the dataset with ParquetFileArrowReader (see also this version of main.rs):

 
fn read_with_arrow(file: File) -> () {
    let file_reader = SerializedFileReader::new(file).unwrap();
    let mut arrow_reader = ParquetFileArrowReader::new(Rc::new(file_reader));
    println!("Arrow schema is: {}", arrow_reader.get_schema().unwrap());
    let mut record_batch_reader = arrow_reader
        .get_record_reader(/* batch size */ 50000)
        .unwrap();

    let start = Instant::now();
    while let Some(_record) = record_batch_reader.next_batch().unwrap() {
        // no-op
    }
    let duration = start.elapsed();

    println!("{:?}", duration);
}
 

Main observations:
- We can't tell whether the slow loading when we include the UTF8 column is because UTF8 is slow to process, or because the column is very big (100 random Russian words per cell).
- When the big UTF-8 column is included, iterating over every row with SerializedFileReader is as fast as iterating over a few batches with ParquetFileArrowReader, even when you skip the rows within the batches! (A sketch of what touching those rows would look like follows this list.)
- Should I try this again with the earlier shape (10k rows * 3k Float64 columns) plus one small UTF-8 column?
- I'm not even sure what result I'm trying to reproduce or falsify here: whether adding a small UTF-8 column causes a disproportionate slowdown, or whether switching between SerializedFileReader and ParquetFileArrowReader causes the slowdown. Right now I feel like everything and nothing is in scope of the issue, and I wouldn't mind if somebody made it narrower and clearer.
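
To make the "rows within the batches" point concrete, here is a minimal sketch of what touching every value inside a batch could look like (it assumes column 0 is a Float64 column; the column index and the summation are purely illustrative):

use arrow::array::{Array, Float64Array};
use arrow::record_batch::RecordBatch;

// Visit every value of the first column of one RecordBatch (assumed Float64).
fn sum_first_column(batch: &RecordBatch) -> f64 {
    let col = batch
        .column(0)
        .as_any()
        .downcast_ref::<Float64Array>()
        .expect("column 0 is not Float64");
    let mut total = 0.0;
    for i in 0..col.len() {
        if !col.is_null(i) {
            total += col.value(i);
        }
    }
    total
}

Values read this way could also be collected into a Vec<f64> and handed to the ndarray crate, which was the original goal in this issue.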

asfimport commented 3 years ago

Andrew Lamb / @alamb: Migrated to github: https://github.com/apache/arrow-rs/issues/55