apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

[Parquet] Reading parquet file into an ndarray #53

Open alamb opened 3 years ago

alamb commented 3 years ago

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-6712

What's the best way to read a .parquet file into a rust ndarray structure?

Can it be done efficiently with the current API? I assume row iteration is not the best approach :)

I can imagine that even parallel column loading would be possible. 
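One column-oriented way to do this (a minimal sketch, not an official recipe) is to read the file through the Arrow reader and assemble the column values into an `ndarray::Array2`. This assumes the `parquet`, `arrow`, and `ndarray` crates, a hypothetical file `data.parquet`, and that every column is a non-nullable Float64:

```rust
use std::fs::File;

use arrow::array::Float64Array;
use ndarray::Array2;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn read_parquet_to_ndarray(path: &str) -> Result<Array2<f64>, Box<dyn std::error::Error>> {
    let file = File::open(path)?;
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;

    // Accumulate each column's values across record batches.
    let mut columns: Vec<Vec<f64>> = Vec::new();
    let mut rows = 0;
    for batch in reader {
        let batch = batch?;
        if columns.is_empty() {
            columns = vec![Vec::new(); batch.num_columns()];
        }
        rows += batch.num_rows();
        for (i, col) in batch.columns().iter().enumerate() {
            // Downcasting assumes Float64 columns; `values()` ignores nulls.
            let arr = col
                .as_any()
                .downcast_ref::<Float64Array>()
                .ok_or("non-Float64 column")?;
            columns[i].extend(arr.values().iter().copied());
        }
    }

    // The flattened buffer is column-major, so build (cols, rows) and
    // transpose to get the conventional (rows, cols) shape.
    let cols = columns.len();
    let flat: Vec<f64> = columns.into_iter().flatten().collect();
    Ok(Array2::from_shape_vec((cols, rows), flat)?.reversed_axes())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let arr = read_parquet_to_ndarray("data.parquet")?;
    println!("shape = {:?}", arr.shape());
    Ok(())
}
```

Because the batches arrive column-by-column already, this avoids per-row iteration entirely; the only copy is appending each batch's value buffer into the per-column `Vec`.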

Sach1nAgarwal commented 1 year ago

Parallel column reads do improve performance. I tested this by creating a separate ParquetRecordBatchStream<Reader> for each column and reading them all in parallel: throughput increased from 60 MB/s (with a single ParquetRecordBatchStream<Reader> per file) to 200 MB/s. The test files were each approximately 250 MB with 8 columns.
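The approach described above can be sketched as follows (a minimal illustration, assuming the `parquet` crate, a hypothetical `data.parquet`, and the synchronous `ParquetRecordBatchReaderBuilder` rather than the async `ParquetRecordBatchStream`): one reader per column, each restricted to a single leaf column with `ProjectionMask` and run on its own thread.

```rust
use std::fs::File;
use std::thread;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let path = "data.parquet";

    // Inspect the schema once to learn how many leaf columns there are.
    let probe = ParquetRecordBatchReaderBuilder::try_new(File::open(path)?)?;
    let num_columns = probe.parquet_schema().num_columns();

    let handles: Vec<_> = (0..num_columns)
        .map(|i| {
            let path = path.to_owned();
            thread::spawn(move || -> Result<usize, String> {
                // Each thread opens its own file handle and reads one column.
                let file = File::open(&path).map_err(|e| e.to_string())?;
                let builder = ParquetRecordBatchReaderBuilder::try_new(file)
                    .map_err(|e| e.to_string())?;
                let mask = ProjectionMask::leaves(builder.parquet_schema(), [i]);
                let reader = builder
                    .with_projection(mask)
                    .build()
                    .map_err(|e| e.to_string())?;
                let mut rows = 0;
                for batch in reader {
                    rows += batch.map_err(|e| e.to_string())?.num_rows();
                }
                Ok(rows)
            })
        })
        .collect();

    for h in handles {
        println!("rows read: {}", h.join().unwrap()?);
    }
    Ok(())
}
```

Note that each thread decodes its column independently, so the speedup comes from parallel decompression and decoding, at the cost of one file handle and one footer parse per column.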

Instead of creating multiple readers, can this be done with a single reader?

alamb commented 1 year ago

> Instead of creating multiple readers, can this be done with a single reader?

That is an interesting question @Sach1nAgarwal -- I think @tustvold has some ideas on parallelized decode but I am not sure how concrete they are