aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License
542 stars · 140 forks

Feature request: extend DataColumn API to read column values directly into a provided Span/Memory/Array #507

Open i-sinister opened 2 months ago

i-sinister commented 2 months ago

Issue description

I have a use case where I need to read rather large Parquet files: 5 GB to 50 GB, with 100 to 10,000 row groups of 1,000,000 to 20,000,000 rows each. Group sizes are limited and known beforehand, and groups can be processed independently. So I would like to preallocate the column value arrays once (or actually twice) and then read values from a column directly into the preallocated array/Span/Memory while iterating over the groups.

Pragmateek commented 1 month ago

Interesting, I have the exact same need to concatenate files: #515

aloneguid commented 4 weeks ago

It should be possible soon but needs some refactoring and possibly breaking changes. This library was created before Span existed ;)

i-sinister commented 4 weeks ago

The Rust crate for working with Parquet files has a really nice API for this (and loading data is also about 2 times faster :-)): https://docs.rs/parquet/51.0.0/parquet/column/reader/struct.GenericColumnReader.html#method.read_records

let mut values = vec![];
...
for group_index in 0..group_count {
    let group_reader = file_reader.get_row_group(group_index).unwrap();
    let group_metadata = metadata.row_group(group_index);
    let group_row_count = group_metadata.num_rows() as u64;
    if let Ok(ColumnReader::Int96ColumnReader(ref mut column_reader)) = group_reader.get_column_reader(0) {
        // Reuse the same buffer: clear() keeps capacity, read_records appends.
        values.clear();
        column_reader.read_records(group_row_count as usize, None, None, &mut values).unwrap();
    }
}
aloneguid commented 4 weeks ago

Usually when someone says "x times faster" it's a clickbait ;) Need to see performance measurement methodology and actual numbers.

Pragmateek commented 3 weeks ago

> Usually when someone says "x times faster" it's a clickbait ;) Need to see performance measurement methodology and actual numbers.

You mean like this one? ;)

aloneguid commented 3 weeks ago

Yeah, that one's wrong. ParquetSharp is actually slower, despite being a C++ wrapper. Notice the lack of any reference to benchmarking code, data set size, platform, etc. Maybe I should publish detailed numbers and also put them on the front page :)

aloneguid commented 3 weeks ago

Writing 1 million rows on Linux x64 with parquet.net vs parquetsharp:

| Method | DataType | Mean | Error | StdDev | Gen0 | Gen1 | Gen2 | Allocated |
|---|---|---|---|---|---|---|---|---|
| ParquetNet | Double | 17.423 ms | 10.6128 ms | 0.5817 ms | 187.5000 | 187.5000 | 187.5000 | 19463.68 KB |
| ParquetSharp | Double | 31.025 ms | 15.2193 ms | 0.8342 ms | 937.5000 | 937.5000 | 937.5000 | 19615.25 KB |
| ParquetNet | Int32 | 6.098 ms | 2.4644 ms | 0.1351 ms | 187.5000 | - | - | 774.65 KB |
| ParquetSharp | Int32 | 22.634 ms | 4.2477 ms | 0.2328 ms | 1000.0000 | 1000.0000 | 1000.0000 | 19900.72 KB |
| ParquetNet | Double? | 1.375 ms | 1.3348 ms | 0.0732 ms | 48.8281 | 1.9531 | - | 198.42 KB |
| ParquetSharp | Double? | 4.003 ms | 1.0387 ms | 0.0569 ms | 54.6875 | 7.8125 | - | 249.16 KB |
| ParquetNet | Int32? | 1.071 ms | 0.6018 ms | 0.0330 ms | 1.9531 | - | - | 14.11 KB |
| ParquetSharp | Int32? | 3.434 ms | 0.6685 ms | 0.0366 ms | 54.6875 | 15.6250 | - | 237.34 KB |

Basically Parquet.Net is on average 3 times faster and uses less RAM, often considerably less.

Pragmateek commented 3 weeks ago

> Yeah, that one's wrong. ParquetSharp is actually slower, despite being a C++ wrapper. Notice the lack of any reference to benchmarking code, data set size, platform, etc. Maybe I should publish detailed numbers and also put them on the front page :)

I was just kidding, no need to show off on the front page. 😅 For me the real killer feature is the bidirectional serialization, kind of Object Parquet Mapping.

Pragmateek commented 3 weeks ago

> Writing 1 million rows on Linux x64 with parquet.net vs parquetsharp: [benchmark table quoted above]
>
> Basically Parquet.Net is on average 3 times faster and uses less RAM, often considerably less.

Impressive results, keep it up! 👏

i-sinister commented 3 weeks ago

I was trying to say that the Rust version is faster than the .NET one, not comparing different C# implementations. One of the reasons is that the API allows reading into preallocated arrays and does not require GC allocations.

Here are the numbers I get when reading 4 columns (412M rows) from a 38 GB file with 34 columns, in 83 groups of 5M rows each, on Windows:

| version | duration |
|---|---|
| rust | read 412395458 rows in 83 groups in 12.44s |
| net8.0 | read 412395458 rows in 83 groups in 00:00:19.1050095 |