codename-hub / php-parquet

PHP implementation for reading and writing Apache Parquet files/streams

Reading Row Groups #9

Closed norberttech closed 2 years ago

norberttech commented 2 years ago

Hey! Based on this library, I'm trying to implement a Parquet adapter for Flow PHP. I started by writing a few tests (the row groups are that small just for testing purposes); you can find the code below:

Code Example

```php
// ... (beginning of the snippet not preserved)
$rowGroup = $writer->CreateRowGroup();
$rowGroup->WriteColumn(new DataColumn($id, [1, 2, 3, 4]));
$rowGroup->finish();
$writer->finish();

$writer = new ParquetWriter($schema, \fopen(__DIR__.'/test.parquet', 'a+'), null, true);
$rowGroup = $writer->CreateRowGroup();
$rowGroup->WriteColumn(new DataColumn($id, [5, 6, 7, 8]));
$rowGroup->finish();
$writer->finish();

$writer = new ParquetWriter($schema, \fopen(__DIR__.'/test.parquet', 'a+'), null, true);
$rowGroup = $writer->CreateRowGroup();
$rowGroup->WriteColumn(new DataColumn($id, [9, 10, 11, 12]));
$rowGroup->finish();
$writer->finish();

$writer = new ParquetWriter($schema, \fopen(__DIR__.'/test.parquet', 'a+'), null, true);
$rowGroup = $writer->CreateRowGroup();
$rowGroup->WriteColumn(new DataColumn($id, [13, 14, 15, 16]));
$rowGroup->finish();
$writer->finish();
```

But when I try to read the file with parquet-tools, I get the following error:

parquet-tools cat --json test.parquet
{"id":1}
{"id":2}
{"id":3}
{"id":4}
java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Socket is closed by peer.

I also tried to check the file content with Avro Parquet Viewer, and I'm getting this:

image

Any idea what might be wrong here? If you could point me in the right direction, I can debug this issue further. I'm not that familiar with the Parquet format, so any help is welcome.

Thanks for all your work to make parquet available in PHP!

Katalystical commented 2 years ago

I think you're using the wrong mode for appending (a+). As the PHP docs state, in this mode the read position is freely seekable, but writes are always appended at the end of the file. This way, you get an invalid Parquet file with multiple PAR1 headers.

In append mode, the footer (the essential part of the Parquet file) is removed and re-written when you finish the file.
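To make the role of the footer more concrete, here is a minimal sketch that inspects only the outer framing of a Parquet file: the `PAR1` magic bytes at both ends and the 4-byte little-endian footer length stored just before the trailing magic. The function name `describeParquetLayout` is hypothetical and not part of php-parquet; it does not parse the Thrift-encoded metadata itself.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: reports the outer framing of a Parquet file.
 * Layout assumed: "PAR1" | data | footer metadata | 4-byte LE length | "PAR1".
 */
function describeParquetLayout(string $path): array
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $size = fstat($handle)['size'];
    if ($size < 12) {
        fclose($handle);
        throw new RuntimeException('File is too small to be a Parquet file');
    }

    $head = fread($handle, 4);       // leading magic bytes
    fseek($handle, -8, SEEK_END);    // footer length + trailing magic
    $tail = fread($handle, 8);
    fclose($handle);

    // 'V' = unsigned 32-bit integer, little-endian
    $footerLength = unpack('V', substr($tail, 0, 4))[1];

    return [
        'startsWithMagic' => $head === 'PAR1',
        'endsWithMagic'   => substr($tail, 4) === 'PAR1',
        'footerLength'    => $footerLength,
        'footerOffset'    => $size - 8 - $footerLength,
    ];
}
```

A file that was extended by blindly appending whole Parquet payloads would still pass these checks, but everything between the first row group and the last footer would be unreachable from that footer, which is consistent with parquet-tools failing after the first four rows.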

You can simply use regular read-write mode without truncating (r+); the library will handle the append mechanism itself.
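The difference between the two modes can be reproduced with plain PHP file functions, independent of the library. The sketch below uses a hypothetical helper, `writeAtStart`, which seeks to offset 0 and writes a payload; the sample contents are made up for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: creates a temp file containing $initial, opens it with
 * the given fopen mode, seeks to offset 0, writes $payload, and returns the
 * resulting file contents.
 */
function writeAtStart(string $initial, string $payload, string $mode): string
{
    $path = tempnam(sys_get_temp_dir(), 'mode');
    file_put_contents($path, $initial);

    $handle = fopen($path, $mode);
    fseek($handle, 0);           // in a+ this only moves the read position
    fwrite($handle, $payload);   // in a+ this is always appended
    fclose($handle);

    $result = file_get_contents($path);
    unlink($path);

    return $result;
}

echo writeAtStart('PAR1data', 'XXXX', 'a+'), PHP_EOL; // PAR1dataXXXX
echo writeAtStart('PAR1data', 'XXXX', 'r+'), PHP_EOL; // XXXXdata
```

With a+, a writer cannot overwrite or remove the old footer, because the seek is ignored for writes. With r+, it can position itself anywhere in the existing file.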

I think I might include an additional safety check when trying to write to a file opened in an inappropriate mode.

Looking forward to what's coming! By the way, having seen your projects: the additional methods in my upcoming feature branch might help with writing memory-efficient code for interacting with Parquet files (stream-like reading, as far as possible), and the branch also brings full support for complex nested data schemas.

norberttech commented 2 years ago

Right 🤦 r+ solves the problem, thanks!

> I think I might include an additional safety check when trying to write to a file opened in an inappropriate mode.

Yeah, a+ and $append = true could throw an exception suggesting r+.
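Such a check could be sketched with `stream_get_meta_data()`, which reports the mode a stream was opened with. The function name `assertNotAppendMode` and the exception message are assumptions for illustration; this is not the library's actual implementation.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical guard: rejects streams opened in an append mode ("a", "a+"),
 * where writes always land at the end of the file regardless of fseek().
 *
 * @param resource $stream an open stream resource
 */
function assertNotAppendMode($stream): void
{
    $mode = stream_get_meta_data($stream)['mode'];

    if ($mode !== '' && $mode[0] === 'a') {
        throw new InvalidArgumentException(
            "Stream opened in mode '$mode' cannot be used for appending row groups; use 'r+' instead."
        );
    }
}

// Usage: an a+ handle is rejected, an r+ handle is accepted.
$path = tempnam(sys_get_temp_dir(), 'guard');
try {
    assertNotAppendMode(fopen($path, 'a+'));
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), PHP_EOL;
}
assertNotAppendMode(fopen($path, 'r+'));
unlink($path);
```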

> Btw. as I saw your projects: the additional methods in my upcoming feature branch might help writing memory-efficient code related to interacting with Parquet files (stream-like reading, as far as possible), as well as full support of complex nested data schemas.

Awesome, memory-efficient writes would be more than good to have! At this point, I'm planning to let users configure the number of DataFrame rows per row group. It's not ideal, but in case of high memory consumption, reducing this number should help.

Unless you can think of a better way to control the size of each row group?
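The rows-per-row-group setting described above could look roughly like the following sketch. It buffers at most a fixed number of rows before handing them off, so peak memory is bounded by the group size rather than the dataset size. The function name `chunkRows` is hypothetical, and the actual writing of each group through the library is left out.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: lazily splits any iterable of rows into arrays of at
 * most $rowsPerGroup rows, one array per intended row group.
 */
function chunkRows(iterable $rows, int $rowsPerGroup): Generator
{
    if ($rowsPerGroup < 1) {
        throw new InvalidArgumentException('rowsPerGroup must be at least 1');
    }

    $buffer = [];
    foreach ($rows as $row) {
        $buffer[] = $row;
        if (count($buffer) === $rowsPerGroup) {
            yield $buffer;   // one full row group
            $buffer = [];
        }
    }

    if ($buffer !== []) {
        yield $buffer;       // final, possibly smaller row group
    }
}

// Usage: 10 rows with 4 rows per group gives groups of 4, 4, and 2 rows.
foreach (chunkRows(range(1, 10), 4) as $index => $group) {
    echo "row group $index: ", implode(', ', $group), PHP_EOL;
}
```

Each yielded array would then be written as one row group. A row count is only a rough proxy for memory use, since row width varies.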