codename-hub / php-parquet

PHP implementation for reading and writing Apache Parquet files/streams

ParquetReader - Get Value #4

Closed noxify closed 2 years ago

noxify commented 2 years ago

Hi,

thanks for this package. I'm currently testing Parquet libraries in several languages to see which one could replace our current Flask implementation.

It seems that your package has no problems reading our Parquet files (the TS Parquet package seems to support only Parquet 2.0).

I quickly created a Laravel app to test it (I have some other ideas, but the main feature should work before I start developing ;) ).

The snippet is the following:

    $parquetPath = Storage::path('path/to/parquetfile.parquet');

    $parquetStream = fopen($parquetPath, 'r');

    $parquetReader = new ParquetReader($parquetStream);

    $dataFields = $parquetReader->schema->GetDataFields();

    $result = [];

    for ($i = 0; $i < $parquetReader->getRowGroupCount(); $i++) {
        // create row group reader
        $groupReader = $parquetReader->OpenRowGroupReader($i);
        // read all columns inside each row group (you have the option to read
        // only the required columns if you need to)
        $columns = [];
        foreach ($dataFields as $field) {
            $column = $groupReader->ReadColumn($field);
            $columns[$column->getField()->name] = $column->getData();
        }

        $result[] = $columns;
    }

    dd($result);

$result shows me the correct columns, but the column values I get via getData() are always binary.

I checked the code, but wasn't able to find the part that converts them back to readable values.

Maybe you could give me a hint on how to do this.

Thanks!

noxify commented 2 years ago

It seems it was too late yesterday.

I checked the result again and yeah... I was already getting the data in the correct format.

Here is the updated snippet, which returns the correct array of objects:

    $parquetPath = Storage::path('path/to/parquetfile.parquet');
    $parquetStream = fopen($parquetPath, 'r');

    $parquetReader = new ParquetReader($parquetStream);

    $dataFields = $parquetReader->schema->GetDataFields();

    $result = [];

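    // number of rows already collected from earlier row groups
    $rowOffset = 0;

    for ($i = 0; $i < $parquetReader->getRowGroupCount(); $i++) {
        // create row group reader
        $groupReader = $parquetReader->OpenRowGroupReader($i);
        $rowsInGroup = 0;

        // read all columns inside each row group (you have the option to read
        // only the required columns if you need to)
        foreach ($dataFields as $field) {
            $column = $groupReader->ReadColumn($field);
            $columnData = $column->getData();
            $rowsInGroup = count($columnData);
            foreach ($columnData as $di => $value) {
                // offset the index so rows from later row groups are appended
                // instead of overwriting rows from earlier ones
                $result[$rowOffset + $di][$column->getField()->name] = $value;
            }
        }

        $rowOffset += $rowsInGroup;
    }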

    return $result;
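
For anyone who wants to try the column-to-row reshaping without a Parquet file at hand, here is a minimal sketch using plain arrays. It assumes flat (non-nested) columns in which every column of a row group has the same number of values. `columnsToRows` is a hypothetical helper, not part of php-parquet, and the input mimics the shape the first snippet produced (one array of `column name => values` per row group).

```php
<?php
// Hypothetical helper (not part of php-parquet): converts column-oriented
// data, grouped by row group, into one flat list of rows.
// Assumption: flat columns with equal value counts within a row group.
function columnsToRows(array $rowGroups): array
{
    $rows = [];
    $rowOffset = 0; // rows already emitted by earlier row groups

    foreach ($rowGroups as $columns) {
        $rowsInGroup = 0;
        foreach ($columns as $name => $values) {
            $values = array_values($values); // normalise to 0-based indexes
            $rowsInGroup = max($rowsInGroup, count($values));
            foreach ($values as $i => $value) {
                // the offset keeps later row groups from overwriting earlier rows
                $rows[$rowOffset + $i][$name] = $value;
            }
        }
        $rowOffset += $rowsInGroup;
    }

    return $rows;
}

// Made-up sample data: two row groups with the columns "id" and "name".
$rowGroups = [
    ['id' => [1, 2], 'name' => ['a', 'b']],
    ['id' => [3],    'name' => ['c']],
];

$rows = columnsToRows($rowGroups);
echo json_encode($rows), PHP_EOL;
// prints [{"id":1,"name":"a"},{"id":2,"name":"b"},{"id":3,"name":"c"}]
```

Without the offset, the second row group would write to index 0 again and replace the first row, so the offset only matters for files with more than one row group.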