codename-hub / php-parquet

PHP implementation for reading and writing Apache Parquet files/streams

Issue when readColumn in Parquet file with large amount of data #6

Closed padi-pm-dungnt closed 2 years ago

padi-pm-dungnt commented 2 years ago

Hi, thanks for your great library. It works well with small Parquet files, but when I tried to read data from a Parquet file with ~500k rows, the array of values returned by ReadColumn(...)->getData() is incorrect.

Here is my parquet file: https://dev-sc2-pn.s3.ap-northeast-1.amazonaws.com/sc2_area_master+(3).parquet

My Parquet file has only 92 rows with project_id = '123456789012345678', but when I get the data from the column via getData(), it returns more than 300k rows with this project_id.

Here is my sample code. Do you have any idea about this issue?

// open parquet file reader
$parquetReader = new ParquetReader($fileStream);

// get file schema (available straight after opening parquet reader)
// however, get only data fields as only they contain data values
$dataFields = $parquetReader->schema->GetDataFields();

// enumerate through row groups in this file
for ($i = 0; $i < $parquetReader->getRowGroupCount(); $i++) {
    // create row group reader
    $groupReader = $parquetReader->OpenRowGroupReader($i);
    $rowCount = $groupReader->getRowCount();

    // read all columns inside each row group (you have the option to read
    // only the required columns if you need to)
    $columns = [];
    foreach ($dataFields as $field) {
        $columns[] = @$groupReader->ReadColumn($field);
    }

    // $data member, accessible through ->getData() contains an array of column data
    $projectIds = $columns[0]->getData();

    dd($columns[0]->getData(0));
}
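
To make the mismatch described above easy to check, one option is to count how many entries in the returned array equal the expected project_id and compare that with the 92 rows the file is known to contain. This is a minimal sketch: `countMatches` is a hypothetical helper, not part of php-parquet, and the sample array only stands in for the result of `$columns[0]->getData()`.

```php
<?php
/**
 * Count how many entries of a column's data array equal a given value.
 * Hypothetical helper for debugging; not part of php-parquet.
 */
function countMatches(array $values, string $needle): int
{
    $count = 0;
    foreach ($values as $value) {
        // Cast to string and compare strictly so that no loose
        // numeric comparison is involved.
        if ((string) $value === $needle) {
            $count++;
        }
    }
    return $count;
}

// Stand-in data; in the script above this would be $columns[0]->getData().
$projectIds = ['123456789012345678', '999', '123456789012345678', '42'];

echo countMatches($projectIds, '123456789012345678'), PHP_EOL; // prints 2
```

Running the same count against the real column data should report 92 if the column is decoded correctly.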

Katalystical commented 2 years ago

Hi there, thanks for the report. Just a short message to let you know I've taken notice. I think it is indeed an issue and I'm investigating; I'd be thankful for any hints. I suspect it has to do with RLE reading or a wrong offset. No ETA, as I'm maintaining this in my spare time.

padi-pm-dungnt commented 2 years ago

Thanks for your reply. I think the cause is a wrong offset, as you said. Maybe something goes wrong in some of these functions:

Katalystical commented 2 years ago

@padi-pm-dungnt I've identified and fixed the issue and I'm preparing a new release. It was a PHP syntax misinterpretation with parentheses.
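
The thread does not show the actual change, so the following is only a generic, hypothetical illustration of how missing parentheses can silently change the meaning of a PHP expression in decoding code. It assumes the Parquet RLE/bit-packing hybrid header layout (the lowest bit selects the mode, the remaining bits carry a count); the variable names are invented and this is not the library's code.

```php
<?php
// Hypothetical illustration only; none of these names come from php-parquet.
// In Parquet's RLE/bit-packing hybrid encoding, the lowest header bit selects
// the mode (0 = RLE run, 1 = bit-packed) and the remaining bits hold a count.
$header = 0b1010; // lowest bit is 0, so this header marks an RLE run

// == binds tighter than &, so this parses as $header & (1 == 0),
// i.e. $header & false, which is always int(0) regardless of $header.
$isRleWithoutParens = $header & 1 == 0;

// Explicit parentheses give the intended test of the lowest bit.
$isRleWithParens = ($header & 1) === 0;

var_dump($isRleWithoutParens); // falsy: the RLE run would be misread
var_dump($isRleWithParens);    // true: the run is recognized as RLE
```

A slip of this kind raises no error or warning, which would be consistent with a bug that only shows up as wrong values on larger files.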

I'd like to strip down the test file to contain just one column (project_id) and include it as a test case in this project. Please confirm that you are authorized to give permission and that you allow me to use the relevant data from the Parquet file you provided.

padi-pm-dungnt commented 2 years ago

@Katalystical Thanks for the update. I'm looking forward to the new release.