flow-php / flow

Flow PHP - data processing framework
https://flow-php.com
MIT License
404 stars 23 forks source link

AWS Redshift parquet file: handle negative numbers #1095

Closed sergio-paternoster closed 3 weeks ago

sergio-paternoster commented 3 weeks ago

0000_part_00.zip Hi Norbert,

in the attached parquet file (from AWS Redshift) you now fixed the Data issue. Thank you so much! I now see a possible issue with negative numbers. As you can see from this code, "yeardayoffset" is 4294967294 while is supposed to be -2. By the way, 4294967294 appears to be the max 32-bit integer minus 1. Could this be related?

<?php

use Flow\Parquet\Reader;

require __DIR__ . '/../../vendor/autoload.php';

$reader = new Reader();

foreach ($reader->read(__DIR__ . '/0000_part_00.parquet')->values() as $row) {
    var_dump(\json_encode($row));die;
    die;
}

{"year":2014,"yearreportdate":{"date":"2014-06-06 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"prioryearreportdate":{"date":"2013-06-08 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"yeardayoffset":4294967294,"ltmstartdate":{"date":"2013-06-09 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"ltmenddate":{"date":"2014-06-06 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"currentfiscalyear":2024,"currentfiscalquarter":2,"currentfiscalquarterlabel":"24 Q2","currentfiscalquarterstartdate":{"date":"2024-03-31 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"currentfiscalquarterenddate":{"date":"2024-06-29 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"currentfiscalmonth":6,"currentfiscalquarterday":69,"currentfiscalquarterweek":10,"currentfiscalweek":23,"currentreportdate":{"date":"2024-06-07 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"currentdayofyear":159,"yesterdaydate":{"date":"2024-06-07 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"yesterdaydayofyear":159,"priorfiscalyear":2023,"priorfiscalyearquarterlabel":"23 Q2","priorfiscalquarterlabel":"24 Q1","nextfiscalquarterlabel":"24 Q3","cmrogflag":"N","currentforecastscenario":""}"

The parquet metadata of this column is:

 {
        "Path": "yeardayoffset",
        "Type": "INT32",
        "TypeLength": null,
        "LogicalType": null,
        "RepetitionType": "OPTIONAL",
        "ConvertedType": "INT_16",
        "Children": []
      }
norberttech commented 3 weeks ago

thanks for reporting, it looks that reader is reading that column as INT32 but due to converted type it should actually convert (cut) it to in16, I will take a look into this shortly

norberttech commented 3 weeks ago

This one turned out to be 2 issues:

Once this is merged you should get correct numbers

sergio-paternoster commented 3 weeks ago

Wow! I see you had to modify 5 files and add an additional one. Impressed! I tested it and it works fine!! Thank you!!