flow-php / flow

Flow PHP - data processing framework
https://flow-php.com
MIT License
404 stars 23 forks source link

lib-parquet: how to convert numeric dates #1093

Closed sergio-paternoster closed 3 weeks ago

sergio-paternoster commented 3 weeks ago

Not really an issue but more a simple request for clarifications. I have a parquet file that contains several date columns (in Y-m-d format). When I loop through the rows, I see these dates coming up as integer values (for example, 16227 instead of 2014-06-06). Is there a way to convert them automatically? In case not, how can I correctly visualize the dates when I read the file? I tried the following but it looks not efficient at all (especially for files with millions of records). Thank you in advance!

$reader          = new Reader();
$file            = $reader->read("0000_part_00.parquet");
$schema         = $file->schema();

$column_metadata = [];
foreach ($schema->columnsFlat() as $column)
{
    $column_name                                            = $column->flatPath();
    $column_metadata[$column_name]['is_date']               = $column->convertedType()->value === ConvertedType::DATE->value;
}

// convert numeric date to formatted date
function convert_date(int $numeric_date, $format = 'Y-m-d') : string
{
    $timestamp  = $numeric_date * 86400;
    $date       = new DateTime("@$timestamp");
    return     $date->format($format);
}

// loop through all rows in the parquet file
foreach ($file->values() as $row)
{
    foreach($row as $column_name => $column_value)
    {
        echo "\n" . $column_name . " -> " . ($column_metadata[$column_name]['is_date'] ? convert_date($column_value) : $column_value);
    }
}
norberttech commented 3 weeks ago

hey! DateTimes should be automatically converted by the library to DateTimeImmutable objects, integers are just how parquet stores them. Let me ask you few questions first:

bin/parquet.php read:metadata path/to_your/file.parquet --columns --page-headers

and past me the output here?

This should give me a better view of your file structure so I might be able to identify the issue.

sergio-paternoster commented 3 weeks ago

I apologize, I had to specify it before. The parquet file comes from Amazon Redshift through the UNLOAD command. FYI, the mukunku/ParquetViewer sees the dates correctly formatted.

Please, find here attached the whole parquet file. It's very small. I use PHP 8.2.16 (cli) with snappy on Linux Ubuntu Server 22.0.4. Thank you!

0000_part_00.zip

norberttech commented 3 weeks ago

Hey, so it seems that Amazon Redshift is saving datatetimes just as INT32 with converted type Flow\Parquet\ParquetFile\Schema\ConvertedType::DATE.

Support for Converted Type according to parquet format is deprecated and afaik when I was working on this implementation I wasn't sure if this will be needed or not.

Give me a day or two and I will bring a support for converted type date to recognize them as date time objects properly.

Thanks for bringing this up and providing everything I needed to quickly identify the issue, it's a great contribution to this library!

norberttech commented 3 weeks ago

It was easier than I expected, once this is merged it should properly return you DateTimeImmutable objects

norberttech commented 3 weeks ago

It seems to be working properly now:

<?php

use Flow\Parquet\Reader;

require __DIR__ . '/../../vendor/autoload.php';

$reader = new Reader();

foreach ($reader->read(__DIR__ . '/0000_part_00.parquet')->values() as $row) {
    var_dump(\json_encode($row));die;
    die;
}

output (in json)

{"year":2014,"yearreportdate":{"date":"2014-06-06 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"prioryearreportdate":{"date":"2013-06-08 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"yeardayoffset":4294967294,"ltmstartdate":{"date":"2013-06-09 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"ltmenddate":{"date":"2014-06-06 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"currentfiscalyear":2024,"currentfiscalquarter":2,"currentfiscalquarterlabel":"24 Q2","currentfiscalquarterstartdate":{"date":"2024-03-31 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"currentfiscalquarterenddate":{"date":"2024-06-29 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"currentfiscalmonth":6,"currentfiscalquarterday":69,"currentfiscalquarterweek":10,"currentfiscalweek":23,"currentreportdate":{"date":"2024-06-07 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"currentdayofyear":159,"yesterdaydate":{"date":"2024-06-07 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"yesterdaydayofyear":159,"priorfiscalyear":2023,"priorfiscalyearquarterlabel":"23 Q2","priorfiscalquarterlabel":"24 Q1","nextfiscalquarterlabel":"24 Q3","cmrogflag":"N","currentforecastscenario":""}"

I'm closing this issue, in case of any problems don't hesitate to reach out again!

sergio-paternoster commented 3 weeks ago

It works! You're a fine gentleman, thank you! I'll keep testing Amazon Parquet files and come back in case of other questions.