Closed sergio-paternoster closed 5 months ago
hey! DateTimes should be automatically converted by the library to DateTimeImmutable objects, integers are just how parquet stores them. Let me ask you few questions first:
composer require flow-php/parquet-viewer
to your project and then use it in a following way: bin/parquet.php read:metadata path/to_your/file.parquet --columns --page-headers
and past me the output here?
This should give me a better view of your file structure so I might be able to identify the issue.
I apologize, I had to specify it before. The parquet file comes from Amazon Redshift through the UNLOAD command. FYI, the mukunku/ParquetViewer sees the dates correctly formatted.
Please, find here attached the whole parquet file. It's very small. I use PHP 8.2.16 (cli) with snappy on Linux Ubuntu Server 22.0.4. Thank you!
Hey, so it seems that Amazon Redshift is saving datatetimes just as INT32 with converted type Flow\Parquet\ParquetFile\Schema\ConvertedType::DATE
.
Support for Converted Type according to parquet format is deprecated and afaik when I was working on this implementation I wasn't sure if this will be needed or not.
Give me a day or two and I will bring a support for converted type date to recognize them as date time objects properly.
Thanks for bringing this up and providing everything I needed to quickly identify the issue, it's a great contribution to this library!
It was easier than I expected, once this is merged it should properly return you DateTimeImmutable objects
It seems to be working properly now:
<?php
use Flow\Parquet\Reader;
require __DIR__ . '/../../vendor/autoload.php';
$reader = new Reader();
foreach ($reader->read(__DIR__ . '/0000_part_00.parquet')->values() as $row) {
var_dump(\json_encode($row));die;
die;
}
output (in json)
{"year":2014,"yearreportdate":{"date":"2014-06-06 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"prioryearreportdate":{"date":"2013-06-08 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"yeardayoffset":4294967294,"ltmstartdate":{"date":"2013-06-09 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"ltmenddate":{"date":"2014-06-06 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"currentfiscalyear":2024,"currentfiscalquarter":2,"currentfiscalquarterlabel":"24 Q2","currentfiscalquarterstartdate":{"date":"2024-03-31 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"currentfiscalquarterenddate":{"date":"2024-06-29 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"currentfiscalmonth":6,"currentfiscalquarterday":69,"currentfiscalquarterweek":10,"currentfiscalweek":23,"currentreportdate":{"date":"2024-06-07 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"currentdayofyear":159,"yesterdaydate":{"date":"2024-06-07 00:00:00.000000","timezone_type":3,"timezone":"UTC"},"yesterdaydayofyear":159,"priorfiscalyear":2023,"priorfiscalyearquarterlabel":"23 Q2","priorfiscalquarterlabel":"24 Q1","nextfiscalquarterlabel":"24 Q3","cmrogflag":"N","currentforecastscenario":""}"
I'm closing this issue, in case of any problems don't hesitate to reach out again!
It works! You're a fine gentleman, thank you! I'll keep testing Amazon Parquet files and come back in case of other questions.
Not really an issue but more a simple request for clarifications. I have a parquet file that contains several date columns (in Y-m-d format). When I loop through the rows, I see these dates coming up as integer values (for example, 16227 instead of 2014-06-06). Is there a way to convert them automatically? In case not, how can I correctly visualize the dates when I read the file? I tried the following but it looks not efficient at all (especially for files with millions of records). Thank you in advance!