Closed japaveh closed 8 months ago
From a first, quick look, I guess you should not be relying on timezones for pure dates (w/o time portion). Information is likely to get lost.
The parquet spec states: DATE is used to for a logical date type, without a time of day. It must annotate an int32 that stores the number of days from the Unix epoch, 1 January 1970.
Your example of 2022-10-01+02:00 is in fact the same as 2022-09-30+00:00 and therefore, is 19265 full unix days since 1970-01-01.
From running a quick test with it, this case is inducing challenges in unit tests, because the existing comparisons/assertions do not cover date-level comparisons and the regular PHP date diff mechanics usually try to incorporate hours, minutes, etc. Those lead to a difference of 22 hours in your case, because hours, etc. will get stripped off/set to 0 after doing a parquet write-read.
I don't see a real issue here, except possibly adding a remark/usage note. Feel free to re-open if I didn't get your point correctly.
First of all, I am a very happy user of this code package, so thanks for the effort.
However, I figured out that the DateTimeDataField mis-calculates the date value in case a PHP DateTime object is given which is not UTC. For example in my dataset I have a DateTime field:
Then the amount of unix days calculated is (by dumping https://github.com/codename-hub/php-parquet/blob/master/src/data/concrete/DateTimeOffsetDataTypeHandler.php#L161)
19265
But this value leads to 2022-09-30 in the parquet result, which is one day off
When I convert the date to a UTC time it is correct.
19266
Now the dates in the parquet files are also correct
This is the root cause of the issue, i guess the code takes the diff with 2 times of 2 timezones but PHP internally matches the timezones first and then the data shifts 1 day when moving from CEST to UTC. I tried to reproduce the issue here:
https://3v4l.org/24iPo#v8.2.6
It is visible that the reconstructured data is 1 day earlier due to the date offset.