codename-hub / php-parquet

PHP implementation for reading and writing Apache Parquet files/streams
Other
58 stars 8 forks source link

DateTimeDataField subtracts 1 day in case of non UTC date #21

Closed japaveh closed 8 months ago

japaveh commented 9 months ago

First of all, I am a very happy user of this code package, so thanks for the effort.

However, I figured out that the DateTimeDataField mis-calculates the date value in case a PHP DateTime object is given which is not UTC. For example in my dataset I have a DateTime field:

^ DateTimeImmutable @1664575200 {#861
  date: 2022-10-01 00:00:00.0 Europe/Amsterdam (+02:00)
}

Then the amount of unix days calculated is (by dumping https://github.com/codename-hub/php-parquet/blob/master/src/data/concrete/DateTimeOffsetDataTypeHandler.php#L161)

19265

But this value leads to 2022-09-30 in the parquet result, which is one day off

When I convert the date to a UTC time it is correct.

^ DateTimeImmutable @1664582400 {#860
  date: 2022-10-01 00:00:00.0 UTC (+00:00)
}

19266

Now the dates in the parquet files are also correct

This is the root cause of the issue, i guess the code takes the diff with 2 times of 2 timezones but PHP internally matches the timezones first and then the data shifts 1 day when moving from CEST to UTC. I tried to reproduce the issue here:

https://3v4l.org/24iPo#v8.2.6

It is visible that the reconstructured data is 1 day earlier due to the date offset.

Katalystical commented 8 months ago

From a first, quick look, I guess you should not be relying on timezones for pure dates (w/o time portion). Information is likely to get lost.

The parquet spec states: DATE is used to for a logical date type, without a time of day. It must annotate an int32 that stores the number of days from the Unix epoch, 1 January 1970.

Your example of 2022-10-01+02:00 is in fact the same as 2022-09-30+00:00 and therefore, is 19265 full unix days since 1970-01-01.

From running a quick test with it, this case is inducing challenges in unit tests, because the existing comparisons/assertions do not cover date-level comparisons and the regular PHP date diff mechanics usually try to incorporate hours, minutes, etc. Those lead to a difference of 22 hours in your case, because hours, etc. will get stripped off/set to 0 after doing a parquet write-read.

I don't see a real issue here, except possibly adding a remark/usage note. Feel free to re-open if I didn't get your point correctly.