apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

Reading parquet files with timestamp column containing 9999-12-23 23:59:59 yields 1816-03-22 05:56:07.066277376 #44112

Open matthiasgomolka opened 2 months ago

matthiasgomolka commented 2 months ago

Describe the bug, including details regarding any error messages, version, and platform.

I've stumbled upon a weird issue, and I don't understand the underlying cause.

I read a parquet file which contains a timestamp column. This timestamp column contains the value 9999-12-23 23:59:59. When I read this file using pyarrow (or with pandas using the pyarrow engine and dtype_backend), the rows with 9999-12-23 23:59:59 show the value 1816-03-22 05:56:07.066277376.

I'm pretty certain that 9999-12-23 23:59:59 is the correct value, because this is much more plausible (and that's what duckdb and Impala say as well).

When I write the respective row to parquet using duckdb and read this file using pyarrow, I get the correct value of 9999-12-23 23:59:59.

I've already checked if this is a problem with the parquet version, but both files are version 1.0. What else might cause this?

Unfortunately, I can't share the parquet file in question because it contains confidential data.

Component(s)

Python

Michael-J-Ward commented 1 month ago

I came across your issue while researching my own timestamp[s] issue.

I suspect your issue stems from the same thing - parquet does not have a seconds timestamp type.

https://github.com/apache/arrow/issues/41382#issuecomment-2078658637

matthiasgomolka commented 1 month ago

I'm not sure. I mean, other parquet readers handle the identical file just fine.
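
For what it's worth, the two values in the report are related by plain arithmetic. If 9999-12-23 23:59:59 is converted to nanoseconds since the Unix epoch and stored in a signed 64-bit integer, it overflows and wraps around to exactly 1816-03-22 05:56:07.066277376. The sketch below only checks that arithmetic with the standard library. It does not read any parquet file, and the helper name `wrap_int64` is made up for illustration. Whether the file in question really goes through a nanosecond conversion (for example because the timestamps are stored as INT96, as Impala commonly writes them) is an assumption that can't be confirmed without the file.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def wrap_int64(n: int) -> int:
    """Reduce a Python int to a signed 64-bit value (two's complement wraparound).

    Hypothetical helper for illustration; not part of pyarrow.
    """
    return (n + 2**63) % 2**64 - 2**63


# The value the file is expected to contain.
expected = datetime(9999, 12, 23, 23, 59, 59)

# Whole seconds since the epoch, then scaled to nanoseconds.
seconds = (expected - EPOCH) // timedelta(seconds=1)
nanos = seconds * 1_000_000_000

# int64 can only hold about +/- 292 years in nanoseconds, so this wraps.
wrapped = wrap_int64(nanos)

# datetime only has microsecond resolution, so keep the last 3 digits separately.
wrapped_us, rem_ns = divmod(wrapped, 1000)
observed = EPOCH + timedelta(microseconds=wrapped_us)

print(f"{observed.isoformat(sep=' ')}{rem_ns:03d}")
# → 1816-03-22 05:56:07.066277376
```

If that is what is happening here, reading with a coarser unit might avoid the overflow. `pyarrow.parquet.read_table` has a `coerce_int96_timestamp_unit` argument (e.g. `"us"`) for INT96 columns, but that only helps if the column really is INT96, which is a guess on my part.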