SlideRuleEarth / sliderule

Server and client framework for on-demand science data processing in the cloud
https://slideruleearth.io

pyarrow casting error when saving/writing parquet file generated by sliderule #384

Closed elidwa closed 4 months ago

elidwa commented 6 months ago
import pyarrow.parquet as pq

in_file = "atl06.parquet"
out_file = "atl06modified.parquet"

table = pq.read_table(in_file)
...
pq.write_table(table, out_file)

Writing the table generates an error:

pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 1552556446894008576
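
For context, the cast appears to fail because the timestamp in the message is not a whole number of microseconds, so converting it from ns to us would drop the trailing nanoseconds. A minimal check in plain Python, using only the value from the error above (the variable names are illustrative):

```python
# Timestamp taken from the error message (nanoseconds since the Unix epoch).
ts_ns = 1552556446894008576

# Going from timestamp[ns] to timestamp[us] means dividing by 1000;
# any non-zero remainder is precision that would be lost.
ts_us, leftover_ns = divmod(ts_ns, 1000)

print(ts_us)        # 1552556446894008
print(leftover_ns)  # 576

# A safe cast has to reject the value when the remainder is non-zero.
would_lose_data = leftover_ns != 0
print(would_lose_data)  # True
```

The 576 leftover nanoseconds are what the safe cast refuses to discard.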

Setting use_deprecated_int96_timestamps=True is a workaround:

pq.write_table(table, out_file, use_deprecated_int96_timestamps=True)

From docs:

Setting use_deprecated_int96_timestamps=True when writing the Parquet file tells PyArrow to store timestamp data using the INT96 data type, which can represent nanosecond precision and is often used for compatibility with Apache Spark.

We need ns precision for time, so casting the timestamps down to microseconds is not an option. Not sure if this can be fixed in the table itself; documenting the workaround here for now.