Closed T-Man-Stan closed 3 years ago
@T-Man-Stan, here's the current state of things:
flavor='spark'
stuff in postprocessing.py.I think to "fix" this we remove the flavor='spark'
and tell everyone to use Athena engine v2.
In the mean time, your results will work if you just switch your workgroup to use Athena engine v2
Describe the bug
The year portion of the datetime string (e.g., "2012" in "2012-07-03") is being incorrectly encoded (e.g., "+43969-07-03") when dataframes are written as parquet files during postprocessing of buildstockbatch runs. These tables are written as parquet files and uploaded to S3/Athena in AWS.
When data is subsequently queried in the EEDR workflow the parser in the dateutil package barfs and produces the following error: "ParserError: Unknown string format: +43969-07-03"
Error appears to stem from when the "write_dataframe_as_parquet(df, fs, filename)" function is called in postprocessing.py. It seems that pyarrow changed the arguments for the "pq.write_table()" function (see links below).
https://arrow.apache.org/docs/python/parquet.html#storing-timestamps https://github.com/NREL/buildstockbatch/blob/0d5feb962b2bfb2a226e286e6d09600e92795e3e/buildstockbatch/postprocessing.py#L147-L150
Upgrading the Athena engine for our working group (eedr) in the AWS console "fixed" the error (i.e., "+43969-07-03" is now encoded as "2012-07-03"), however, it has caused another error in our querying that we're still looking into. Regardless, the code in the postprocessing.py script might need to be updated anyways.
Possible solution - something like:
pq.write_table(table, where, coerce_timestamps='ms', allow_truncated_timestamps=True)
OR
[from https://arrow.apache.org/docs/python/parquet.html#storing-timestamps]
"Older Parquet implementations use INT96 based storage of timestamps, but this is now deprecated. This includes some older versions of Apache Impala and Apache Spark. To write timestamps in this format, set the use_deprecated_int96_timestamps option to True in write_table"
To Reproduce I didn't run this workflow so I'm not totally sure, however, using the newest version of BSB and running the postprocessing script with test data will likely re-produce the behavior and this can then be viewed in Athena. @mleachNREL
Expected behavior "+43969-07-03" should be encoded as "2012-07-03" in the time column of our data for all rows and tables.
Platform (please complete the following information):