NREL / buildstockbatch

Other
22 stars 14 forks source link

BSB to Athena Datetime Encoding Corrupted #221

Closed T-Man-Stan closed 3 years ago

T-Man-Stan commented 3 years ago

Describe the bug

The year portion of the datetime string (e.g., "2012" in "2012-07-03") is being incorrectly encoded (e.g., "+43969-07-03") when dataframes are written as parquet files during postprocessing of buildstockbatch runs. These tables are written as parquet files and uploaded to S3/Athena in AWS.

When data is subsequently queried in the EEDR workflow the parser in the dateutil package barfs and produces the following error: "ParserError: Unknown string format: +43969-07-03"

Error appears to stem from when the "write_dataframe_as_parquet(df, fs, filename)" function is called in postprocessing.py. It seems that pyarrow changed the arguments for the "pq.write_table()" function (see links below).

https://arrow.apache.org/docs/python/parquet.html#storing-timestamps https://github.com/NREL/buildstockbatch/blob/0d5feb962b2bfb2a226e286e6d09600e92795e3e/buildstockbatch/postprocessing.py#L147-L150

Upgrading the Athena engine for our working group (eedr) in the AWS console "fixed" the error (i.e., "+43969-07-03" is now encoded as "2012-07-03"), however, it has caused another error in our querying that we're still looking into. Regardless, the code in the postprocessing.py script might need to be updated anyways.

Possible solution - something like:

pq.write_table(table, where, coerce_timestamps='ms', allow_truncated_timestamps=True)

OR

[from https://arrow.apache.org/docs/python/parquet.html#storing-timestamps]

"Older Parquet implementations use INT96 based storage of timestamps, but this is now deprecated. This includes some older versions of Apache Impala and Apache Spark. To write timestamps in this format, set the use_deprecated_int96_timestamps option to True in write_table"

To Reproduce I didn't run this workflow so I'm not totally sure, however, using the newest version of BSB and running the postprocessing script with test data will likely re-produce the behavior and this can then be viewed in Athena. @mleachNREL

Expected behavior "+43969-07-03" should be encoded as "2012-07-03" in the time column of our data for all rows and tables.

Platform (please complete the following information):

nmerket commented 3 years ago

@T-Man-Stan, here's the current state of things:

  1. Up until recently we had to store all our parquet files with this deprecated timeseries format because Spark and Athena liked them that way. It's the flavor='spark' stuff in postprocessing.py.
  2. In pyarrow 3.0, they changed that argument to something else, so newer outputs are being saved in the newer timeseries format that Athena doesn't like.
  3. Around the same time AWS released a new version of the Athena (prestodb) engine they're calling v2. This new version can read the new timestamp versions correctly.
  4. Yesterday I updated everyone's workgroups in Athena to use v2. (They were going to force the update eventually anyway.)
  5. Timestamps work again!

I think to "fix" this we remove the flavor='spark' and tell everyone to use Athena engine v2.

nmerket commented 3 years ago

In the mean time, your results will work if you just switch your workgroup to use Athena engine v2