HazyResearch / deepdive

DeepDive
deepdive.stanford.edu
1.96k stars · 539 forks

Add timestamp conversion from Postgres dump to JSON #542

Closed shahin closed 8 years ago

shahin commented 8 years ago

The pgtsv_to_json module is the fallback for database systems that don't support to_json(). It's called when I run on Greenplum, but I have timestamps in my table. Without this commit, the module terminates on a ValueError. With this commit, timestamp input columns are included in the JSON output.
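To make the failure mode concrete, here is a hypothetical sketch (not the actual `pgtsv_to_json` internals) of how a per-column type dispatch can raise a ValueError on timestamp text, and how adding a timestamp branch avoids it. The function name and type-name strings are assumptions for illustration:

```python
from datetime import datetime

def convert_value(value, pg_type):
    # Hypothetical sketch: coerce a TSV field by its declared Postgres type.
    # Without the timestamp branch, a timestamp column would fall through to
    # a numeric parse and raise ValueError.
    if pg_type in ("int", "integer", "bigint"):
        return int(value)
    if pg_type in ("float", "real", "double precision"):
        return float(value)
    if pg_type == "timestamp without time zone":
        # Parse the psql text format and re-emit ISO 8601 (with the "T").
        ts = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        return ts.isoformat()
    return value
```

With the timestamp branch, `convert_value("2016-01-02 03:04:05", "timestamp without time zone")` yields an ISO 8601 string instead of raising.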

netj commented 8 years ago

I meant we can set the time zone in the psql that feeds input to the script, such that the UTC offset we add is always correct. The script doesn't have to be general, since pgtsv_to_json is really just a shim for working around a limitation in some db-drivers and isn't intended for general use outside of that.

shahin commented 8 years ago

@alldefector Good suggestion. I tried passing the timestamp string through unchanged, but jq can't handle that particular format. This suggests to me that other JSON tools won't either, even if ES does.

@netj Our timestamps are time zone naive and don't have offsets (they use Postgres' timestamp without time zone type), so there is no time zone setting in psql that can result in a correct offset down the line. We could change these columns to timestamp with time zone but now we're asking the application to support time zones for a really obscure technical reason.

I pushed 0a794ff in an attempt to resolve this. Before this change, pgtsv_to_json didn't support any timestamps. After this change, pgtsv_to_json explicitly supports the timestamp without time zone type without qualification. Looks to me like jq and Elastic support the output without changes elsewhere.

Please let me know if you think it produces inaccurate results, or if it doesn't support a feature that we need!

netj commented 8 years ago

I don't understand why anyone should ever want to use timestamp without time zone (it doesn't make sense: a timestamp should refer to an absolute point in time, not be subject to interpretation). http://stackoverflow.com/a/6158432

It's fine to use any time zone in the JSON output as long as we are consistent with the data source. Since pgtsv_to_json is used only once in our codebase (https://github.com/HazyResearch/deepdive/search?utf8=✓&q=pgtsv_to_json), we can just put the following line before the COPY statement to ensure timestamps are serialized in a consistent time zone:

SET timezone = 'UTC';

Not sure why pgtsv_to_json needs any change from the first version. This is a problem in the data source, not the data encoder.

BTW, JSON defines one and only one standard datetime format, ISO 8601, which is also used by XML Schema, etc., so I agree we shouldn't produce some nonstandard format. It's no longer really JSON if we try to sneak in some weird datetime format.

shahin commented 8 years ago

@netj I agree, it's weird and certainly incorrect to leave out time zone info! Unfortunately adding time zone support to our app is way beyond the scope of my current task (moving specific processing from Postgres to Greenplum).

I don't think I made the motivation clear in my initial PR, and my original patch was definitely flawed. Sorry about that! The motivation for this PR is twofold:

  1. pgtsv_to_json terminates on a ValueError when dumping timestamps of any kind, and
  2. jq crashes when we send it psql-formatted timestamp strings, which are missing the "T" character and, as you pointed out, are not valid ISO 8601.
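The second problem comes down to the separator character: psql prints timestamps with a space between the date and time, while ISO 8601 uses "T". A minimal sketch of the fix, assuming the input is the psql text format (the function name is hypothetical):

```python
def pg_timestamp_to_iso8601(ts):
    # psql prints e.g. "2016-01-02 03:04:05.123"; ISO 8601 requires a "T"
    # separator between the date and time parts. Fractional seconds pass
    # through unchanged, and no UTC offset is appended because the source
    # column is time-zone naive.
    return ts.replace(" ", "T", 1)
```

For example, `pg_timestamp_to_iso8601("2016-01-02 03:04:05.123")` gives `"2016-01-02T03:04:05.123"`, which jq and Elasticsearch can consume.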

Setting psql's time zone to UTC for this one dump doesn't fix either of those problems, but it does make our timestamps incorrect until we support time zones from the very beginning of the pipeline. As it stands, the data source (PostgreSQL/Greenplum) doesn't contain time zone information, and no combination of psql runtime directives can create that information out of thin air.

Would you be open to my approach in 0a794ff along with a promise to open a ticket for proper time zone handling in downstream apps? I'd love to fix that at some point.

alldefector commented 8 years ago

@shahin Related: the missing "T" issue is a bug in Postgres's to_json that was fixed only in 9.5... http://postgresql.nabble.com/BUG-12578-row-to-json-and-to-json-add-T-in-timestamp-field-td5834396.html

netj commented 8 years ago

OK, makes sense now. Keeping a way to pass through incomplete data may be useful anyway.

alldefector commented 8 years ago

LGTM

shahin commented 8 years ago

@alldefector Looks like CI is failing, but I don't think it's caused by this diff. Can I merge and push on the command line anyway? Should I open a ticket for fixing CI here?