Closed · shahin closed this 8 years ago
I meant we can set the time zone in the psql that feeds input to the script such that the UTC offset we add is always correct. The script doesn't have to be general, since `pgtsv_to_json` is really just a shim for working around a limitation in some db-drivers and not intended for general use outside of it.
@alldefector Good suggestion. I tried passing the timestamp string through unchanged, but `jq` can't handle that particular format. This suggests to me that other JSON tools won't either, even if ES does.
@netj Our timestamps are time zone naive and don't have offsets (they use Postgres' `timestamp without time zone` type), so there is no time zone setting in psql that can result in a correct offset down the line. We could change these columns to `timestamp with time zone`, but then we'd be asking the application to support time zones for a really obscure technical reason.
I pushed 0a794ff in an attempt to resolve this. Before this change, `pgtsv_to_json` didn't support any timestamps. After this change, `pgtsv_to_json` explicitly supports the `timestamp without time zone` type without qualification. It looks to me like `jq` and Elastic support the output without changes elsewhere.
Please let me know if you think it produces inaccurate results, or if it doesn't support a feature that we need!
I don't understand why anyone should ever want to use `timestamp without time zone` (it doesn't make sense: a timestamp is a reference to an absolute point in time and shouldn't be subject to interpretation). http://stackoverflow.com/a/6158432
It's fine to use any time zone in the JSON output as long as we are consistent with the data source. Since `pgtsv_to_json` is used only once in our codebase (https://github.com/HazyResearch/deepdive/search?utf8=✓&q=pgtsv_to_json), we can just put the following line before the COPY statement to ensure timestamps get serialized in a consistent time zone:
SET timezone = 'UTC';
Not sure why `pgtsv_to_json` needs any change from the first version. This is a problem in the data source, not the data encoder.
BTW, the de facto standard datetime format for JSON is ISO 8601, which is also used by XML Schema, etc., so I agree we shouldn't produce some nonstandard format. It's no longer interoperable if we try to sneak in some weird datetime format.
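For reference, the standard library already emits this ISO 8601 form; a minimal sketch (the `iso8601_default` helper name is mine, not from the codebase):

```python
import json
from datetime import datetime

def iso8601_default(obj):
    """Serialize datetimes as ISO 8601 strings for json.dumps."""
    if isinstance(obj, datetime):
        return obj.isoformat()  # uses the "T" separator ISO 8601 expects
    raise TypeError("not JSON serializable: %r" % (obj,))

print(json.dumps({"ts": datetime(2016, 3, 1, 12, 34, 56)},
                 default=iso8601_default))
# {"ts": "2016-03-01T12:34:56"}
```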
@netj I agree, it's weird and certainly incorrect to leave out time zone info! Unfortunately adding time zone support to our app is way beyond the scope of my current task (moving specific processing from Postgres to Greenplum).
I don't think I made the motivation clear in my initial PR, and my original patch was definitely flawed. Sorry about that! The motivation for this PR is twofold:
1. `pgtsv_to_json` terminates on a ValueError when dumping timestamps of any kind, and
2. `jq` crashes when we send it psql-formatted timestamp strings, which are missing the "T" character and, as you pointed out, are not valid ISO 8601.

Seems like setting psql's time zone to UTC for this one dump doesn't fix those problems, but does make our timestamps incorrect until we support time zones from the very beginning of the pipeline. As it stands, the data source (PostgreSQL/Greenplum) doesn't contain time zone information, and no combination of psql runtime directives can create that information out of thin air.
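To make that concrete, here is a minimal sketch (not the actual 0a794ff patch) of the conversion involved: psql prints naive timestamps with a space separator, the lossless fix is just to re-emit them with the "T", and an offset can only be attached by assuming a zone the data never recorded:

```python
from datetime import datetime, timezone

# psql prints "timestamp without time zone" values like this (no "T", no offset):
raw = "2016-03-01 12:34:56"

# Lossless fix: parse and re-emit as ISO 8601, which only inserts the "T".
dt = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")
print(dt.isoformat())                               # 2016-03-01T12:34:56

# An offset can only be attached by *assuming* a zone (here UTC) that the
# source data doesn't contain:
print(dt.replace(tzinfo=timezone.utc).isoformat())  # 2016-03-01T12:34:56+00:00
```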
Would you be open to my approach in 0a794ff along with a promise to open a ticket for proper time zone handling in downstream apps? I'd love to fix that at some point.
@shahin Related: the missing "T" issue is a bug in Postgres's `to_json` that was fixed only in 9.5:
http://postgresql.nabble.com/BUG-12578-row-to-json-and-to-json-add-T-in-timestamp-field-td5834396.html
OK, makes sense now. Keeping a way to pass through incomplete data may be useful anyway.
LGTM
@alldefector Looks like CI is failing, but I don't think it's caused by this diff. Can I merge and push on the command line anyway? Should I open a ticket for fixing CI here?
The `pgtsv_to_json` module is the fallback for database systems that don't support `to_json()`. It's called when I run on Greenplum, but I have timestamps in my table. Without this commit, the module terminates on a ValueError. With this commit, timestamp input columns are included in the JSON output.