datamill-co / target-postgres

A Singer.io Target for Postgres

Save properties not found in the jsonschema in a JSONB column #129

Open airhorns opened 5 years ago

airhorns commented 5 years ago

It makes a lot of sense that target-postgres tries to load data in a structured way into Postgres. The more that the database knows about the structure of the data, the faster and more efficient the database can be when querying it.

That said, lots of the Singer taps don't or can't know the exact, specific structure of the data they are extracting beforehand. Stitch itself deals with this in a variety of ways, but it does deal with it somehow. In my experience feeding web analytics event data from something like Segment into Redshift using Stitch, the event properties can be named whatever the heck developers want, and Stitch somewhat magically adds columns each time a new property is discovered. This is nice because no schema has to be updated anywhere and the pipeline doesn't break while someone figures out where to update it.

target-postgres doesn't need to be that smart -- after all, you could just use Stitch -- but I think it could do a bit better than just dropping data for unknown columns (see #127).

I think a nice middle ground would be to keep using the structured smarts that can create nice tables with good relational structure for things specified by the JSONSchema, but then load unrecognized properties into a JSONB column called additional_properties or something like that. This would allow us to at least get a handle on the data from Postgres land, and if you want to break it out into a real column or do something useful with it, you could do so with a transform or SQL view or whatever. JSONB support is really quite nice in recent Postgres with all the partial indexing support and whatnot, and while it wouldn't be as fast as "real" columns, it can get close.
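To make that concrete, here's roughly what it could look like from the SQL side once the data lands in JSONB. This is purely a sketch: the events table, the additional_properties column, and the utm_campaign key are all invented for the example.

```python
# Rough sketch only: assumes the target has already loaded an "events" table
# with unrecognized properties captured in a JSONB "additional_properties"
# column. All table/column/key names here are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=analytics")
with conn, conn.cursor() as cur:
    # Pull a property that was never declared in the JSON schema.
    cur.execute("""
        SELECT id, additional_properties ->> 'utm_campaign' AS utm_campaign
        FROM events
        WHERE additional_properties ? 'utm_campaign'
    """)
    rows = cur.fetchall()

    # A partial index on that key gets lookups close to "real column" speed.
    cur.execute("""
        CREATE INDEX events_utm_campaign_idx
        ON events ((additional_properties ->> 'utm_campaign'))
        WHERE additional_properties ? 'utm_campaign'
    """)
conn.close()
```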

I think this may be undesirable behaviour for some users as well though -- they may be purposefully ignoring bad-quality data in some columns, or discarding it because it's simply way too big. That's ok: Singer's catalog functionality has mechanisms for ignoring stuff that those people could use, and we should likely add a configuration option to disable this functionality in the event that folks don't want it. But if I've learned anything as a pipeline developer, it's that the data is probably there because someone thought it was important, so being able to capture it and save it somehow would be great.

Sadly, JSONB support only landed in Postgres 9.4, which is much more recent than the 8.4.22 version mentioned in DECISIONS.md. 8.4.22 is 5 years old and unsupported by the Postgres community at this point, whereas 9.4 is supported. So we could consider upgrading the minimum required Postgres version, but that seems like a bit much for a net-new feature that people have been getting by without so far. So I'd say it'd be best to just detect JSONB support and, if it's absent, warn when the config option hasn't been turned off, or something like that. Is there precedent for that kind of behaviour in the project?
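For what it's worth, the detection itself should be cheap. A rough sketch of what I have in mind -- the persist_additional_properties config key is invented, and psycopg2 reports the server version as an integer like 90624:

```python
# Sketch of detecting JSONB support at connect time: JSONB landed in 9.4,
# i.e. server_version_num >= 90400. The config key below is hypothetical.
import logging

import psycopg2

LOGGER = logging.getLogger(__name__)
JSONB_MIN_VERSION = 90400  # Postgres 9.4

def supports_jsonb(conn):
    # psycopg2 exposes the server version as an integer, e.g. 90624 for 9.6.24
    return conn.server_version >= JSONB_MIN_VERSION

config = {'persist_additional_properties': True}  # hypothetical option
conn = psycopg2.connect("dbname=analytics")

if config.get('persist_additional_properties') and not supports_jsonb(conn):
    LOGGER.warning(
        'persist_additional_properties is enabled, but this server (%s) predates '
        'JSONB support; consider disabling the option or upgrading Postgres',
        conn.server_version)
```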

I will prepare a PR if y'all are into it!

awm33 commented 5 years ago

@airhorns Stitch does handle pretty much anything you can place in JSON. The engine behind that is actually pretty complex (think of handling type changes on nested records, etc.) and closed source.

Mixpanel and some other SaaS sources come to mind where you can't always know the schema.

A JSONB column on objects with additionalProperties marked true in the schema is interesting. I think we'd want to make it an option to turn on in the config. Maybe name the column __sdc_additional_properties to follow the __sdc convention.
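Something like this rough sketch is what I'm picturing for the split itself -- split_record and the schema/record values are made up just to show the shape:

```python
# Sketch, not the target's actual code: split an incoming record into the
# properties declared by the stream's JSON schema and a leftover dict that
# would be serialized into a __sdc_additional_properties JSONB column.
import json

def split_record(record, stream_schema):
    declared = set(stream_schema.get('properties', {}))
    known = {k: v for k, v in record.items() if k in declared}
    extra = {k: v for k, v in record.items() if k not in declared}
    return known, extra

schema = {'type': 'object',
          'properties': {'id': {'type': 'integer'}},
          'additionalProperties': True}
record = {'id': 1, 'utm_campaign': 'spring-launch'}

known, extra = split_record(record, schema)
# known -> {'id': 1}
payload = json.dumps(extra)
# payload -> '{"utm_campaign": "spring-launch"}', destined for
# the __sdc_additional_properties column
```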

The complications come from:

AlexanderMann commented 5 years ago

Weighing in: maybe a simpler way to approach this is to ask what we ultimately want it to do.

We have some desire to persist data forward into our target so as to make sure nothing gets lost along the way, right? __sdc_additional_properties seems quite useful, and I'm thinking the simple solution is to have it default to TEXT and then use JSONB if it's available (etc.).
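i.e. the column DDL would just branch on what the server supports -- a rough sketch with invented names, not the target's actual DDL generation:

```python
# Sketch of "TEXT by default, JSONB when available". Table/column names are
# illustrative; conn is a psycopg2 connection (server_version >= 90400 is 9.4+).
def additional_properties_ddl(table_name, conn):
    column_type = 'JSONB' if conn.server_version >= 90400 else 'TEXT'
    return 'ALTER TABLE {} ADD COLUMN __sdc_additional_properties {}'.format(
        table_name, column_type)
```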

My questions would mainly be around:

Upsert for a row

In this instance, do we want to try to deep-merge with the value already present in the remote? If we can't, due to something like a list vs. an object vs. a scalar, do we overwrite with the latest?

Do we want to wipe out the column on each update and simply use the latest payload each time?

Upserting the Schema

Dropping a column

Our schema declared a value foo and now doesn't: do we move the data over to our JSONB column?

Our schema declared a value foo and now doesn't, but our payloads still include it. Do we want to drop the column and then start persisting the values into the JSONB column, or do we want to leave the column around?

Adding a column

Schemas did not contain field foo but our payloads did. Our schemas now contain field foo. Do we want to "migrate" over the values present in the JSONB column?
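For that "adding a column" case, the backfill could be a single statement along these lines -- sketch only, table/column names invented, and it assumes foo maps to a text column:

```python
# Hypothetical backfill once "foo" shows up in the schema: copy it out of the
# JSONB column into the new real column. Removing the key afterwards could use
# jsonb's "-" operator (Postgres 9.5+), left out here for simplicity.
import psycopg2

conn = psycopg2.connect("dbname=analytics")
with conn, conn.cursor() as cur:
    cur.execute("""
        UPDATE events
        SET foo = __sdc_additional_properties ->> 'foo'
        WHERE __sdc_additional_properties ? 'foo'
    """)
conn.close()
```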

What does Stitch do?

Well, aside from some of the "magic" which finds missing fields etc., it also has an "errors" table it uses for debugging. This table ends up having a link to the source table, the source pk, and then the payload.

We could do something like this with a table which has text columns etc.

Proposal

I think it'd be pretty straightforward to add a flag to the config to allow persisting things into a "bonus" column.

target-postgres and target-redshift have a good deal of logic in place for something like this. The main "trick" will be getting CSV uploads not to fight with the JSON payloads etc. (I think this will be fine).

As far as logic, I'd vote for a dumb "latest-write-wins" scenario, where whatever comes along last gets put into the bonus column. If this means that there actually aren't any additional properties on a payload, then it'd be an empty dict.
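Expressed directly in SQL just to show the semantics (the real target batches rows through CSV uploads), latest-write-wins would look roughly like this sketch -- table/column names invented, and ON CONFLICT needs Postgres 9.5+:

```python
# Sketch of "latest-write-wins": whatever additional properties arrive with
# the newest record replace the stored ones outright; an empty dict if there
# are none. Table/column names are illustrative.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=analytics")
with conn, conn.cursor() as cur:
    extra = {'utm_campaign': 'spring-launch'}  # leftover, unschema'd properties
    cur.execute("""
        INSERT INTO events (id, __sdc_additional_properties)
        VALUES (%s, %s)
        ON CONFLICT (id) DO UPDATE
        SET __sdc_additional_properties = EXCLUDED.__sdc_additional_properties
    """, (1, Json(extra)))
    # A shallow merge instead of an overwrite would be:
    #   SET __sdc_additional_properties =
    #     events.__sdc_additional_properties || EXCLUDED.__sdc_additional_properties
conn.close()
```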

AlexanderMann commented 5 years ago

So @airhorns, in my refactoring/future thinking today, I started piecing together what a PR for your proposal above would look like, and then I went a step further and started thinking through schema inspection.

Interestingly, that functionality relies upon our batching logic here, but not much else. You could also make the argument that it relies upon our denesting logic.

Both of those things are distinctly different from the role of the target itself; i.e., they make life simpler for the code that is ultimately concerned with persisting data to the remote. Due to this, and some ideas I've had brewing for a while, I started to think of transformations to the messages being output by taps as pipes.

Ideally, what we're talking about here with schema inspection is something that would transform a table_batch (a denested set of records the target should persist) into another table_batch, updating only the associated schemas.

To do this, I created a draft PR https://github.com/AlexanderMann/target-postgres/pull/29 which starts by ripping out a bunch of the stream transformation logic we already have and turning it into a series of Python generators, to make the stream-processing nature of everything as apparent as possible. (This already sort of happens when we represent the input stream as an iterator using io.TextIOWrapper.)
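To make the shape concrete, the pipeline ends up reading something like the sketch below. TableBatch and every function in it are simplified stand-ins, not the PR's actual code:

```python
# Shape of the idea only: every stage is a generator that consumes and yields
# table_batches, so schema inspection is just one more transformation composed
# between denesting and persistence.
from collections import namedtuple

TableBatch = namedtuple('TableBatch', ['schema', 'records'])

def denest(messages):
    for schema, records in messages:
        # Real denesting would also flatten nested objects/arrays into sub-tables.
        yield TableBatch(schema=dict(schema), records=list(records))

def inspect_schema(table_batches):
    for batch in table_batches:
        # Only the schema changes: add any properties observed in the records.
        observed = {k: {'type': ['null', 'string']}
                    for record in batch.records
                    for k in record
                    if k not in batch.schema.get('properties', {})}
        widened = dict(batch.schema)
        widened['properties'] = dict(widened.get('properties', {}), **observed)
        yield TableBatch(schema=widened, records=batch.records)

def persist(table_batches):
    for batch in table_batches:
        # The real target would COPY/INSERT the records here.
        print(len(batch.records), 'record(s), columns:', sorted(batch.schema['properties']))

messages = [({'properties': {'id': {'type': 'integer'}}},
             [{'id': 1, 'utm_campaign': 'spring-launch'}])]
persist(inspect_schema(denest(messages)))
```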

If you get a chance, I'd appreciate other eyes on the general concept of the code changes.

If this is more or less 👍 then adding a function which takes table_batches and inspects each record becomes pretty easy, and it also becomes something we could possibly turn into another command, i.e.:

tap-hubspot | pipe-inspect-schema | target-postgres >> state.json