Provide a primary key flag for columns

jpechane commented 7 years ago

When the event is processed downstream it is important to know what columns forms the primary key. Could you please provide a flag for columns that says that it is a part of primary key?

eulerto commented 7 years ago

oldkeys. However, the concept of row identification is wider than primary key (ref REPLICA IDENTITY). We can use the full row, an unique index, or primary key to identity a row.

jpechane commented 7 years ago

I can use FULL IDENTITY replica to get the orgiginal content of the row that is updated in oldkeys. But in this case I am losing the primary key information. So providing a PK flag per field would allow me eat a cake and have it too.

eulerto commented 7 years ago

I'm not following you. Why would you use FULL? If you don't, primary key is used as replica identity, hence, oldkeys is the primary key.

jpechane commented 7 years ago

Please look at http://debezium.io/docs/connectors/postgresql/, section Update events. I'd like to have an event that would contain previoud values of the row before it was change and the new values. To get this I need to set REPLICA IDENTITY FULL but in this case I will not be able to identify which colmuns forms primary key.

eulerto commented 7 years ago

I read the docs but I'm not sure what kind of problem you are trying to solve. For updates, we provide key and new tuple. Do you want the old tuple too?

jpechane commented 7 years ago

Yes, exactly!

eulerto commented 6 years ago

@jpechane is there any interest in this feature?

jpechane commented 6 years ago

@eulerto Hi, we still have a strong interest in this feature. Unfortunately our skillset blocks us from implementing and providing a PR

rkrage commented 6 years ago

@eulerto, my company (Braintree) has actually started implementing this feature in an internal fork.

Our idea was to change the definition of oldkeys slightly and introduce a new field called oldvalues.

Here's an outline of the scenarios:

Insert

If replica identity is DEFAULT, FULL, or an index:

oldkeys will be the primary key columns from the new tuple if a pk index exists
oldvalues will not be printed

Update

If replica identity is DEFAULT:

oldkeys will be the primary key columns from the old tuple if a pk index exists
oldvalues will not be printed

If replica identity is FULL:

oldkeys will be the primary key columns from the old tuple if a pk index exists
oldvalues will be the entire old tuple with null values omitted

If replica identity is an index:

oldkeys will not be printed, as the old tuple may not contain the old pk values
oldvalues will be the columns covered by the replica identity index from the old tuple with null values omitted

Delete

If replica identity is DEFAULT:

oldkeys will be the primary key columns from the old tuple (note that if a pk index doesn't exist, we'll bail out early in the sanity checks and nothing will be outputted)
oldvalues will not be printed

If replica identity is FULL:

oldkeys will be the primary key columns from the old tuple if a pk index exists
oldvalues will be the entire old tuple with null values omitted

If replica identity is an index:

oldkeys will not be printed, as the old tuple may not contain the old pk values
oldvalues will be the columns covered by the replica identity index from the old tuple with null values omitted

I'm sure there are some weird edge cases that I didn't mention, but in summary: oldkeys will print primary key information and oldvalues will print replica identity for updates / deletes iff the value differs from oldkeys (mainly due to performance concerns).

To transition from DEFAULT to FULL, this approach makes a lot of sense for our use case, as it simply introduces a new field while oldkeys remains unchanged.

I'd love to hear your thoughts on this proposal!

Thanks!

gunnarmorling commented 5 years ago

Hey @rkrage, did you ever end up publishing this work somewhere?

rkrage commented 5 years ago

We've had some discussions about publishing our internal fork, but I'm not sure where we landed on it. I'll bring it up again and get back to you!

Also, we've been running this feature in production for the past several months with no issues 😃

deanrabinowitz4 commented 4 years ago

Hey @rkrage, did you ever decide to publish your work? Also, does the feature work with format-version 2? I'm specifically interested in how you were able to get the entire tuple for DELETE records.

rkrage commented 4 years ago

@deanrabinowitz4, we haven't been able to publish our work, unfortunately. But I didn't think we did anything special to get the entire tuple for DELETEs. I thought you just needed to set replica identity to full. As far as determining the PK when replica identity is full, I believe this is the code that inspired our work: https://github.com/postgres/postgres/blob/7551d9bc408c2402a8558367ee950ca403e25b37/contrib/tcn/tcn.c#L121-L139

Our fork doesn't currently support format version 2. We were hoping the open-source version would evolve to suit our needs, and it looks like it's getting close.

@eulerto, we had previously discussed exposing the PK in the version 2 proposal: https://github.com/eulerto/wal2json/pull/110#issuecomment-495681821

Is that still in the works? Thanks!

eulerto commented 4 years ago

Since this is a popular request, I developed a patch to add primary key information (both formats). If I have some spare time to improve test coverage for this feature, I will commit it this week.

eulerto commented 4 years ago

Done on commit 35113849acb24038a5e3f492358f407703b6b2d1

eulerto / wal2json