Closed jpechane closed 4 years ago
oldkeys. However, the concept of row identification is wider than primary key (ref REPLICA IDENTITY). We can use the full row, an unique index, or primary key to identity a row.
I can use FULL IDENTITY
replica to get the orgiginal content of the row that is updated in oldkeys
. But in this case I am losing the primary key information. So providing a PK flag per field would allow me eat a cake and have it too.
I'm not following you. Why would you use FULL
? If you don't, primary key is used as replica identity, hence, oldkeys
is the primary key.
Please look at http://debezium.io/docs/connectors/postgresql/, section Update events
.
I'd like to have an event that would contain previoud values of the row before it was change and the new values. To get this I need to set REPLICA IDENTITY FULL
but in this case I will not be able to identify which colmuns forms primary key.
I read the docs but I'm not sure what kind of problem you are trying to solve. For updates, we provide key and new tuple. Do you want the old tuple too?
Yes, exactly!
@jpechane is there any interest in this feature?
@eulerto Hi, we still have a strong interest in this feature. Unfortunately our skillset blocks us from implementing and providing a PR
@eulerto, my company (Braintree) has actually started implementing this feature in an internal fork.
Our idea was to change the definition of oldkeys
slightly and introduce a new field called oldvalues
.
Here's an outline of the scenarios:
If replica identity is DEFAULT, FULL, or an index:
oldkeys
will be the primary key columns from the new tuple if a pk index existsoldvalues
will not be printedIf replica identity is DEFAULT:
oldkeys
will be the primary key columns from the old tuple if a pk index existsoldvalues
will not be printedIf replica identity is FULL:
oldkeys
will be the primary key columns from the old tuple if a pk index existsoldvalues
will be the entire old tuple with null values omittedIf replica identity is an index:
oldkeys
will not be printed, as the old tuple may not contain the old pk valuesoldvalues
will be the columns covered by the replica identity index from the old tuple with null values omittedIf replica identity is DEFAULT:
oldkeys
will be the primary key columns from the old tuple (note that if a pk index doesn't exist, we'll bail out early in the sanity checks and nothing will be outputted)oldvalues
will not be printedIf replica identity is FULL:
oldkeys
will be the primary key columns from the old tuple if a pk index existsoldvalues
will be the entire old tuple with null values omittedIf replica identity is an index:
oldkeys
will not be printed, as the old tuple may not contain the old pk valuesoldvalues
will be the columns covered by the replica identity index from the old tuple with null values omittedI'm sure there are some weird edge cases that I didn't mention, but in summary: oldkeys
will print primary key information and oldvalues
will print replica identity for updates / deletes iff the value differs from oldkeys
(mainly due to performance concerns).
To transition from DEFAULT to FULL, this approach makes a lot of sense for our use case, as it simply introduces a new field while oldkeys
remains unchanged.
I'd love to hear your thoughts on this proposal!
Thanks!
Hey @rkrage, did you ever end up publishing this work somewhere?
We've had some discussions about publishing our internal fork, but I'm not sure where we landed on it. I'll bring it up again and get back to you!
Also, we've been running this feature in production for the past several months with no issues 😃
Hey @rkrage, did you ever decide to publish your work? Also, does the feature work with format-version 2? I'm specifically interested in how you were able to get the entire tuple for DELETE records.
@deanrabinowitz4, we haven't been able to publish our work, unfortunately. But I didn't think we did anything special to get the entire tuple for DELETEs. I thought you just needed to set replica identity to full. As far as determining the PK when replica identity is full, I believe this is the code that inspired our work: https://github.com/postgres/postgres/blob/7551d9bc408c2402a8558367ee950ca403e25b37/contrib/tcn/tcn.c#L121-L139
Our fork doesn't currently support format version 2. We were hoping the open-source version would evolve to suit our needs, and it looks like it's getting close.
@eulerto, we had previously discussed exposing the PK in the version 2 proposal: https://github.com/eulerto/wal2json/pull/110#issuecomment-495681821
Is that still in the works? Thanks!
Since this is a popular request, I developed a patch to add primary key information (both formats). If I have some spare time to improve test coverage for this feature, I will commit it this week.
Done on commit 35113849acb24038a5e3f492358f407703b6b2d1
When the event is processed downstream it is important to know what columns forms the primary key. Could you please provide a flag for columns that says that it is a part of primary key?