airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

✨ Source: Postgres - CDC support for TOAST columns #30518

Open tybernstein opened 1 year ago

tybernstein commented 1 year ago

What area does the feature impact?

Connectors

Relevant Information

Currently the Postgres Source Connector delivers __debezium_unavailable_value when syncing TOAST columns. We should instead deliver the actual content of the columns.

mackro-rocky commented 11 months ago

Not being able to sync TOAST columns seems like a fairly high priority issue, no?

Reidsy commented 1 month ago

Currently we are seeing __debezium_unavailable_value for json data types and java.lang.Object@2c5e36a7 (the 8 hex digits after the @ vary) for varchar data types. These values are scattered throughout the data output, which makes it difficult to understand the scope of the problem.

This occurs on any row with a TOAST column. If the value in the TOAST column is changed, the field is updated correctly. If a column other than the TOAST column is updated, the TOAST field is incorrectly overwritten with __debezium_unavailable_value or java.lang.Object@xxxxxxxx, depending on the data type.

Agree with the above comment, this seems fairly high priority as any records with TOAST columns will be affected.
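To get a rough idea of the scope described above, one option is to scan already-synced records for the two placeholder shapes quoted in this thread. The following is only a minimal sketch: the function names and the sample records are hypothetical, and it assumes the placeholders arrive as top-level string values in each record, which may not match every destination.

```python
import re
from collections import Counter
from typing import Any, Iterable, Mapping

# Placeholder string quoted in this thread for unchanged TOAST values.
DEBEZIUM_PLACEHOLDER = "__debezium_unavailable_value"
# Shape of the other value quoted in this thread, e.g. java.lang.Object@2c5e36a7.
JAVA_OBJECT_RE = re.compile(r"^java\.lang\.Object@[0-9a-f]+$")


def is_toast_placeholder(value: Any) -> bool:
    """Return True if value looks like one of the two placeholders (hypothetical helper)."""
    if not isinstance(value, str):
        return False
    return value == DEBEZIUM_PLACEHOLDER or JAVA_OBJECT_RE.match(value) is not None


def count_placeholders(records: Iterable[Mapping[str, Any]]) -> Counter:
    """Count, per column name, how many records carry a placeholder value."""
    counts: Counter = Counter()
    for record in records:
        for column, value in record.items():
            if is_toast_placeholder(value):
                counts[column] += 1
    return counts


# Made-up sample rows, for illustration only.
sample = [
    {"id": 1, "payload": "__debezium_unavailable_value", "note": "ok"},
    {"id": 2, "payload": '{"a": 1}', "note": "java.lang.Object@2c5e36a7"},
    {"id": 3, "payload": "__debezium_unavailable_value", "note": "fine"},
]
print(count_placeholders(sample))
```

Columns with a non-zero count are candidates for being TOAST columns affected by this issue.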

tanderson-hp commented 1 month ago

@Reidsy not sure if this is an option for you, but we were seeing the same thing, both the __debezium_unavailable_value placeholders and the java.lang.Object values, and solved the problem by changing the replica identity in Postgres to FULL for any tables that had TOAST columns.

This article (https://debezium.io/blog/2019/10/08/handling-unchanged-postgres-toast-values/) from Debezium helped us understand the problem better and led us to the solution.
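For anyone trying the same workaround, the replica identity change is a single ALTER TABLE statement per table. Here is a small sketch that only builds the statements so they can be reviewed before being run; the helper names and the example table names are hypothetical, and running the output requires sufficient privileges on each table.

```python
from typing import Iterable, List, Tuple


def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def replica_identity_full_statements(tables: Iterable[Tuple[str, str]]) -> List[str]:
    """Build one ALTER TABLE ... REPLICA IDENTITY FULL statement per (schema, table) pair."""
    return [
        f"ALTER TABLE {quote_ident(schema)}.{quote_ident(table)} REPLICA IDENTITY FULL;"
        for schema, table in tables
    ]


# Hypothetical tables that contain TOAST columns.
for statement in replica_identity_full_statements([("public", "events"), ("public", "documents")]):
    print(statement)
```

Note that with REPLICA IDENTITY FULL, Postgres logs the full old row for updates and deletes, so WAL volume on these tables will likely grow.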

enkeboll commented 1 month ago

@tanderson-hp when you changed to FULL for the affected tables, did you notice your Airbyte usage going up? Since Airbyte bills on volume and not row count, I'm worried about this spiking usage on what's already one of our most expensive tables to sync.