subodh1810 closed this issue 3 years ago.
Hi @sherifnada and @subodh1810,
As you know, I was able to reproduce this OOM error locally. I tried to fix the issue with Debezium configuration parameters, but unfortunately that didn't help, so after deeper investigation I found that, in order to process big JSON blobs, it is recommended to use the wal2json plugin instead of the pgoutput plugin. I changed the configuration to the wal2json plugin and replication was successful.
So I would propose adding the option for customers to select the plugin type. Using the wal2json plugin requires additional installation on the customer's side, so we should indicate that it is recommended for replicating big JSON blobs.
Please let me know what you think about this solution and I'll start on the implementation.
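For context on what the plugin choice maps to: Debezium's Postgres connector selects the logical decoding plugin through a single "plugin.name" property, so the switch is mostly a configuration change. A minimal, hedged sketch; the class and slot names below are assumptions, not the actual Airbyte code:

```java
import java.util.Properties;

public class CdcPluginSketch {
  // Hedged sketch: Debezium's Postgres connector picks the logical decoding
  // plugin via "plugin.name"; everything else here is illustrative.
  static Properties debeziumProperties(final String pluginName) {
    final Properties props = new Properties();
    props.setProperty("connector.class",
        "io.debezium.connector.postgresql.PostgresConnector");
    props.setProperty("plugin.name", pluginName);   // "pgoutput" or "wal2json"
    props.setProperty("slot.name", "airbyte_slot"); // slot name is an assumption
    return props;
  }
}
```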
@irynakruk can you explain what the root cause behind the OOM was? Also, changing from pgoutput to wal2json seems like a big task. Will it be backward compatible (will existing connectors continue to sync data if we make the change from pgoutput to wal2json)? Also, can't we reduce the queue size here to tackle this?
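For reference, one set of knobs relevant to memory pressure here are Debezium's own queue/batch sizing properties, which bound how many change events are buffered before they are handed to the connector. A hedged sketch with illustrative values (not recommendations):

```java
import java.util.Properties;

public class DebeziumQueueTuningSketch {
  // Hedged sketch of Debezium's queue/batch sizing knobs; smaller values mean
  // fewer change events (and therefore fewer large JSON blobs) buffered in memory.
  static Properties queueTuning() {
    final Properties props = new Properties();
    props.setProperty("max.queue.size", "1024");              // default is 8192 events
    props.setProperty("max.batch.size", "256");               // default is 2048 events
    props.setProperty("max.queue.size.in.bytes", "52428800"); // optional ~50 MB byte cap
    return props;
  }
}
```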
@subodh1810 sure, I shared my steps for reproducing the OOM below.
I read that the wal2json plugin had a similar problem with OOM, which was fixed, so I tried it and replication was successful.
Regarding the change from pgoutput to the wal2json plugin: I propose not switching entirely to wal2json, but adding a dropdown with the desired plugin so the user can select a suitable one, keeping pgoutput as the default so there is no impact on existing connectors. Only two places need to be edited: PostgresCdcProperties and PostgresSource.
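For illustration, a minimal sketch of what that change could look like, assuming a hypothetical plugin field in the connector config (not the actual Airbyte code):

```java
import com.fasterxml.jackson.databind.JsonNode;
import java.util.Properties;

public class PostgresCdcPropertiesSketch {
  // Hypothetical helper: read the plugin choice from the connector config and
  // fall back to "pgoutput" so existing connectors keep working unchanged.
  static String pluginName(final JsonNode config) {
    return config.hasNonNull("plugin") ? config.get("plugin").asText() : "pgoutput";
  }

  static Properties debeziumProperties(final JsonNode config) {
    final Properties props = new Properties();
    props.setProperty("plugin.name", pluginName(config)); // "pgoutput" or "wal2json"
    // ...the rest of the existing Debezium properties would stay as they are...
    return props;
  }
}
```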
@irynakruk thanks for the update. Would the users also have to make any change in their database to enable wal2json?
@subodh1810 it depends on the user's database. For example, Azure Database for Postgres, PostgreSQL on Amazon RDS, and Amazon Aurora PostgreSQL RDS all have the wal2json extension available by default, since Postgres 9.5 and higher. Other types of instances (e.g. bare metal, VMs (EC2/GCE/etc.), Docker, etc.) may require an additional installation of the wal2json plugin.
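One way to check whether wal2json is actually usable on a given server is to try creating (and immediately dropping) a logical replication slot with it. A hedged JDBC sketch; the connection details and slot name are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class Wal2JsonProbe {
  public static void main(String[] args) {
    // Placeholders: adjust host, database, and credentials; the user needs
    // permission to create replication slots (e.g. the REPLICATION attribute).
    try (Connection conn = DriverManager.getConnection(
             "jdbc:postgresql://localhost:5432/postgres", "postgres", "password");
         Statement stmt = conn.createStatement()) {
      // Slot creation fails if the wal2json output plugin is not installed.
      stmt.execute("SELECT pg_create_logical_replication_slot('wal2json_probe', 'wal2json')");
      stmt.execute("SELECT pg_drop_replication_slot('wal2json_probe')");
      System.out.println("wal2json appears to be available");
    } catch (SQLException e) {
      System.out.println("wal2json (or the connection) is not available: " + e.getMessage());
    }
  }
}
```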
Hey @irynakruk, I think if that's the case then we can go ahead and make the change to use wal2json.
Please make sure that any change we make is backward compatible for existing connectors, but for new connectors we should pick the best default options.
@subodh1810 sure, I'll set the pgoutput plugin as the default option, so existing connectors will not be affected at all.
@irynakruk Is there a reason to not use wal2json by default if it exists on the DB server?
I would love to hide this complexity from the user if possible because it's a very technical choice.
@sherifnada it is not always installed by default with PostgreSQL, so it may require additional installation of wal2json from the user. As I mentioned above, Azure Database for Postgres, PostgreSQL on Amazon RDS, and Amazon Aurora PostgreSQL RDS all have the wal2json extension added by default, but other types of DB servers may not.
Do we know which DB servers are most commonly used by our users?
It just came to my mind that switching completely to a new plugin would affect existing connections.
@irynakruk, @yaroslav-hrytsaienko and I had a call about this today, and the outcome was to stick with the pgoutput plugin and use it as the default. The prime reason is that in order to use wal2json, a user might have to perform extra steps on their Postgres database (it comes as a default plugin in cloud Postgres, but some bare metal instances might not have it). Secondly, pgoutput works in more general-purpose scenarios. Using pgoutput as the default means users don't have to think about which plugin to choose if they are not sure what the choice would mean. It is also required for backward compatibility, since there are already Postgres connectors running that use it.

Thanks for the updates everyone!
A user reported experiencing an OOM with Postgres CDC. The Postgres CDC connector was migrated to use the new CDC abstraction, which means it is already using the capped queue. The table loan_applications is huge: it contains lots of columns, a few of them being JSON columns holding big JSON blobs, and it has around 104332 records. The user was running the sync on a t3.xlarge machine. The table is 1.6 GB in size and the JSON blobs can be as big as 47499 bytes.
My gut feeling is that this OOM has to do with the JSON blobs being too big to fit in memory.
Ref: https://airbytehq.slack.com/archives/C01MFR03D5W/p1626281129333800?thread_ts=1626122544.233800&cid=C01MFR03D5W
Attachments: logs-54-2.txt, loan_applications.txt