airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Postgres CDC OOM issue #4746

Closed subodh1810 closed 3 years ago

subodh1810 commented 3 years ago

A user reported experiencing OOM with Postgres CDC. The Postgres CDC source was migrated to the new CDC abstraction, which means it is already using the capped queue. The table loan_applications is huge: it contains many columns, a few of which are JSON columns holding big JSON blobs, and it has around 104,332 records. The user was running the sync on a t3.xlarge machine. The table is about 1.6 GB in size:

rmenezes=> SELECT pg_size_pretty( pg_total_relation_size('loan_applications') );
 pg_size_pretty 
----------------
 1656 MB
(1 row)

and the JSON blobs can be as big as 47,499 bytes:

 rmenezes=> select
rmenezes->   max(pg_column_size(cross_river_loan)),
rmenezes->   max(pg_column_size(derived_values)),
rmenezes->   max(pg_column_size(transunion_prequal_credit_data)),
rmenezes->   max(pg_column_size(transunion_hard_credit_data)),
rmenezes->   max(pg_column_size(socure_fraud_data)),
rmenezes->   max(pg_column_size(experiment_params)),
rmenezes->   max(pg_column_size(start_page_query_params)),
rmenezes->   max(pg_column_size(smarty_streets_result)),
rmenezes->   max(pg_column_size(offers)),
rmenezes->   max(pg_column_size(selected_offer)),
rmenezes->   max(pg_column_size(pre_application_offer)),
rmenezes->   max(pg_column_size(contract_offer)),
rmenezes->   max(pg_column_size(disburse_offer)),
rmenezes->   max(pg_column_size(possible_first_payment_dates)),
rmenezes->   max(pg_column_size(financial_goals))
rmenezes-> from loan_applications;
 column                          | max(pg_column_size)
---------------------------------+---------------------
 cross_river_loan                |
 derived_values                  |   7027
 transunion_prequal_credit_data  |  17726
 transunion_hard_credit_data     |   2285
 socure_fraud_data               |  10533
 experiment_params               |    280
 start_page_query_params         |    926
 smarty_streets_result           |   1268
 offers                          |  47499
 selected_offer                  |   4234
 pre_application_offer           |    266
 contract_offer                  |   1801
 disburse_offer                  |   1801
 possible_first_payment_dates    |    117
 financial_goals                 |     66

My gut feeling is that this OOM has to do with the JSON blobs being too big to fit in memory: even with the capped queue, a few thousand buffered records at up to ~47 KB each can add up to hundreds of MB on the heap.

Ref: https://airbytehq.slack.com/archives/C01MFR03D5W/p1626281129333800?thread_ts=1626122544.233800&cid=C01MFR03D5W (attachments: logs-54-2.txt, loan_applications.txt)

sherifnada commented 3 years ago

Next steps:

  1. reliably reproduce this issue locally
  2. if the issue is happening because the JSON blobs are too big to fit in memory, verify whether reducing the Debezium batch size from 10000 to something smaller (e.g. 1k) helps; if so, we should make it configurable by the user (see the sketch below)
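
A minimal sketch of what lowering Debezium's buffering could look like. `max.batch.size` and `max.queue.size` are standard Debezium embedded-engine properties; the concrete values below are illustrative, not what Airbyte ships with.

```java
import java.util.Properties;

// Sketch only: lower Debezium's in-memory buffering so a single poll of very
// large JSON rows cannot exhaust the heap. The values here are illustrative.
public class DebeziumQueueTuning {

  public static Properties cdcBufferingProperties() {
    final Properties props = new Properties();
    props.setProperty("max.batch.size", "1000");  // records per poll (the report mentions 10000)
    props.setProperty("max.queue.size", "2000");  // records buffered ahead of the poll loop;
                                                  // Debezium expects this to exceed max.batch.size
    return props;
  }
}
```
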
irynakruk commented 3 years ago

Hi @sherifnada and @subodh1810 ,

As you know, I was able to reproduce this OOM error locally. I tried to fix the issue with Debezium configuration parameters, but unfortunately that didn't help. After deeper investigation I found that, in order to process big JSON blobs, it is recommended to use the wal2json plugin instead of the pgoutput plugin. I changed the configuration to the wal2json plugin and replication was successful.

So I would propose adding the ability for customers to select the plugin type. Using the wal2json plugin requires additional installation on the customer's side, so we should indicate that it is recommended for replicating big JSON blobs.

Please let me know what you think about this solution and I'll start with implementation.

subodh1810 commented 3 years ago

@irynakruk can you explain what the root cause of the OOM was? Also, changing from pgoutput to wal2json seems like a big task. Will it be backward compatible (will existing connectors continue to sync data if we switch from pgoutput to wal2json)? Also, can't we reduce the queue size here to tackle this?

irynakruk commented 3 years ago

@subodh1810 sure, I've shared my steps for reproducing the OOM below.

I read that the wal2json plugin had a similar OOM problem, which has since been fixed, so I tried it and replication was successful.

Regarding the change from pgoutput to wal2json: I propose not to switch entirely to wal2json, but to add a dropdown with the desired plugin so the user can select a suitable one, keeping pgoutput as the default so there is no impact on existing connectors. Only two places need to be edited: PostgresCdcProperties and PostgresSource (see the sketch below).
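
A minimal sketch of what that dropdown-driven selection might look like, assuming a hypothetical `plugin` field in the connector config; only the class name PostgresCdcProperties comes from this thread, the rest is illustrative.

```java
import java.util.Properties;
import com.fasterxml.jackson.databind.JsonNode;

// Sketch of the proposed change, not the actual Airbyte implementation.
public class PostgresCdcProperties {

  static Properties getDebeziumProperties(final JsonNode config) {
    final Properties props = new Properties();
    // Default to pgoutput so existing connections keep working; use wal2json
    // only when the user explicitly selects it in the new dropdown.
    final String plugin = config.has("plugin") ? config.get("plugin").asText() : "pgoutput";
    props.setProperty("plugin.name", plugin);
    return props;
  }
}
```
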

subodh1810 commented 3 years ago

@irynakruk thanks for the update. Would users also have to make any changes in their database to enable wal2json?

irynakruk commented 3 years ago

@subodh1810 it depends on the user's database. For example, Azure Database for PostgreSQL, PostgreSQL on Amazon RDS, and Amazon Aurora PostgreSQL all have the wal2json plugin available by default on Postgres 9.5 and higher. Other types of instances (e.g. bare metal, VMs (EC2/GCE/etc.), Docker, etc.) may require additional installation of the wal2json plugin.

subodh1810 commented 3 years ago

Hey @irynakruk, if that's the case then I think we can go ahead and make the change to use wal2json. Please make sure that any change we make is backward compatible for existing connectors, but for new connectors we should pick the best default options.

irynakruk commented 3 years ago

@subodh1810 sure, I'll set the pgoutput plugin as the default option, so existing connectors won't be affected at all.

sherifnada commented 3 years ago

@irynakruk Is there a reason to not use wal2json by default if it exists on the DB server?

I would love to hide this complexity from the user if possible because it's a very technical choice.

irynakruk commented 3 years ago

@sherifnada it is not always installed by default with PostgreSQL, so it may require an additional installation of wal2json by the user. As I mentioned above, Azure Database for PostgreSQL, PostgreSQL on Amazon RDS, and Amazon Aurora PostgreSQL all have the wal2json plugin available by default, but other types of DB servers may not. Do we know what the most common DB servers used by our users are?
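
For reference, one way the connector could detect wal2json availability automatically (a sketch only, not existing Airbyte behaviour) is to try opening and immediately dropping a throwaway logical replication slot with the wal2json output plugin; the helper class and slot name below are made up.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Illustrative probe only: creating a logical replication slot fails if the
// named output plugin is not installed on the server. Requires REPLICATION
// privileges and a free replication slot.
public final class Wal2JsonProbe {

  public static boolean isWal2JsonAvailable(final Connection connection) {
    try (Statement stmt = connection.createStatement()) {
      stmt.execute("SELECT pg_create_logical_replication_slot('airbyte_wal2json_probe', 'wal2json')");
      stmt.execute("SELECT pg_drop_replication_slot('airbyte_wal2json_probe')");
      return true;
    } catch (SQLException e) {
      return false;  // plugin missing, insufficient privileges, or no free slots
    }
  }
}
```
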

irynakruk commented 3 years ago

It just came to my mind that switching completely to a new plugin would affect existing connections.

subodh1810 commented 3 years ago

@irynakruk, @yaroslav-hrytsaienko and I had a call about this today, and the outcome was:

  1. We need to continue to support the pgoutput plugin and use it as the default. The main reason is that in order to use wal2json, a user might have to perform extra steps on their Postgres database (it comes as a default plugin in cloud Postgres offerings, but some bare-metal instances might not have it). Secondly, pgoutput works in more general-purpose scenarios. Using pgoutput as the default lets users avoid thinking about which plugin to choose if they are not sure what the choice means. It is also required for backward compatibility, since there are already Postgres connectors running with it.
  2. We need to add as much description as possible in the documentation and the setup wizard to highlight the difference between the two plugins, help users understand them, and explain how to set up each plugin.
sherifnada commented 3 years ago

thanks for the updates everyone!