[source-postgres] Slow sync from Postrges to Snowflake

airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

https://airbyte.com

Other

16.19k stars 4.14k forks source link

[source-postgres] Slow sync from Postrges to Snowflake #36967

Open alexnikitchuk opened 7 months ago

alexnikitchuk commented 7 months ago

Connector Name

source-postgres

Connector Version

3.3.23

What step the error happened?

Other

Relevant information

We deploy Airbyte on EC2 m6g.2xlarge which is 8 CPU and 32GB. Connection is limited to using 8 CPUs and 16 Gb memory. The source table size as per Postrges pg_relation_size is just 5Gb, 11M rows. The resulting raw table in Snowflake size is 2.2Gb. Airbyte web shows it as 95 Gb and syncs it in 1h 9m.

Maybe it calculates the size of rows not correctly? Also, looks like adaptive fetch size impacts sync speed significantly because it jumps from 115k rows to just 700 rows, median value is 6k. Below is the graph of how fetch size changes over time during sync

I was monitoring source/destination containers resources: source memory usage was very low < 0.5Gb, destination - ~2Gb

Relevant log output

No response

Contribute

[ ] Yes, I want to contribute

alexnikitchuk commented 7 months ago

Another observation on the Snowflake destination side which affects costs and performance: the average size of the loaded file is just ~13Mb, whereas Snowflake recommends 100-250Mb or larger

evantahler commented 7 months ago

What replication mode are you using with posgres? CDC, xmin, or cursor? Is your data changing as the sync is happening? In your snowflake data (raw tables) do you see duplicate entries for the same primary key?

Can you attach the full logs from your sync?

alexnikitchuk commented 7 months ago

What replication mode are you using with posgres? CDC, xmin, or cursor? Is your data changing as the sync is happening? In your snowflake data (raw tables) do you see duplicate entries for the same primary key?

Can you attach the full logs from your sync?

we use xmin. Pls find the logs attached default_workspace_job_93_attempt_1_txt (3).txt.zip

rowanmoul commented 7 months ago

Is your data changing as the sync is happening?

What issues could this cause?

evantahler commented 6 months ago

A few thoughts:

Are you using CDC? if you are, and you are syncing many tables at once, that could lead to the smaller upload file sizes. There's a fixed number of upload buffers based on the ram the destination has. If the data coming in for many streams exceeds that buffer count, we flush and make a new buffer for the new stream
Does your data have a lot of mostly-null columns? We'll count a few bytes (column-names, etc) for all columns, even if null, in our calculations.

alexnikitchuk commented 6 months ago

I've tried to use CDC instead of xmin and the performance is the same: it loads 11M rows or 5Gb table in 1 hour 6 mins. The connection is syncing a single table. Please find the logs attached default_workspace_logs_125_txt.txt.zip

Yes, the data in the table is pretty sparse.

I've also tried to build a custom image with a fixed and increased value of fetchSize for xmin and it does not speed up the sync. So the issue is something else.

kelvingakuo commented 3 months ago

I've tried to use CDC instead of xmin and the performance is the same: it loads 11M rows or 5Gb table in 1 hour 6 mins. The connection is syncing a single table. Please find the logs attached default_workspace_logs_125_txt.txt.zip

Yes, the data in the table is pretty sparse.

I've also tried to build a custom image with a fixed and increased value of fetchSize for xmin and it does not speed up the sync. So the issue is something else.

We've been moving some massive tables with Airbyte recently (Postgres to Snowflake). I've tried all kinds of configurations. The fastest Airbyte is able to do is 10M rows per hour. And this can drop to as low as 1M rows per hour. It's still unclear why the drop in some of the tables, but there seems to be a correlation with the number of columns in the table.

kelvingakuo commented 2 months ago

I've tried to use CDC instead of xmin and the performance is the same: it loads 11M rows or 5Gb table in 1 hour 6 mins. The connection is syncing a single table. Please find the logs attached default_workspace_logs_125_txt.txt.zip Yes, the data in the table is pretty sparse. I've also tried to build a custom image with a fixed and increased value of fetchSize for xmin and it does not speed up the sync. So the issue is something else.

We've been moving some massive tables with Airbyte recently (Postgres to Snowflake). I've tried all kinds of configurations. The fastest Airbyte is able to do is 10M rows per hour. And this can drop to as low as 1M rows per hour. It's still unclear why the drop in some of the tables, but there seems to be a correlation with the number of columns in the table.

After increasing the worker node(s) to r8g.2xlarge, we've had a sync of 55M rows complete in about 3hrs. So, try over-provisioning than you think necessary.