alexnikitchuk opened this issue 7 months ago
Another observation on the Snowflake destination side, which affects both cost and performance: the average size of a loaded file is just ~13 MB, whereas Snowflake recommends 100-250 MB or larger.
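For reference, one way to confirm loaded file sizes on the Snowflake side is the `COPY_HISTORY` table function; the table name and time window below are placeholders:

```sql
-- Average size of files loaded into the raw table over the last day
-- (AIRBYTE_INTERNAL.MY_RAW_TABLE is a placeholder for your Airbyte raw table)
SELECT COUNT(*)                      AS files_loaded,
       AVG(file_size) / 1024 / 1024 AS avg_file_mb
FROM TABLE(information_schema.copy_history(
       TABLE_NAME => 'AIRBYTE_INTERNAL.MY_RAW_TABLE',
       START_TIME_RANGE_START => DATEADD('day', -1, CURRENT_TIMESTAMP())));
```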
What replication mode are you using with Postgres: CDC, xmin, or cursor? Is your data changing while the sync is happening? In your Snowflake data (raw tables), do you see duplicate entries for the same primary key?
Can you attach the full logs from your sync?
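As a sketch, a duplicate check against the Snowflake raw table could look like the following; the table name assumes Airbyte's Destinations V2 naming convention and `id` stands in for your primary key, both placeholders:

```sql
-- Count rows in the raw table that share the same primary-key value
SELECT _airbyte_data:"id" AS pk,
       COUNT(*)           AS copies
FROM airbyte_internal.public_raw__stream_my_table
GROUP BY 1
HAVING COUNT(*) > 1
ORDER BY copies DESC;
```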
We use xmin. Please find the logs attached: default_workspace_job_93_attempt_1_txt (3).txt.zip
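(As background on what xmin mode does: it filters rows by Postgres's hidden `xmin` transaction-id column. A simplified sketch, not the connector's exact query; `my_table` and the checkpoint value are placeholders:)

```sql
-- Simplified sketch of an xmin-based incremental read; 123456 stands in
-- for the last checkpointed transaction id
SELECT *
FROM my_table
WHERE xmin::text::bigint > 123456;
```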
> Is your data changing as the sync is happening?
What issues could this cause?
A few thoughts:
- I've tried CDC instead of xmin and the performance is the same: it loads an 11M-row, 5 GB table in 1 hour 6 minutes. The connection is syncing a single table. Please find the logs attached: default_workspace_logs_125_txt.txt.zip
- Yes, the data in the table is pretty sparse.
- I've also tried building a custom image with a fixed, increased `fetchSize` for xmin, and it does not speed up the sync. So the issue is something else.
We've been moving some massive tables with Airbyte recently (Postgres to Snowflake). I've tried all kinds of configurations. The fastest Airbyte can do is 10M rows per hour, and that can drop to as low as 1M rows per hour. It's still unclear why some tables are slower, but there seems to be a correlation with the number of columns in the table.
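One way to sanity-check that hypothesis is to compare average row width across the slow and fast tables, since wider rows mean fewer rows per fetched batch. A rough estimate from a ~1% sample (`my_table` is a placeholder):

```sql
-- Estimate average serialized row width; pg_column_size(t.*) measures the whole row
SELECT AVG(pg_column_size(t.*)) AS avg_row_bytes
FROM my_table t TABLESAMPLE SYSTEM (1);
```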
After increasing the worker node(s) to `r8g.2xlarge`, we've had a sync of 55M rows complete in about 3 hours. So, try over-provisioning more than you think necessary.
Connector Name
source-postgres
Connector Version
3.3.23
What step the error happened?
Other
Relevant information
We deploy Airbyte on an EC2 `m6g.2xlarge`, which has 8 vCPUs and 32 GB of memory. The connection is limited to 8 CPUs and 16 GB of memory. The source table size per Postgres `pg_relation_size` is just 5 GB (11M rows). The resulting raw table in Snowflake is 2.2 GB, yet the Airbyte web UI shows it as 95 GB and syncs it in 1h 9m. Maybe it calculates row sizes incorrectly? Also, it looks like the adaptive fetch size impacts sync speed significantly: it jumps from 115k rows down to just 700, with a median of 6k. Below is a graph of how the fetch size changes over time during the sync:

[graph: fetch size over time during the sync]
I was monitoring the source and destination containers' resource usage: source memory usage was very low (< 0.5 GB); the destination used ~2 GB.
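As a side note on the size numbers: `pg_relation_size` only counts the main heap fork, so it can be much smaller than what a sync actually serializes. Comparing it against the total size (TOAST plus indexes) is a quick check (`my_table` is a placeholder):

```sql
-- pg_relation_size = heap only; pg_total_relation_size adds TOAST and indexes
SELECT pg_size_pretty(pg_relation_size('my_table'))       AS heap_only,
       pg_size_pretty(pg_total_relation_size('my_table')) AS total_on_disk;
```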
Relevant log output
No response
Contribute