apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

bigquery_to_postgres performance issue #40948

Closed prithvi-git closed 1 month ago

prithvi-git commented 1 month ago

Apache Airflow version

2.9.3

If "Other Airflow 2 version" selected, which one?

No response

What happened?

BigQueryToPostgresOperator performance is very poor. Transferring a 250 MB table with 1.7M records takes almost half an hour. Increasing batch_size to 1000000 made little difference. Extraction from BigQuery appears to be fast; loading into Postgres is the slow part.
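To put the numbers above in perspective, here is a rough throughput estimate. It assumes the full 30 minutes as an upper bound for "almost half an hour" and uses the 1.7M record count quoted above; `throughput` is a hypothetical helper used only for this calculation.

```python
def throughput(records: int, size_mb: float, seconds: float) -> tuple[float, float]:
    """Return (rows per second, MB per second) for a transfer."""
    return records / seconds, size_mb / seconds


# Assumption: "almost half an hour" is treated as 30 minutes.
rows_per_s, mb_per_s = throughput(records=1_700_000, size_mb=250, seconds=30 * 60)
print(f"{rows_per_s:.0f} rows/s, {mb_per_s:.2f} MB/s")
# prints "944 rows/s, 0.14 MB/s"
```

Under these assumptions the effective rate is below 1,000 rows per second.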

[airflow.providers.google.cloud.transfers.bigquery_to_postgres](https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/transfers/bigquery_to_postgres/index.html#module-airflow.providers.google.cloud.transfers.bigquery_to_postgres)

What you think should happen instead?

Performance should be improved.

How to reproduce

Copy a table from BigQuery to Postgres using BigQueryToPostgresOperator. Data size: 250 MB; record count: ~1.5M.

Operating System

Windows

Versions of Apache Airflow Providers

airflow-2.9.1

Deployment

Google Cloud Composer

Deployment details

composer-2.8.6-airflow-2.9.1

Anything else?

No response

Are you willing to submit PR?

Code of Conduct

raphaelauv commented 1 month ago

Airflow is not an ETL tool or a data migration tool.

Every operator that moves data from A through Airflow to B is just a PythonOperator helper: inefficient, and bad data engineering practice.

For performance, use native capabilities (such as BigQuery to/from GCS, or Postgres to/from X with a Postgres extension like https://github.com/paradedb/paradedb/tree/dev/pg_lakehouse).
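On the Postgres side, one common native bulk-load path is COPY rather than row-by-row INSERT. The following is a minimal sketch, not the operator's implementation. It assumes a psycopg2-style cursor exposing `copy_expert`, rows that have already been fetched or exported from BigQuery, and trusted table and column names (they are interpolated into the SQL unquoted). `rows_to_csv_buffer` and `copy_rows` are hypothetical helper names, not part of Airflow or the Google provider.

```python
import csv
import io
from typing import Any, Iterable, Sequence


def rows_to_csv_buffer(rows: Iterable[Sequence[Any]]) -> io.StringIO:
    """Serialize rows to an in-memory CSV file positioned at its start."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        # csv.writer emits None as an empty unquoted field, which COPY in
        # CSV format reads as NULL by default.
        writer.writerow(row)
    buf.seek(0)
    return buf


def copy_rows(cursor: Any, table: str, columns: Sequence[str],
              rows: Iterable[Sequence[Any]]) -> str:
    """Bulk-load rows with a single COPY ... FROM STDIN statement.

    Assumption: `cursor` behaves like a psycopg2 cursor (copy_expert(sql, file)).
    Assumption: `table` and `columns` are trusted identifiers.
    """
    sql = f"COPY {table} ({', '.join(columns)}) FROM STDIN WITH (FORMAT csv)"
    cursor.copy_expert(sql, rows_to_csv_buffer(rows))
    return sql
```

With a real connection this would be called as `copy_rows(conn.cursor(), "my_table", ["id", "name"], rows)` followed by `conn.commit()`, where `my_table` and the column names are placeholders. For large tables the rows would typically be streamed in chunks rather than buffered entirely in memory.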

shahar1 commented 1 month ago

I have to agree with @raphaelauv - Airflow is not suitable for this scale of data transfers, and other technologies should be considered, such as Apache Beam or Apache Spark. Closing this issue as won't fix.