fal-ai / dbt-fal

do more with dbt. dbt-fal helps you run Python alongside dbt, so you can send Slack alerts, detect anomalies and build machine learning models.
https://fal.ai/dbt-fal
Apache License 2.0
853 stars 72 forks source link

Improve speed of inserts on PostgreSQL #766

Closed kinghuang closed 1 year ago

kinghuang commented 1 year ago

Description

dbt-fal currently uses a generic pandas.DataFrame.to_sql call when using adapters others than snowflake, bigquery, and duckdb. This results in an INSERT statement per row when writing dataframes to PostgreSQL, which is excruciatingly slow.

The pandas.DataFrame.to_sql can take an optional method argument that controls the SQL insertion clause. In particular, the Pandas user guide provides an example for PostgreSQL that uses COPY FROM to efficiently insert rows.

Here are some numbers from a Python model that I'm working on. The model reads and writes 4.1 million rows with two text columns. With dbt-fal as is, the model takes 1038 seconds total, of which about 660.9 seconds are spent writing data to PostgreSQL.

03:27:18  Running with dbt=1.4.1
03:27:18  Found 49 models, 0 tests, 0 snapshots, 0 analyses, 300 macros, 0 operations, 0 seed files, 30 sources, 2 exposures, 0 metrics
03:27:18  
03:27:23  Concurrency: 4 threads (target='dev-fal')
03:27:23  
03:27:23  1 of 1 START python table model dbt_king_stg_dibi_env.foundations_wells_names_cleaned  [RUN]
dur=660.9213428497314
03:44:41  1 of 1 OK created python table model dbt_king_stg_dibi_env.foundations_wells_names_cleaned  [OK in 1038.03s]
03:44:42  
03:44:42  Finished running 1 table model in 0 hours 17 minutes and 23.50 seconds (1043.50s).
03:44:42  
03:44:42  Completed successfully
03:44:42  
03:44:42  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

With the custom method, insertion time drops to just 35.2 seconds, over 18× faster.

03:19:52  Running with dbt=1.4.1
03:19:52  Found 49 models, 0 tests, 0 snapshots, 0 analyses, 300 macros, 0 operations, 0 seed files, 30 sources, 2 exposures, 0 metrics
03:19:52  
03:19:56  Concurrency: 4 threads (target='dev-fal')
03:19:56  
03:19:56  1 of 1 START python table model dbt_king_stg_dibi_env.foundations_wells_names_cleaned  [RUN]
dur=35.29836106300354
03:26:54  1 of 1 OK created python table model dbt_king_stg_dibi_env.foundations_wells_names_cleaned  [OK in 417.89s]
03:26:55  
03:26:55  Finished running 1 table model in 0 hours 7 minutes and 3.08 seconds (423.08s).
03:26:55  
03:26:55  Completed successfully
03:26:55  
03:26:55  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

The dur numbers come from temporary wall time calculations in the code.

start = time()
rows_affected = dataframe.to_sql(**to_sql_args)
end = time()
dur = (end - start)
print(f"{dur=}")

Integration tests

Adapter to test:

Python version to test:

chamini2 commented 1 year ago

Hey, @kinghuang.

Just released https://pypi.org/project/dbt-fal/1.4.2/, can you test this and confirm it looks good?

kinghuang commented 1 year ago

Just released https://pypi.org/project/dbt-fal/1.4.2/, can you test this and confirm it looks good?

Looks good! I did a pip update install to dbt-fal==1.4.2 and re-ran the same model. It completed without errors in 411 seconds

18:13:36  1 of 1 START python table model dbt_king_stg_dibi_env.foundations_wells_names_cleaned  [RUN]
18:20:27  1 of 1 OK created python table model dbt_king_stg_dibi_env.foundations_wells_names_cleaned  [OK in 410.56s]

Might be worth mentioning the change in the release notes.

chamini2 commented 1 year ago

Thank you! I will probably add this to our not-so-regular changelog blog.