duckdb / dbt-duckdb

dbt (http://getdbt.com) adapter for DuckDB (http://duckdb.org)
Apache License 2.0

Seed materialization is significantly slower than using `select * from 'filename.csv'`. #158

Closed: dwreeves closed this issue 1 year ago

dwreeves commented 1 year ago

The issue is that the default dbt seed materialization loads the entire CSV into memory in the Python runtime and then inserts it row by row. DuckDB's built-in CSV reader is significantly faster than this approach.

The appropriate solution here is to override the seed materialization. The two tricky parts may be (1) skipping not just the inserts but also the CSV loading in Python entirely, and (2) preserving backwards compatibility.

jwills commented 1 year ago

As far as I can tell, avoiding loading the CSV as an agate table isn't currently possible, but I added a config option called `fast` that seeds can use to avoid the INSERTs, and it should be backwards compatible with the current system. I'll leave it here for testing purposes for a bit and make it the default in version 1.6.

dwreeves commented 1 year ago

Very cool! Great job on this.