codeforkjeff / dbt-sqlite

A SQLite adapter plugin for dbt (data build tool)
Apache License 2.0

Decrease batch size for seeds #35

Closed · tomreitz closed this 1 year ago

tomreitz commented 1 year ago

Thanks for your work on this adapter, it's great.

The batch size used when loading seeds is hard-coded as 100000 in macros/materializations/seed/seed.sql.

I'm not sure how the 100k value was determined, but I ran into the following SQLite error when loading a seed file of 70k rows × 5 columns (2.37 MB; a dictionary of common words in 14 languages):

```
17:46:24  7 of 7 ERROR loading seed file main.synth_words ................................ [ERROR in 2.54s]
...
17:46:24  Database Error in seed synth_words (seeds/synth_words.csv)
17:46:24    too many SQL variables
```

I resolved the error by manually changing the batch size in macros/materializations/seed/seed.sql from 100k down to 10k.

SQLite's documented limits include SQLITE_MAX_VARIABLE_NUMBER, which defaults to 999 in SQLite versions prior to 3.32.0 (2020-05-22) and to 32766 in 3.32.0 and later.
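For what it's worth, that default can be raised at compile time, so the documented values don't necessarily match a given build. A quick way to check the effective limit from Python 3.11+ (which exposes the underlying sqlite3_limit() call):

```python
import sqlite3

# Connection.getlimit() was added in Python 3.11; it wraps the C API
# sqlite3_limit(), which reports the limit actually compiled into this build.
conn = sqlite3.connect(":memory:")
print(conn.getlimit(sqlite3.SQLITE_LIMIT_VARIABLE_NUMBER))
```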

(I'm running SQLite 3.31.1.) The seed insert binds one SQL variable per cell, so the variable count grows as batch size × column count; with the 100k batch size, my 70k-row, 5-column file went in as a single INSERT with 350,000 bound variables.
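Here is a minimal sketch that reproduces the error outside dbt, assuming a single multi-row INSERT like the one the seed materialization issues (the table t and the synthetic rows are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a, b, c, d, e)")

# One multi-row INSERT in the style of a seed load: one "?" placeholder
# per cell, so 70,000 rows x 5 columns binds 350,000 variables at once.
rows = [(i, i, i, i, i) for i in range(70_000)]
placeholders = ", ".join(["(?, ?, ?, ?, ?)"] * len(rows))
flat = [value for row in rows for value in row]

# Raises "sqlite3.OperationalError: too many SQL variables" on any build
# whose SQLITE_MAX_VARIABLE_NUMBER is below 350,000.
conn.execute(f"INSERT INTO t VALUES {placeholders}", flat)
```

Incidentally, a 10k batch of this 5-column file binds about 50,000 variables and still worked on my 3.31.1 build, so the compile-time limit there had evidently been raised above the documented 999 default.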

Sidebar: dbt itself specifies a maximum file size for seeds, but that value is only used to decide whether to hash the file. dbt discourages very large seed files but, as far as I can tell, does not impose a hard limit.

Would it be possible to decrease the batch size from 100k to, say, 10k, and/or make it configurable via a seed_batch_size parameter?

My guess is that the optimal value depends on both the number of rows and the number of columns in the seed file, so it may be difficult to justify any single hard-coded batch size. But something smaller, while issuing more queries against the database, should still be fairly performant. A sketch of a configurable approach follows below.
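For illustration, a configurable version might look something like the following, assuming the adapter dispatches load_csv_rows and can delegate to dbt-core's basic_load_csv_rows helper the way the default seed materialization does; the seed_batch_size var name is made up:

```sql
-- Hypothetical sketch, not the adapter's current code: read the batch
-- size from a project var, defaulting to 10000, and delegate to
-- dbt-core's basic_load_csv_rows helper.
{% macro sqlite__load_csv_rows(model, agate_table) %}
    {% set batch_size = var('seed_batch_size', 10000) %}
    {{ return(basic_load_csv_rows(model, batch_size, agate_table)) }}
{% endmacro %}
```

Users could then set seed_batch_size under vars in dbt_project.yml (or pass it with --vars) without patching the adapter.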

codeforkjeff commented 1 year ago

Bumping down to 10k sounds good, thanks for reporting this. I'll cut a new release containing this fix.