databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0

Seed is slow with big CSV files with many rows #500

Status: Closed (leo-schick closed this issue 11 months ago)

leo-schick commented 11 months ago

Describe the bug

Using dbt seed with the databricks adapter is quite slow when processing big CSV files with many rows. I have two larger CSV files with more than 100k rows each. They take too long:

| Model | Rows | File Size | Runtime | Runtime in minutes |
| --- | --- | --- | --- | --- |
| CompanyLinkCustomerAccount | 863,146 | 21 MB | 1336.36s | 22.3 minutes |
| MasterCustomerAccountLink | 389,598 | 16 MB | 820.04s | 13.6 minutes |

Steps To Reproduce

  1. Create a CSV file with more than 100k rows (a sketch for generating such a file is shown below)
  2. Add the CSV as a seed
  3. Execute `dbt seed -s <your_file>`
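
A minimal Python sketch for step 1 (not from the original report; the file name and columns are made up for illustration):

```python
# Hypothetical reproducer: write a seed CSV with well over 100k rows.
import csv

with open("seeds/company_link_customer_account.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["old_account_id", "new_account_id"])
    for i in range(150_000):
        writer.writerow([f"OLD-{i:07d}", f"NEW-{i:07d}"])
```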

Expected behavior

I would have expected a runtime of about 1-5 minutes at most.

Screenshots and log output

_(screenshot of the dbt seed log output)_

System information

Running with dbt=1.6.2
Registered adapter: databricks=1.6.4

Databricks Cluster configuration

_(screenshot of the Databricks cluster configuration)_

nrichards17 commented 11 months ago

Hey @leo-schick, I helped resolve this slow seed problem in #493! I believe that fix is now in main but has not yet been released; it will go out as part of the 1.7.0 release. I haven't had a seed that was >100k rows, but I was able to get seeds with 50k records to load in ~1 minute each.

(Duplicate of #476)

benc-db commented 11 months ago

What @nrichards17 said. The one caveat is with parquet as the landed file format, where we still have to do the slow thing. 1.7.0 will be out shortly (hopefully today or tomorrow).

leo-schick commented 11 months ago

The time improved, but I would say it is still not good enough. After upgrading to 1.7.1, I now have a total runtime of around 15 minutes.

The models shown above now have these runtimes:

| Model | Rows | File Size | Runtime | Runtime in minutes |
| --- | --- | --- | --- | --- |
| CompanyLinkCustomerAccount | 863,146 | 21 MB | 911.19s | 15.2 minutes |
| MasterCustomerAccountLink | 389,598 | 16 MB | 556.46s | 9.3 minutes |

This leads to the following comparison:

| Model | Runtime before | Runtime after | Delta | Delta in % |
| --- | --- | --- | --- | --- |
| CompanyLinkCustomerAccount | 22.3 minutes | 15.2 minutes | -7.1 minutes | -31.8 % |
| MasterCustomerAccountLink | 13.6 minutes | 9.3 minutes | -4.3 minutes | -31.6 % |

So, on average, a runtime reduction of about 32% was gained by #493.

Since we are working with big data technology here, I think this is still far from acceptable.

I don't know if this has been tested, but I guess running a Python model with `spark.read.option("delimiter", ",").csv(path)` would perform better. But I guess that in a Python model one cannot use CSV files from the code base.
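
For illustration, here is a minimal PySpark sketch of the kind of load being suggested. The path and table name are placeholders, and whether a dbt Python model can read CSV files checked into the project is exactly the open question above.

```python
from pyspark.sql import SparkSession

# Sketch only: the path and table name are placeholders. In practice the CSV
# would have to live somewhere Spark can read (a Volume or cloud storage),
# not just inside the dbt repository.
spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .option("header", "true")
    .option("delimiter", ",")
    .csv("/Volumes/main/seeds/company_link_customer_account.csv")
)
df.write.mode("overwrite").saveAsTable("main.staging.company_link_customer_account")
```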

nrichards17 commented 11 months ago

@leo-schick I think you're starting to hit the limit of what dbt seeds were designed for; if you're getting into the hundreds of thousands of records, I'd recommend starting to look at a different form of ingestion into your warehouse.

The dbt seed command isn't a very efficient way to ingest large files because it essentially has to construct one big INSERT INTO SQL statement containing all the values from the local CSV.
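
To illustrate why that scales poorly, here is a rough Python sketch of the shape of such a statement; this is not the dbt-databricks implementation, just an illustration.

```python
# Illustration only, not the adapter's actual code: every CSV row becomes a
# literal tuple in a single INSERT statement, so statement size and
# parse/execute time grow with the row count.
import csv

def seed_insert_sql(csv_path: str, table: str) -> str:
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        values = ",\n".join(
            "(" + ", ".join("'" + v.replace("'", "''") + "'" for v in row) + ")"
            for row in reader
        )
    return f"INSERT INTO {table} ({', '.join(header)}) VALUES\n{values}"
```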

If you still need to use seeds, you could take inspiration from The Tuva Project: they host their large files externally in cloud storage (e.g. S3) but still use the dbt seed command, together with a database-specific load__seed macro (customized per database type) that leverages the COPY INTO statement and is called as a post-hook defined in the dbt_project.yml file.
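
The Tuva macro itself is Jinja/SQL, but the underlying idea is a plain COPY INTO from cloud storage. A rough PySpark sketch of that pattern (the bucket path and table name are placeholders, and this is not the Tuva macro itself):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholders: swap in your own storage path and target table.
spark.sql("""
    COPY INTO main.staging.company_link_customer_account
    FROM 's3://my-bucket/seeds/company_link_customer_account/'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
""")
```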

benc-db commented 11 months ago

@leo-schick what @nrichards17 says is correct. Seeds are not intended for tables with hundreds of thousands of rows. I'm not even sure that seeds should be supported in dbt at all (a personal opinion, obviously not shared by dbt-core), since dbt aims to be your transform layer, not your ingest layer. They exist as a convenience for when you have to, for example, manage a small exclusion list that otherwise doesn't exist in your system. From the dbt docs:

Seeds are best suited to static data which changes infrequently.

Good use-cases for seeds:

- A list of mappings of country codes to country names
- A list of test emails to exclude from analysis

Poor use-cases of dbt seeds:

- Loading raw data that has been exported to CSVs
- Any kind of production data containing sensitive information

You may find Auto Loader (https://learn.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/) a more appropriate tool for this case.
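
For reference, a minimal sketch of the Auto Loader pattern the linked docs describe; the storage path, checkpoint path, and target table name are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholders: source directory, checkpoint location, and target table.
source_path = "s3://my-bucket/seeds/company_link_customer_account/"
checkpoint_path = "s3://my-bucket/_checkpoints/company_link_customer_account/"

(
    spark.readStream
    .format("cloudFiles")                               # Auto Loader source
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .option("header", "true")
    .load(source_path)
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)                         # ingest available files, then stop
    .toTable("main.staging.company_link_customer_account")
)
```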

leo-schick commented 11 months ago

In my use case, it is neither of the above: the data is "frozen" static data which will never change, since the old system does not exist anymore. It is something like a "customer account ID mapping between the old and new system". It felt right to me to put this data into the code, since it is not part of any source system and opening a new data silo (e.g. a separate standalone database) felt inconvenient. That brought me to the idea of putting it into dbt seed CSV files in the repository where the data logic lives.