databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0

Seed is slow with big CSV files with many rows #500

Status: Closed (leo-schick closed this issue 11 months ago)

leo-schick commented 11 months ago

Describe the bug

Using dbt seed with the databricks adapter is quite slow when processing big CSV files with many rows. I have two larger CSV files with more than 100k rows each. They take too long:

| Model | Rows | File Size | Runtime | Runtime in minutes |
| --- | --- | --- | --- | --- |
| CompanyLinkCustomerAccount | 863,146 | 21 MB | 1336.36s | 22.3 minutes |
| MasterCustomerAccountLink | 389,598 | 16 MB | 820.04s | 13.6 minutes |

Steps To Reproduce

  1. Create a CSV file with more than 100k rows (a sketch for generating such a file is shown below)
  2. Add the CSV as a seed
  3. Execute `dbt seed -s <your_file>`
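
A minimal Python sketch for step 1 (not from the original report; the file name and columns are made up for illustration):

```python
# Hypothetical reproducer: write a seed CSV with well over 100k rows.
import csv

with open("seeds/company_link_customer_account.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["old_account_id", "new_account_id"])
    for i in range(150_000):
        writer.writerow([f"OLD-{i:07d}", f"NEW-{i:07d}"])
```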

Expected behavior

I would have expected a runtime of about 1-5 minutes at most.

Screenshots and log output

_(screenshot of the dbt seed log output)_

System information

Running with dbt=1.6.2
Registered adapter: databricks=1.6.4

Databricks Cluster configuration

_(screenshot of the Databricks cluster configuration)_

nrichards17 commented 11 months ago

Hey @leo-schick, I helped resolve this slow seed problem in #493! I believe that fix is now in main but has not yet been released; it will go out as part of the 1.7.0 release. I haven't had a seed that was >100k rows, but I was able to get seeds with 50k records to load in ~1 minute each.

(Duplicate of #476)

benc-db commented 11 months ago

What @nrichards17 said. The one caveat is with parquet as the landed file format, where we still have to do the slow thing. 1.7.0 will be out shortly (hopefully today or tomorrow).

leo-schick commented 11 months ago

The time improved, but I would say it is still not good enough. After upgrading to 1.7.1, I now have a total runtime of around 15 minutes.

The models shown above now have these runtimes:

| Model | Rows | File Size | Runtime | Runtime in minutes |
| --- | --- | --- | --- | --- |
| CompanyLinkCustomerAccount | 863,146 | 21 MB | 911.19s | 15.2 minutes |
| MasterCustomerAccountLink | 389,598 | 16 MB | 556.46s | 9.3 minutes |

This leads to the following comparison:

| Model | Runtime before | Runtime after | Delta | Delta in % |
| --- | --- | --- | --- | --- |
| CompanyLinkCustomerAccount | 22.3 minutes | 15.2 minutes | -7.1 minutes | -31.8 % |
| MasterCustomerAccountLink | 13.6 minutes | 9.3 minutes | -4.3 minutes | -31.6 % |

So, on average, a runtime reduction of about 32% was gained by #493.

Since we are working with big data technology here, I think this is still far from acceptable.

I don't know if this has been tested, but I guess running a Python model with `spark.read.option("delimiter", ",").csv(path)` would perform better. But I guess that in a Python model one cannot use CSV files from the code base.
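
For illustration, here is a minimal PySpark sketch of the kind of load being suggested. The path and table name are placeholders, and whether a dbt Python model can read CSV files checked into the project is exactly the open question above.

```python
from pyspark.sql import SparkSession

# Sketch only: the path and table name are placeholders. In practice the CSV
# would have to live somewhere Spark can read (a Volume or cloud storage),
# not just inside the dbt repository.
spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .option("header", "true")
    .option("delimiter", ",")
    .csv("/Volumes/main/seeds/company_link_customer_account.csv")
)
df.write.mode("overwrite").saveAsTable("main.staging.company_link_customer_account")
```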

nrichards17 commented 11 months ago

@leo-schick I think you're starting to hit the limit of what dbt seeds were designed for; if you're getting into the hundreds of thousands of records, I'd recommend starting to look at a different form of ingestion into your warehouse.

The dbt seed command isn't a very efficient way to ingest large files because it essentially has to construct one big INSERT INTO SQL statement containing all the values from the local CSV.
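
To illustrate why that scales poorly, here is a rough Python sketch of the shape of such a statement; this is not the dbt-databricks implementation, just an illustration.

```python
# Illustration only, not the adapter's actual code: every CSV row becomes a
# literal tuple in a single INSERT statement, so statement size and
# parse/execute time grow with the row count.
import csv

def seed_insert_sql(csv_path: str, table: str) -> str:
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        values = ",\n".join(
            "(" + ", ".join("'" + v.replace("'", "''") + "'" for v in row) + ")"
            for row in reader
        )
    return f"INSERT INTO {table} ({', '.join(header)}) VALUES\n{values}"
```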

If you still need to use seeds, you could take inspiration from The Tuva Project: they host their large files externally in cloud storage (e.g. S3) but still use the dbt seed command, together with a database-specific load__seed macro (customized per database type) that leverages the COPY INTO statement and is called as a post-hook defined in the dbt_project.yml file.
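
The Tuva macro itself is Jinja/SQL, but the underlying idea is a plain COPY INTO from cloud storage. A rough PySpark sketch of that pattern (the bucket path and table name are placeholders, and this is not the Tuva macro itself):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholders: swap in your own storage path and target table.
spark.sql("""
    COPY INTO main.staging.company_link_customer_account
    FROM 's3://my-bucket/seeds/company_link_customer_account/'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
""")
```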

benc-db commented 11 months ago

@leo-schick what @nrichards17 says is correct. Seeds are not intended for tables with hundreds of thousands of rows. I'm not even sure that seeds should be supported in dbt at all (a personal opinion, obviously not shared by dbt-core), since dbt aims to be your transform layer, not your ingest layer. They exist as a convenience for when you have to, for example, manage a small exclusion list that otherwise doesn't exist in your system. From the dbt docs:

Seeds are best suited to static data which changes infrequently.

Good use-cases for seeds:

- A list of mappings of country codes to country names
- A list of test emails to exclude from analysis

Poor use-cases of dbt seeds:

- Loading raw data that has been exported to CSVs
- Any kind of production data containing sensitive information

You may find Auto Loader (https://learn.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/) a more appropriate tool for this case.
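
For reference, a minimal sketch of the Auto Loader pattern the linked docs describe; the storage path, checkpoint path, and target table name are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholders: source directory, checkpoint location, and target table.
source_path = "s3://my-bucket/seeds/company_link_customer_account/"
checkpoint_path = "s3://my-bucket/_checkpoints/company_link_customer_account/"

(
    spark.readStream
    .format("cloudFiles")                               # Auto Loader source
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .option("header", "true")
    .load(source_path)
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)                         # ingest available files, then stop
    .toTable("main.staging.company_link_customer_account")
)
```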

leo-schick commented 11 months ago

In my use case, it is neither of the above: the data is "frozen" static data which will never change, since the old system does not exist anymore. It is something like a "customer account ID mapping between the old and new system". It felt right to me to put this data into the code, since it is not part of any source system and opening a new data silo (e.g. a separate standalone database) felt inconvenient. That brought me to the idea of putting it into dbt seed CSV files in the repository where the data logic lives.