dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

Slow parsing of seed file #874

Closed: joevandyk closed this issue 6 years ago

joevandyk commented 6 years ago

dbt takes about 14 seconds to parse the attached CSV file (only 5k lines long).

2018-07-23 16:30:14,064: Parsing /Users/joe/projects/crowdcow_dbt/data/retail_calendar.csv
2018-07-23 16:30:28,241: Parsing seed.crowdcow_dbt.retail_calendar

This file contains a lot of dates; not sure if that's related.

retail_calendar.txt

drewbanin commented 6 years ago

see also https://github.com/fishtown-analytics/dbt/issues/867

beckjake commented 6 years ago

Here's what some quick profiler work shows (I only wrapped run_flat because that was obviously the problem area, and I snipped out all the cruft at the bottom):

         21752731 function calls (21363515 primitive calls) in 18.556 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   18.557   18.557 /Users/jake/src/fishtown/dbt/dbt/runner.py:219(run_flat)
        1    0.000    0.000   18.557   18.557 /Users/jake/src/fishtown/dbt/dbt/runner.py:176(run_from_graph)
        1    0.000    0.000   17.973   17.973 /Users/jake/src/fishtown/dbt/dbt/runner.py:170(compile)
        1    0.000    0.000   17.973   17.973 /Users/jake/src/fishtown/dbt/dbt/compilation.py:299(compile)
        1    0.000    0.000   17.442   17.442 /Users/jake/src/fishtown/dbt/dbt/loader.py:13(load_all)
        6    0.000    0.000   17.438    2.906 /Users/jake/src/fishtown/dbt/dbt/loader.py:32(load_all)
        2    0.000    0.000   16.529    8.264 /Users/jake/src/fishtown/dbt/dbt/loader.py:180(load_project)
        2    0.000    0.000   16.529    8.264 /Users/jake/src/fishtown/dbt/dbt/parser/seeds.py:44(load_and_parse)
        1    0.000    0.000   16.525   16.525 /Users/jake/src/fishtown/dbt/dbt/parser/seeds.py:17(parse_seed_file)
        1    0.001    0.001   16.525   16.525 /Users/jake/src/fishtown/dbt/dbt/clients/agate_helper.py:40(from_csv)
        1    0.002    0.002   16.524   16.524 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/agate/table/from_csv.py:8(from_csv)
        1    0.054    0.054   16.512   16.512 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/agate/table/__init__.py:79(__init__)
    92032    1.079    0.000   12.326    0.000 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/parsedatetime/__init__.py:1940(nlp)
        1    0.140    0.140   12.057   12.057 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/agate/type_tester.py:73(run)
   127145    0.078    0.000   11.839    0.000 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/agate/data_types/base.py:21(test)
   122702    1.021    0.000   10.350    0.000 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/parsedatetime/__init__.py:1779(parse)
    61353    0.209    0.000    8.523    0.000 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/agate/data_types/date.py:47(cast)
    30679    0.199    0.000    7.228    0.000 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/agate/data_types/date_time.py:53(cast)
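
For reference, here's a minimal sketch of how a profile like this can be reproduced against agate directly, outside of dbt (the file path is taken from the original report):

    import cProfile
    import pstats

    import agate

    # Profile agate's CSV load in isolation: Table.from_csv runs the
    # TypeTester (and, for date-like columns, parsedatetime) on every cell.
    profiler = cProfile.Profile()
    profiler.enable()
    table = agate.Table.from_csv('data/retail_calendar.csv')
    profiler.disable()

    pstats.Stats(profiler).sort_stats('cumulative').print_stats(15)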

Looks like most of the time is spent parsing dates and times inside agate's TypeTester. One way to make this faster (~3x speedup on this test data) would be to allow users to specify the types of their CSV columns, but that sounds potentially quite difficult and error-prone.
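
For illustration, a rough sketch of what that could look like using agate's own API (the column names here are made up, not taken from the attached file):

    import agate

    # Hypothetical: pin the types of known columns up front via TypeTester's
    # `force` mapping, so inference (and parsedatetime) never runs for them.
    forced = {
        'week_start_date': agate.Date(),   # illustrative column name
        'fiscal_year': agate.Number(),     # illustrative column name
    }
    tester = agate.TypeTester(force=forced)
    table = agate.Table.from_csv('data/retail_calendar.csv', column_types=tester)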

joevandyk commented 6 years ago

Would something like this help? https://github.com/closeio/ciso8601/blob/master/README.rst

joevandyk commented 6 years ago

Also, I think there's already a way to specify types in dbt seeds? Seems familiar, anyway.

beckjake commented 6 years ago

Switching out CSV parsers would be a bit involved, as dbt makes pretty heavy use of agate. It's possible we could have some other parser ingest the file and then use that to call into agate, but it's quite a lot of work - I think we'd gain more by implementing #867. Is there a use case where faster seed parsing is very important and would be better than just not parsing at all?

Also, that parser is fast in part because it exclusively supports a subset of valid ISO 8601/RFC 3339 dates, which a lot of user data is not... for example, the sample data you provided :) The general case of parsing user-provided CSV files is pretty tough; it's a bit of an under-specified format.
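
To illustrate the trade-off, a small sketch (the exact failure behavior varies by ciso8601 version):

    import ciso8601

    # Fast, but strict: only ISO 8601-shaped timestamps are accepted.
    ciso8601.parse_datetime('2018-07-23T16:30:14')  # parses fine

    # Common "human" CSV date formats are rejected; depending on the
    # ciso8601 version this raises ValueError or returns None.
    try:
        result = ciso8601.parse_datetime('07/23/2018')
    except ValueError:
        result = None
    print(result)  # not ISO 8601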

I don't believe there's any way to specify column types for seeds when invoking dbt. Certainly not in the code path this goes through, given what I had to do to test it without the TypeTester.

drewbanin commented 6 years ago

totally agree @beckjake -- I think the move is to implement #867. Even if we could make this faster, we shouldn't be reading the files at all in this example!

drewbanin commented 6 years ago

I don't think there's much we can or should do on the type-inference front. Going to close this along with https://github.com/fishtown-analytics/dbt/issues/867 now that https://github.com/fishtown-analytics/dbt/pull/1046 is merged. Seeds may still be slow, but the slowness should be confined to invocations of dbt seed!

mayansalama commented 5 years ago

Is there any chance of reopening this? I've been testing out seeds and have found that some files can take up to 4 minutes to parse (~45 MB), even when the full schema is provided. If not, I would suggest that this section be removed from the docs:

In addition to configuring schemas, the column types of a seed file can also be configured in the dbt_project.yml file. If this configuration is not provided, dbt will try to infer column types from a sample of the seed data.

I haven't dug into the code, but I suspect the inference is always run and then overwritten (or the override isn't applied in my version, 0.12.1).

drewbanin commented 5 years ago

hey @mayansalama - I think you're very right about that -- I can update the docs accordingly.

To clarify: dbt has two different notions of "column types" for a seed file:

  1. a Python data type (like int, float, str, etc.). This is inferred by the dataframe library we use, called Agate.
  2. a SQL data type (like varchar, bigint, numeric, etc.). This is generated from the type inferred in step 1 and varies by database. This value can be overridden with the column_types config.
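
For reference, the column_types override in point 2 looks roughly like this in dbt_project.yml (the project, seed, and column names here are placeholders):

    # dbt_project.yml
    seeds:
      my_project:
        retail_calendar:
          column_types:
            week_start_date: date
            fiscal_year: bigint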

Generally, loading data into a warehouse is hard work! dbt seed does a good job for tiny datasets with uncomplicated schemas. For 45 MB of data, you'll probably want to use a dedicated tool built to solve exactly this type of problem. Once you've loaded the data that way, you'll be able to use dbt to transform it as you would any other dataset!

Hope this helps, and thanks for the docs pointer :)

mayansalama commented 5 years ago

Makes sense :)

I agree that dbt isn't the right tool for a full ingestion pipeline; however, for my purposes (a demo) it was very convenient! For the actual use case I have in mind (CI/CD with a containerized Postgres to test a pipeline), small data sizes fit the bill fine!

Cheers mate