joevandyk closed this issue 6 years ago
Here's what some quick profiler work shows (I only wrapped run_flat
because that was obviously the problem area, and I snipped out all the cruft at the bottom):
```
21752731 function calls (21363515 primitive calls) in 18.556 seconds

Ordered by: cumulative time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.000    0.000   18.557   18.557 /Users/jake/src/fishtown/dbt/dbt/runner.py:219(run_flat)
     1    0.000    0.000   18.557   18.557 /Users/jake/src/fishtown/dbt/dbt/runner.py:176(run_from_graph)
     1    0.000    0.000   17.973   17.973 /Users/jake/src/fishtown/dbt/dbt/runner.py:170(compile)
     1    0.000    0.000   17.973   17.973 /Users/jake/src/fishtown/dbt/dbt/compilation.py:299(compile)
     1    0.000    0.000   17.442   17.442 /Users/jake/src/fishtown/dbt/dbt/loader.py:13(load_all)
     6    0.000    0.000   17.438    2.906 /Users/jake/src/fishtown/dbt/dbt/loader.py:32(load_all)
     2    0.000    0.000   16.529    8.264 /Users/jake/src/fishtown/dbt/dbt/loader.py:180(load_project)
     2    0.000    0.000   16.529    8.264 /Users/jake/src/fishtown/dbt/dbt/parser/seeds.py:44(load_and_parse)
     1    0.000    0.000   16.525   16.525 /Users/jake/src/fishtown/dbt/dbt/parser/seeds.py:17(parse_seed_file)
     1    0.001    0.001   16.525   16.525 /Users/jake/src/fishtown/dbt/dbt/clients/agate_helper.py:40(from_csv)
     1    0.002    0.002   16.524   16.524 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/agate/table/from_csv.py:8(from_csv)
     1    0.054    0.054   16.512   16.512 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/agate/table/__init__.py:79(__init__)
 92032    1.079    0.000   12.326    0.000 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/parsedatetime/__init__.py:1940(nlp)
     1    0.140    0.140   12.057   12.057 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/agate/type_tester.py:73(run)
127145    0.078    0.000   11.839    0.000 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/agate/data_types/base.py:21(test)
122702    1.021    0.000   10.350    0.000 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/parsedatetime/__init__.py:1779(parse)
 61353    0.209    0.000    8.523    0.000 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/agate/data_types/date.py:47(cast)
 30679    0.199    0.000    7.228    0.000 /Users/jake/.pyenv/versions/3.6.5/envs/dbt36/lib/python3.6/site-packages/agate/data_types/date_time.py:53(cast)
```
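For reference, wrapping a single suspect function with the stdlib profiler (roughly what was done here; `run_flat` below is a hypothetical stand-in, not dbt's actual function) looks like:

```python
import cProfile
import io
import pstats

def run_flat():
    # Hypothetical stand-in for the slow function under investigation;
    # any call you want to profile can be dropped in here.
    return sum(i * i for i in range(200000))

profiler = cProfile.Profile()
profiler.enable()
run_flat()
profiler.disable()

# Sort by cumulative time, like the report above, and keep the top rows.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time is what makes the call chain readable: each frame's time includes everything it called, so the hot path falls out of the top rows.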
Looks like most of the time is spent in parsing dates and times and the agate TypeTester. One way to make this faster (~3x speedup on this test data) would be to allow users to specify the types of their csv columns, but that sounds potentially quite difficult and error-prone.
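To see why inference is the expensive part: agate's TypeTester tries each candidate type against every cell until one succeeds, which is why date/datetime casts dominate the profile. A minimal stdlib-only sketch of the difference between inferring and declaring column types (this mimics the idea; it is not agate's actual code, though agate's `Table.from_csv` does accept a `column_types` argument that skips inference):

```python
import csv
import io
from datetime import datetime

# Synthetic CSV with an integer column and a date column.
CSV_DATA = "id,created_at\n" + "\n".join(
    "%d,2018-01-%02d" % (i, (i % 28) + 1) for i in range(1000)
)

def infer_cell(value):
    # Inference (the slow path): try each cast in order until one fits.
    for cast in (int, float, lambda v: datetime.strptime(v, "%Y-%m-%d")):
        try:
            return cast(value)
        except ValueError:
            continue
    return value  # fall back to leaving it as a string

reader = csv.reader(io.StringIO(CSV_DATA))
next(reader)  # skip header
inferred = [[infer_cell(cell) for cell in row] for row in reader]

# Declared types (the fast path): one known cast per column, no trial-and-error.
casts = (int, lambda v: datetime.strptime(v, "%Y-%m-%d"))
reader = csv.reader(io.StringIO(CSV_DATA))
next(reader)
declared = [[cast(cell) for cast, cell in zip(casts, row)] for row in reader]

assert inferred == declared
```

The failed casts in the inference path (every date cell is first tried as an int and a float, and exceptions are raised and caught) are where the trial-and-error cost comes from.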
Would something like this help? https://github.com/closeio/ciso8601/blob/master/README.rst
Also, I think there’s a way already to specify types in dbt seeds? Seems familiar anyways.
Switching out CSV parsers would be a bit involved, as dbt makes pretty heavy use of agate. It's possible we could have some other parser ingest the data and then use the result to call into agate, but it's quite a lot of work - I think we'd gain more by implementing #867. Is there a use case where faster seed parsing is very important and would be better than just not parsing at all?
Also that parser is fast in part because it exclusively supports a subset of valid iso8601/rfc3339 dates, which a lot of user data is not... for example, the sample data you provided :) The general case of parsing user-provided csv files is pretty tough, it's a bit of an under-specified format.
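To illustrate that trade-off with the stdlib alone (ciso8601 not required): a strict ISO-8601 parser such as `datetime.date.fromisoformat` is fast precisely because it rejects everything else, including formats that are very common in user-provided CSVs.

```python
from datetime import date

# The canonical ISO form parses fine:
assert date.fromisoformat("2018-01-02") == date(2018, 1, 2)

# Common spreadsheet-style formats are rejected outright, so a strict
# parser cannot stand in for general-purpose type inference:
for value in ("01/02/2018", "Jan 2, 2018", "2 January 2018"):
    try:
        date.fromisoformat(value)
    except ValueError:
        pass  # expected: not ISO 8601
    else:
        raise AssertionError("unexpectedly parsed %r" % value)
```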
I don't believe there's any way to specify column types for seeds when invoking dbt. Certainly not in the path this goes through, given the stuff I did to test it without the TypeTester.
totally agree @beckjake -- I think the move is to implement #867. Even if we could make this faster, we shouldn't be reading the files at all in this example!
I don't think there's much we can/should do on the type-inference front. Going to close this along with https://github.com/fishtown-analytics/dbt/issues/867 now that https://github.com/fishtown-analytics/dbt/pull/1046 is merged. Seeds may still be slow, but it should be confined to only invocations of `dbt seed`!
Is there any chance of reopening this? I've been testing out seeds and have found that some files can take up to 4 minutes to parse (~45MB), even if the full schema is provided. If not, I would suggest that this section be removed from the docs:
In addition to configuring schemas, the column types of a seed file can also be configured in the dbt_project.yml file. If this configuration is not provided, dbt will try to infer column types from a sample of the seed data.
I haven't dived into the code, but I suspect the inference is always done (or isn't in my version 0.12.1) and then overwritten.
hey @mayansalama - I think you're very right about that -- I can update the docs accordingly.
To clarify: dbt has two different notions of "column types" for a seed file:

1. The Python type of each column (`int`, `float`, `str`, etc). This is inferred by the dataframe library we use, called Agate.
2. The database type of each column (`varchar`, `bigint`, `numeric`, etc). This is generated from the type that was inferred in step 1, and varies by database. This value can be overridden with the `column_types` config.

Generally, loading data into a warehouse is hard work! `dbt seed` does a good job for tiny datasets with uncomplicated schemas. For 45MB of data, you'll probably want to use a dedicated tool that exists to solve exactly this type of problem. Once you do that, you'll be able to use dbt to transform that data as you would with any other dataset!
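The `column_types` override lives in `dbt_project.yml`; a sketch of what it might look like (the project name, seed name, and columns here are placeholders, and exact nesting may vary by dbt version):

```yml
seeds:
  my_project:          # your dbt project name
    retail_calendar:   # the seed file name, without the .csv extension
      column_types:
        id: bigint
        fiscal_date: date
        created_at: timestamp
```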
Hope this helps, and thanks for the docs pointer :)
Makes sense :)
I agree that using dbt for a full ingestion pipeline is underkill, but for my purposes (a demo) it was very convenient! For the actual use case I'm thinking of (CI/CD with a containerised Postgres to test a pipeline), small data sizes fit the bill fine!
Cheers mate
dbt takes about 14 seconds to parse the attached CSV file (only 5k lines long).
This file contains a lot of dates, not sure if that's related.
retail_calendar.txt