dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.83k stars 1.62k forks source link

Do some parsing stages in parallel #3164

Closed gshank closed 3 years ago

gshank commented 3 years ago

The overhead of parsing a large dbt project is substantial, and is currently not done in parallel. Currently most of parsing happens by looping through the various "parsers", such as Model, Snapshot, Analysis, DataTest. Hook, Seed, Documentation and Schema. The SchemaParser loops through seven sub-parsers: models, seeds, snapshots, sources, macros, analyses and exposures. This is all single-threaded. In a typical project most of the activity occurs in the Model and Schema parsers. After the loop through the parsers, additional parsing activity takes places in the various "patch" and "process" activities, which need to be after the parsing loop. But the parsing loop activities should be able to be parallelized.

Implement a method to perform parsing of the SQL and Yaml file trees in parallel.

github-actions[bot] commented 3 years ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

ajbosco commented 2 years ago

We've noticed a big slow down in the SchemaParser as we've added more models to our dbt project. It is taking over a minute on each run now. Is this part of the project still single threaded or could it be parallelized for better performance?

jtcohen6 commented 2 years ago

Hey @ajbosco, there are definitely still opportunities here to improve performance. This issue was open from well before we decided to meaningfully reinvest in partial parsing as a way to dramatically speed up the development experience for most users.

Out of curiosity, are you using the latest version of dbt? If you're using v0.19 or newer, you can run dbt parse and see detailed parse timing in target/perf_info.json.

ajbosco commented 2 years ago

Yeah we're on v1.04. This is what our target/perf_info.json looks like:

{
    "path_count": 3609,
    "parsed_path_count": 2834,
    "static_analysis_path_count": 1373,
    "static_analysis_parsed_path_count": 0,
    "is_partial_parse_enabled": true,
    "is_static_analysis_enabled": true,
    "read_files_elapsed": 3.8483023373410106,
    "load_macros_elapsed": 0.0016467268578708172,
    "parse_project_elapsed": 111.70640437630937,
    "patch_sources_elapsed": 0.7615224607288837,
    "process_manifest_elapsed": 0.3456896599382162,
    "load_all_elapsed": 122.93492755573243,
    "projects": [
        {
            "project_name": "our-project",
            "elapsed": 111.69869000883773,
            "parsers": [
                {
                    "parser": "model",
                    "elapsed": 29.303621151018888,
                    "parsed_path_count": 1373
                },
                {
                    "parser": "snapshot",
                    "elapsed": 0.21838706266134977,
                    "parsed_path_count": 12
                },
                {
                    "parser": "test",
                    "elapsed": 0.8616946050897241,
                    "parsed_path_count": 82
                },
                {
                    "parser": "test",
                    "elapsed": 81.31392021710053,
                    "parsed_path_count": 1367
                }
            ],
            "parsed_path_count": 2834
        },
        {
            "project_name": "dbt_snowflake",
            "elapsed": 0.0,
            "parsers": [],
            "parsed_path_count": 0
        },
        {
            "project_name": "dbt",
            "elapsed": 0.0,
            "parsers": [],
            "parsed_path_count": 0
        },
        {
            "project_name": "dbt_artifacts",
            "elapsed": 0.0,
            "parsers": [],
            "parsed_path_count": 0
        },
        {
            "project_name": "dbt_utils",
            "elapsed": 0.0,
            "parsers": [],
            "parsed_path_count": 0
        },
        {
            "project_name": "dbt_expectations",
            "elapsed": 0.0,
            "parsers": [],
            "parsed_path_count": 0
        },
        {
            "project_name": "dbt_date",
            "elapsed": 0.0,
            "parsers": [],
            "parsed_path_count": 0
        }
    ]
}
jtcohen6 commented 2 years ago

@ajbosco Got it, thanks for sharing! Is that true every time you re-run dbt parse? I see a lot of file diffs (2834 files re-parsed out of 3609), and I'm wondering what might cause so many changes to be detected

ajbosco commented 2 years ago

Yeah it is doing that every time. I'm trying to figure out why as well since partial parsing only detects one file changed.

13:55:32.351869 [info ] [MainThread]: Running with dbt=1.0.4
13:55:32.352829 [debug] [MainThread]: running dbt with arguments Namespace(record_timing_info=None, debug=None, log_format=None, write_json=None, use_colors=None, printer_width=None, warn_error=None, version_check=None, partial_parse=None, single_threaded=False, use_experimental_parser=None, static_parser=None, send_anonymous_usage_stats=None, fail_fast=None, event_buffer_size=None, project_dir=None, profile=None, target='dev', vars='{}', log_cache_events=False, write_manifest=False, compile=False, threads=None, defer=None, state=None, cls=<class 'dbt.task.parse.ParseTask'>, which='parse', rpc_method='parse')
13:55:32.353227 [debug] [MainThread]: Tracking: do not track
13:55:32.353741 [info ] [MainThread]: Start parsing.
13:55:32.414970 [info ] [MainThread]: Dependencies loaded
13:55:34.920625 [info ] [MainThread]: ManifestLoader created
13:55:38.300516 [debug] [MainThread]: Partial parsing enabled: 0 files deleted, 0 files added, 1 files changed.
ajbosco commented 2 years ago

Ah ha! Figured it out. We added an env var to one of our bigger source files that looks like this:

schema: '{{ env_var("OUR_SCHEMA_ENV_VAR", "default_value") }}'

If we rely on the default value then the partial parser always thinks the file as changed and considered all the sources deleted.

https://github.com/dbt-labs/dbt-core/blob/220d8b888c44d43fe54539684f397a05dacf0f02/core/dbt/parser/partial.py#L959

jtcohen6 commented 2 years ago

If we rely on the default value

Ah!!! Nice find!

Could I ask you to open a bug report for just that behavior? I bet it wouldn't be too tricky to handle, and it sounds like it could make a huge difference in your parse times.