[CT-3361] Improve Docs Parsing Performance

peterallenwebb commented 11 months ago

We've received a complaint that dbt-core's parsing performance is surprisingly slow for large docs files. On an M1 Mac, files of around 500K can take over a minute to parse, and parse time appears to increase super-linearly with file size. The critically slow step is the call to extract_toplevel_blocks() on the file contents. The extraction of top-level jinja blocks could likely be made much faster, but this is extremely critical code and we need to preserve existing behavior.

This does not appear to be a regression, but current performance is embarrassingly bad.
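One way to check the super-linear growth is to time the parse at a few input sizes and fit the slope on a log-log scale. The sketch below is only an illustration: `time_at_sizes` and `growth_exponent` are hypothetical helper names, and `parse` is whatever callable you want to measure. To measure dbt itself you would pass extract_toplevel_blocks, imported from your installed dbt version (the import path is not given in this thread, so it is left out here).

```python
import math
import time
from typing import Callable, List, Sequence


def time_at_sizes(
    parse: Callable[[str], object],
    unit: str,
    repetitions: Sequence[int],
) -> List[float]:
    """Time `parse` on `unit` repeated n times, for each n in `repetitions`."""
    timings = []
    for n in repetitions:
        text = unit * n
        start = time.perf_counter()
        parse(text)
        timings.append(time.perf_counter() - start)
    return timings


def growth_exponent(sizes: Sequence[float], timings: Sequence[float]) -> float:
    """Least-squares slope of log(time) against log(size).

    A value near 1.0 suggests linear scaling, near 2.0 quadratic.
    All sizes and timings must be strictly positive.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in timings]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator
```

Doubling the repetition count a few times (for example 500, 1000, 2000, 4000) and passing those counts with the measured timings to `growth_exponent` gives a rough exponent. Single timings are noisy, so treat the result as an indication rather than a benchmark.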

To generate a file which reproduces the performance problem, repeat the following snippet a few thousand times in a text file with the .md (markdown) extension, and add it to a dbt project, or call extract_toplevel_blocks() on it directly.

{% docs table_events %}

This table contains clickstream events from the marketing website.

The events in this table are recorded by Snowplow and piped into the warehouse on an hourly basis. The following pages of the marketing site are tracked:
 - /
 - /about
 - /team
 - /contact-us

{% enddocs %}{% docs table_events %}

This table contains clickstream events from the marketing website.

The events in this table are recorded by Snowplow and piped into the warehouse on an hourly basis. The following pages of the marketing site are tracked:
 - /
 - /about
 - /team
 - /contact-us

{% enddocs %}
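For convenience, the repetition step can be scripted. This is a minimal sketch, not part of dbt: the function names and the output file name are made up for illustration, and the repetition count of 2000 is just one reading of "a few thousand".

```python
from pathlib import Path

# The docs block from this issue, copied verbatim.
DOCS_SNIPPET = """{% docs table_events %}

This table contains clickstream events from the marketing website.

The events in this table are recorded by Snowplow and piped into the warehouse on an hourly basis. The following pages of the marketing site are tracked:
 - /
 - /about
 - /team
 - /contact-us

{% enddocs %}
"""


def build_docs_text(repetitions: int) -> str:
    """Return the docs block repeated `repetitions` times."""
    return DOCS_SNIPPET * repetitions


def write_docs_file(path: str, repetitions: int) -> int:
    """Write the repeated block to `path` and return its size in bytes."""
    text = build_docs_text(repetitions)
    Path(path).write_text(text, encoding="utf-8")
    return len(text.encode("utf-8"))
```

Calling `write_docs_file("big_docs.md", 2000)` writes the file to the current directory. From there it can be copied into a dbt project, or its contents passed to extract_toplevel_blocks() directly.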

Impact on other teams

None

Needs backport?

Unsure

fredriv commented 6 months ago

Is there any progress on this issue? Our dbt docs are about 1 MB and a full project parse (dbt parse --no-partial-parse) takes about 2-3 minutes on an M1 Mac.

larssnek commented 4 months ago

@aranke here is the issue mentioned during the dbt meetup today on slow documentation parsing. There is also a closed PR that proposed a fix to this.

Hope you will be able to prioritize this 🙏🤩

fredriv commented 3 weeks ago

Here is a flame graph of doing a full parse of our dbt project (~2300 models). Our documentation markdown file is just shy of 1MB.

As you can see, extract_toplevel_blocks() takes about 75% of the time of dbt parse:

[Image: dbt-full-parse-flamegraph]

If we empty out our Markdown docs file and remove all doc references from our config files, the dbt parse runs about 4x faster.

fredriv commented 3 weeks ago

Have replicated the changes in https://github.com/dbt-labs/dbt-core/pull/9045 in a new PR for dbt-common: https://github.com/dbt-labs/dbt-common/pull/189

This change reduces dbt parse for our dbt project from 2m20s to 41s on my M1 Mac.

dbt-labs / dbt-core

[CT-3361] Improve Docs Parsing Performance #9037
