ccao-data / data-architecture

Codebase for CCAO data infrastructure construction and management
https://ccao-data.github.io/data-architecture/
5 stars 3 forks source link

Reconfigure sqlfluff linter and pre-commit hooks #456

Open dfsnow opened 1 month ago

dfsnow commented 1 month ago

It's been awhile since we've touched the SQLFluff/sqlfmt configuration we use for CI/CD. SQLFluff has been updated a lot in the last year or so. We should take a quick look at our configuration and make sure it's still doing everything we want/expect.

Specifically, I don't think the SQLFluff linting in the macros/ or tests/ directories is working quite right. It doesn't lint locally for me within macro blocks. Perhaps it's time for us to retry the dbt templater.

dfsnow commented 1 month ago

Note that SQLFluff just did a compatibility release for dbt 1.8: https://github.com/sqlfluff/sqlfluff/releases/tag/3.0.7

jeancochrane commented 1 month ago

I did a bunch of research on this today, including test driving the dbt templater and trying out a few different sets of configs, and unfortunately my conclusion is that sqlfluff does not support linting macro blocks in isolation. Since it's not supported using either the dbt templater or the jinja templater, and since the dbt templater is slightly slower, I think we should stick to the jinja templater for now.

It's not really documented anywhere, but it appears that the sqlfluff policy is to not lint macros because they are only fragments and hence may not be valid SQL. Macros do not get linted in isolation, but instead are linted when they are called by a non-macro SQL statement (like in a model), at which point the templater uses the macro to generate a full SQL statement and then validate it. This policy is particularly apparent when using the dbt templater, which explicitly skips macro files -- here's a sample log line when running sqlfluff lint with the dbt templater enabled:

WARNING    Skipped file /home/jecochr/data-architecture/dbt/macros/log_results.sql because it is a macro

This SO answer suggests using the dbt templater for models and the jinja templater for macros, but that doesn't work because the jinja templater will skip anything inside a {% macro %} block. (It will lint anything in a macro file but outside a {% macro %} block, however, so I do agree that it is probably a better choice for macro files, even though it doesn't really resolve the most important issue with the dbt templater.)

I couldn't find any discussion on this particular issue, but I did find some discussion and planned work around the related issue of linting unexecuted portions of jinja templates. It's possible this may eventually lead to support for linting macros in isolation, but it's hard to tell based on the current discussion, which is very weedsy. The fact that there's no discussion outside of the SO answer I linked leaves me less than confident that I've exhausted all options, however, so it's possible I've just missed something.

Note that I timed the dbt templater vs the jinja templater locally, and found that the docs are correct in that the dbt templater is slower. On my machine, linting the whole data-architecture repo takes about 28s with the jinja templater and 34s with the dbt templater. This is maybe a small enough difference that we're willing to accept it, but I don't think we should make the switch until we find a feature that we need in the dbt templater that isn't supported in jinja.

I also combed through the sqlfluff releases between 2.1.1 and 3.0.7 and didn't find anything particularly relevant to us. Nevertheless, I put up https://github.com/ccao-data/data-architecture/pull/478 to upgrade to the most recent version to make sure we stay up to date.

jeancochrane commented 1 month ago

Reopening in case Dan would like to continue pursuing this line of inquiry.

dfsnow commented 4 weeks ago

Thanks @jeancochrane, this looks great to me and aligns with my super preliminary poking around last week. Let's keep this open as an issue, as it does feel like something that could eventually be resolved by one of the upstream projects.