Implement and configure SQLFluff in the repo

vvcb commented 8 months ago

The purpose of this task is to enforce consistent SQL syntax throughout the repo, amongst other benefits. See https://sqlfluff.com/

vvcb commented 8 months ago

@katy-sadowski , picking this task up from the Teams conversation. Happy to do this but should we wait until we have decided on the project directory structure and the build up vs break down strategy that we are finalising. Else we may end up with multiple unmerged PRs that will result in conflicts. Sqlfluff is already part of my mega PR :wink: but happy to make a separate PR once we have settled on approach. Shouldn't take very long to do.

katy-sadowski commented 8 months ago

Thanks @vvcb ! Regarding directory structure - what are your thoughts on what I've put here (plus an Intermediate layer)? Any additional decisions to make on directories besides this?

Regarding build up vs break down - can you please elaborate on what you mean by that? Do you mean stuff like precommit hooks etc?

We can discuss it live next time we meet but I'm also happy to continue the convo here in the meantime :)

vvcb commented 8 months ago

@katy-sadowski , agree with moving the entire project to the root directory like you have done in your PR (and how it is in mine as well).

Regarding sub folders within the models folder, there are many ways to 'skin a cat' (never understood why there is one way let alone many but not sure where this phrase has come from).

In my team, we started with source, staging, vocab and omop folders and then added a few more as the need arose. Within those folders, the models have a consistent naming convention.

https://omop-lsc.surge.sh/rxn_dev/index.html#!/overview/dbt_omop

All this can get very complicated very quickly if we are not careful. So, starting simple (and starting itself) is a good thing and we can revisit frequently early and less frequently later.

In life and tech, I find PEP 20 is a good anchor.

By break down, I meant my approach of starting with the fully built lineage. Build up is the first principles approach you described. The fully built lineage presents many surfaces for people with varying skills to play on - breaking down into component models, making the code dialect agnostic one model at a time, documentation, tagging, adding macros, etc. Building up may be a lot more challenging to work together as a team - but this may simply be a function of my ignorance of how or where we would start.

katy-sadowski commented 8 months ago

Thanks for the additional context! For directory structure, I think let's stick with the dbt standard for now since we don't have a strong reason (yet) to deviate. Looking to the future, I think this will also be the best starting point for a more generic ETL project.

Regarding break down / build up - I am also starting to think it'll be easier to having the "skeleton" of the full ETL to work with vs starting with a blank slate. Either way though I do think we should probably have targeted PRs for the bits of the "mega" PR we want to incorporate. I'm not very familiar with SQLFluff, but my thinking in adding it in now was that we could ensure all SQL committed from the very beginning is linted and consistent. Do you agree with that?

And then I'm thinking:

create a convention-compliant base model for each Synthea table
add in the intermediate and final model queries from your PR
- convert to PostgreSQL
- migrate to standard dbt folder structure
- point at new base models
gradually convert that "skeleton" ETL into our desired format & flow

Thoughts?

OHDSI / dbt-synthea

Implement and configure SQLFluff in the repo #7