OHDSI / dbt-synthea

[Under development] A dbt ETL project to convert a Synthea synthetic data set into the OMOP CDM
https://ohdsi.github.io/dbt-synthea/
Apache License 2.0

feat: Add workflow to generate DBT documentation and deploy to GitHub Pages #73

Closed · lawrenceadams closed this 1 month ago

lawrenceadams commented 1 month ago

This PR adds automatic dbt docs generation for the dbt-synthea project.

Why bother?

The docs are a nice place to start for those new to dbt and/or OMOP who want to explore the models, the DAG, etc. (IMHO - feel free to disagree!), and they make it easier for us to point people to our project!

This approach requires little to no effort from maintainers once set up.

How does it work?
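
A minimal sketch of a workflow along these lines, assuming GitHub Pages is set to deploy from Actions; the file path, tool versions, and profile setup are illustrative assumptions, not the actual contents of this PR:

```yaml
# .github/workflows/docs.yml (hypothetical path)
name: Deploy dbt docs to GitHub Pages

on:
  push:
    branches: [main]

permissions:
  contents: read
  pages: write
  id-token: write

jobs:
  docs:
    runs-on: ubuntu-latest
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # Assumes a profiles.yml in the repo targeting a local DuckDB file
      - run: pip install dbt-core dbt-duckdb
      - run: dbt deps
      # Writes index.html, manifest.json, and catalog.json to target/
      - run: dbt docs generate
      - uses: actions/upload-pages-artifact@v3
        with:
          path: target
      - id: deployment
        uses: actions/deploy-pages@v4
```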

Next steps

katy-sadowski commented 1 month ago

I do agree with adding this! Thanks for the initiative! I think I enabled Pages:

[Screenshot: GitHub Pages settings, 2024-10-04]

If this looks right, go ahead and merge to test it out!

katy-sadowski commented 1 month ago

It works!!! https://ohdsi.github.io/dbt-synthea

For some reason, the data types are not showing up in the GH Pages site, though they do show up when I build it locally. Quick Google says this might be a fix? https://github.com/dbt-labs/dbt-spark/issues/988

lawrenceadams commented 1 month ago

Excellent!! 🥳

> For some reason, the data types are not showing up in the GH Pages site, though they do show up when I build it locally. Quick Google says this might be a fix? dbt-labs/dbt-spark#988

Interesting... I can reproduce the same thing. The documentation on the docs feature is ironically thin on how you should do this.

I can confirm that if you declare the type explicitly then the docs work. But the reason we don't have this issue locally is that if you have run `dbt run` / `dbt build` on your project and have your warehouse available, dbt will grab the data type metadata from the database. I.e. if we update the workflow to do a `dbt build` first, it will work fine.
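
If declaring the type means dbt's `data_type` column property, a sketch of what that looks like in a model's schema YAML (model and column names are illustrative):

```yaml
# e.g. in a models/.../_models.yml schema file
models:
  - name: person
    columns:
      - name: person_id
        data_type: integer  # surfaces in dbt docs even without a warehouse-backed catalog
        description: Unique identifier for each person.
```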


Going down the rabbit hole, I am torn as to what to do.

Interestingly, this should be handled by the adapter (unless this is for something else...):

https://github.com/duckdb/dbt-duckdb/blob/41794e3edc2dfb4237bee0167e4df07f581e5ba6/dbt/adapters/duckdb/column.py#L18-L24

But it does not seem to work!

What do you think @katy-sadowski? I think doing the `dbt run` step before building the docs would be a way around this, and it would allow us to enforce schema checking in the future - but just declaring the type explicitly is another valid solution.
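
Concretely, the workaround is just an extra step before generating the docs, e.g. (a sketch, assuming a docs job like the one outlined above):

```yaml
      # Materialize the models in DuckDB first so dbt docs generate
      # can read real column types back from the warehouse
      - run: dbt build
      - run: dbt docs generate
```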

lawrenceadams commented 1 month ago

I've gone too deep into the weeds, but essentially, if we want contract enforcement to work then the dbt-duckdb adapter needs to be updated to override the DuckDBColumn class here with an updated TYPE_LABELS dict that includes a 'BIGINT': 'INTEGER' mapping.

I have tested it as below and can confirm it builds and the contract enforcer code is happy when this is present:

[Screenshot: the tested TYPE_LABELS change]

I do think it would be cool to have contractually bound type checking using the CDM docs as the source, but this would require upstream dbt-duckdb changes; BIGINT seems to be treated as a distinctly different type from INT (although I am not sure why, since the contract feature allows type aliasing to be enabled or disabled at will).
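
For reference, a sketch of what contract-bound type checking looks like in model YAML (illustrative names); this is the setup that currently trips over DuckDB reporting BIGINT where the contract declares INTEGER:

```yaml
models:
  - name: person
    config:
      contract:
        enforced: true  # dbt compares each declared data_type against the warehouse type
    columns:
      - name: person_id
        data_type: integer  # DuckDB returns BIGINT; fails unless BIGINT is aliased to INTEGER
```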


@katy-sadowski I will pull this out into a new issue and we can discuss there! 🧠