`dbt docs` does not generate column types when used in workflows

lawrenceadams commented 1 month ago

We have added schema and type descriptions (yml) of the final OMOP tables from #61. This works as desired and is compliant with the dbt schema. When building the docs locally (usually after the project has been run) there are no issues, and the docs show the column types. However, at present the GitHub workflow only runs dbt docs without building the models, so types are not present in the documentation.

The official documentation only mentions use of the data_type attribute. However, this attribute is for enforcing type checking when the dbt contract feature is enabled:
- This is arguably beyond the scope of what this project needs, but I think there is an argument to be made that it would be a cool feature to have, to enforce that our output matches the OMOP CDM definitions.
- Unfortunately, by default this won't work as there are a few issues; for example:
  - row_number() returns a BIGINT
  - https://github.com/OHDSI/dbt-synthea/blob/ae791145d50c9e0693880ff9ed37d60f7cc0195d/models/omop/care_site.sql#L2
  - This causes type enforcement to fail at runtime if it were enabled.

By default, dbt docs gets the types it adds to the documentation from the database the project has been run against. If the models have not yet been built, it will fall back to the type attribute.
- e.g.: https://github.com/OHDSI/dbt-synthea/blob/ae791145d50c9e0693880ff9ed37d60f7cc0195d/models/omop/_models/care_site.yml#L9
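
For reference, here is a minimal sketch of what an enforced contract might look like in the model yml. The column list and types are illustrative: they follow the OMOP CDM definition of care_site and are not copied from the file linked above. Note that an enforced contract needs every column of the model declared with a data_type.

```yaml
# Illustrative only: not the actual contents of models/omop/_models/care_site.yml
models:
  - name: care_site
    config:
      contract:
        enforced: true   # dbt compares the declared columns and types with what the SQL returns
    columns:
      - name: care_site_id
        # Declared as integer per the OMOP CDM; if the model's SQL produces
        # this via row_number() (BIGINT), the contract check would reject it.
        data_type: integer
      - name: care_site_name
        data_type: varchar
      - name: place_of_service_concept_id
        data_type: integer
      - name: location_id
        data_type: integer
      - name: care_site_source_value
        data_type: varchar
      - name: place_of_service_source_value
        data_type: varchar
```

With `enforced: true` left out, the same data_type entries are only descriptive and nothing is checked at build time.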

This issue is to track discussion about what we wish to do regarding this. We can either:

1. Build the project when building the docs, so that we have a guaranteed output of what the types will be.
2. Change the yml to use type instead of data_type, so we don't have to build the project in CI.

I am a fan of 1, as it opens up the possibility of using contracts in the future - and using the type attribute is not documented anywhere (and violates the official yml definitions).
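
As a rough sketch of what building before generating docs could look like in a workflow (this is not the repo's actual workflow file; the workflow name, trigger, Python version, adapter package, and artifact name are all assumptions, and it presumes a profiles.yml the runner can find and that the seeds in the repo are enough to build the models):

```yaml
# Hypothetical workflow: names and versions are placeholders, not taken from the repo
name: docs
on:
  push:
    branches: [main]

jobs:
  build-docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      # Assumes a DuckDB target; swap in whichever adapter the project uses
      - run: pip install dbt-core dbt-duckdb
      - run: dbt deps
      # Load seeds and build the models first, so the database has real column types
      - run: dbt seed
      - run: dbt run
      # docs generate now reads the types back from the built relations
      - run: dbt docs generate
      - uses: actions/upload-artifact@v4
        with:
          name: dbt-docs
          path: target
```

The only change relative to a docs-only job is the seed and run steps before `dbt docs generate`.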

See this thread for origin of issue

katy-sadowski commented 1 month ago

Thanks @lawrenceadams for this investigation! Very interesting. I think that having type checking would be beneficial, and that we should pursue this via data_type and contracts. Regarding the BIGINT pickle, I think we could potentially just specify the datatypes as bigint in the schema.yml where relevant. Many institutions use bigint for ID columns in the OMOP CDM (mine included)

Is there any downside to building the project when building docs? I'd guess maybe runtime (which for the seed DB is quite fast, so prob not a big concern), and/or size limits for what we can store in GH Actions (but I just checked and it seems for public repos there's no storage limit for Actions artifacts - https://github.com/orgs/community/discussions/26438#discussioncomment-3251931). Anything else?

lawrenceadams commented 1 month ago

Agree @katy-sadowski!

Not really - it runs within about 30 seconds so there's little added cost. Potentially if the OHDSI GitHub org has burnt through all their free minutes (but having a quick look it doesn't look like many repos use them at length, so I think we should be fine...)

katy-sadowski commented 1 month ago

> Potentially if the OHDSI GitHub org has burnt through all their free minutes (but having a quick look it doesn't look like many repos use them at length, so I think we should be fine...)

Our 30 sec run is a drop in the bucket compared to many repos which run extensive tests against multiple cloud DBs. So we should be good here.

lawrenceadams commented 1 month ago

Sweet! Will make the change later

OHDSI / dbt-synthea #75