OHDSI / dbt-synthea

[Under development] A dbt ETL project to convert a Synthea synthetic data set into the OMOP CDM
https://ohdsi.github.io/dbt-synthea/
Apache License 2.0
15 stars 6 forks source link

`dbt docs` does not generate column types when used in workflows #75

Closed lawrenceadams closed 4 weeks ago

lawrenceadams commented 1 month ago

We have added schema and type descriptions (yml) of the final OMOP tables from #61. This works as desired and is complaint to the dbt schema. When building the docs locally (usually after the project has been run) there are no issues, and the docs show the column types - however at present the github workflow only runs dbt docs without building the models - and so types are present in the documentation.

  1. The official documentation only mentions use of data_type attribute - however, this attribute is for enforcing type checking when enabling the dbt contract feature
  2. To get the dbt docs feature to add types to the documentation by default it checks the type returned by the database it has been run against. If the models have not yet been built, then it will fall back to the type attribute.

This issue is to track discussion about what we wish to do regarding this, we can either:

  1. Build the project when building the docs, and we have a guaranteed output of what the types will be
  2. Change the yml to use type instead of data_type so we don't have to build the project in CI.

I am a fan of 1, as it opens up the possibility of using contracts in the future - and using the type attribute is not documented anywhere (and violates the official yml definitions).


See this thread for origin of issue

katy-sadowski commented 1 month ago

Thanks @lawrenceadams for this investigation! Very interesting. I think that having type checking would be beneficial, and that we should pursue this via data_type and contracts. Regarding the BIGINT pickle, I think we could potentially just specify the datatypes as bigint in the schema.yml where relevant. Many institutions use bigint for ID columns in the OMOP CDM (mine included)

Is there any downside to building the project when building docs? I'd guess maybe runtime (which for the seed DB is quite fast, so prob not a big concern), and/or size limits for what we can store in GH Actions (but I just checked and it seems for public repos there's no storage limit for Actions artifacts - https://github.com/orgs/community/discussions/26438#discussioncomment-3251931). Anything else?

lawrenceadams commented 1 month ago

Agree @katy-sadowski!

Not really - it runs within about 30 seconds so there's little added cost. Potentially if the OHDSI github org has burnt through all their free minutes (but having a quick look it doesnt look like many repos use them at length, so I think we should be fine...)

katy-sadowski commented 1 month ago

Potentially if the OHDSI github org has burnt through all their free minutes (but having a quick look it doesnt look like many repos use them at length, so I think we should be fine...)

Our 30 sec run is a drop in the bucket compared to many repos which run extensive tests against multiple cloud DBs. So we should be good here.

lawrenceadams commented 1 month ago

Sweet! Will make the change later