datacontract / datacontract-cli

CLI to manage your datacontract.yaml files
https://cli.datacontract.com

datacontract import --format dbt #104

Open simonharrer opened 3 months ago

simonharrer commented 3 months ago

Out of #103 came the idea of having an import of dbt models to a datacontract.yaml

datacontract import --format dbt models.yaml

emirkmo commented 3 months ago

I already do something like this for import, creating a datacontract.yaml from a dbt project, but I was using the "schema" field instead of the "models" field, with a custom schema type. (Slightly off-topic, but "schema" was much more widely understood than "models" in our workshops. Just some feedback on its deprecation in the specification.)

However, our code was quite specific to the format of the dbt projects we allowed. To do it properly, one would want to parse and use the manifest.json file from a dbt project, which is the most straightforward way of working with dbt projects generically.

You would go through the dbt nodes in the manifest and, for every node with a resource_type of model, import the columns, the data_types if given, the descriptions if given, etc. The only difficulty is mapping the data_types to the ones supported by the datacontract spec, which is why a schema specific to the physical model might make more sense for the import. As a first step, though, the model in models could simply omit the data_type, or provide the dbt one if it matches.
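To make the manifest idea concrete, here is a minimal sketch, not the CLI's implementation. It assumes the usual manifest.json layout (a top-level "nodes" mapping whose entries carry "resource_type", "name", "description", and "columns"). The function names and the small TYPE_MAP table are hypothetical and only illustrate the "provide the type if it matches" first step.

```python
import json
from pathlib import Path

# Hypothetical, partial mapping from warehouse type names to data contract
# field types. Anything not listed here is left without a type.
TYPE_MAP = {
    "varchar": "string",
    "text": "string",
    "integer": "integer",
    "bigint": "long",
    "boolean": "boolean",
    "date": "date",
    "timestamp": "timestamp",
}


def load_manifest(path):
    """Read a dbt manifest.json file into a dict."""
    return json.loads(Path(path).read_text())


def models_from_manifest(manifest):
    """Build a data contract style `models` mapping from dbt model nodes."""
    models = {}
    for node in manifest.get("nodes", {}).values():
        # Skip tests, seeds, snapshots, and other non-model nodes.
        if node.get("resource_type") != "model":
            continue
        fields = {}
        for col_name, col in node.get("columns", {}).items():
            field = {}
            data_type = (col.get("data_type") or "").lower()
            if data_type in TYPE_MAP:
                field["type"] = TYPE_MAP[data_type]
            if col.get("description"):
                field["description"] = col["description"]
            fields[col_name] = field
        model = {"type": "table", "fields": fields}
        if node.get("description"):
            model["description"] = node["description"]
        models[node["name"]] = model
    return models
```

With a compiled project, this could be called as `models_from_manifest(load_manifest("target/manifest.json"))`, and the result written under the `models` key of a datacontract.yaml.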

(For parsing the manifest, dagster-dbt does this as well, and its code is Apache-2.0 licensed, if you are looking for inspiration.) The import is something I can contribute, if the implementation sounds OK.


Much easier, of course, is to be pointed at a dbt schema.yaml file and use that to import the models. Anything not defined in that YAML file would be missed. Then again, maybe that's OK.
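The simpler schema.yaml route might look roughly like the sketch below. It assumes the file has already been parsed into a dict (for example with PyYAML's `yaml.safe_load`) and follows the usual dbt properties layout, where `models` is a list and each model has a list of `columns`. The function name is hypothetical, and passing the dbt `data_type` through unmapped is an assumption made for illustration.

```python
def models_from_schema(schema):
    """Build a data contract style `models` mapping from a parsed dbt schema file."""
    models = {}
    for model in schema.get("models") or []:
        fields = {}
        for col in model.get("columns") or []:
            field = {}
            # Assumption: the dbt data_type is copied as-is, without mapping.
            if col.get("data_type"):
                field["type"] = col["data_type"]
            if col.get("description"):
                field["description"] = col["description"]
            fields[col["name"]] = field
        entry = {"type": "table", "fields": fields}
        if model.get("description"):
            entry["description"] = model["description"]
        models[model["name"]] = entry
    return models
```

Only models and columns that are declared in the file show up in the result, which is exactly the limitation mentioned above.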

pixie79 commented 1 month ago

I think the latter is fine, as I presume most people with more than a few dbt models split them into one model per file; otherwise it gets unwieldy very quickly. Either that, or parse them all but allow an input that specifies which models to include in the data contract, since you might want only two or three for a specific contract.
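The "parse them all, then pick" option could be sketched as a small filter over the parsed models. The `include` parameter and the function name are hypothetical; this is not an existing option of the CLI.

```python
def select_models(models, include=None):
    """Keep only the named models; keep everything when `include` is None."""
    if include is None:
        return dict(models)
    # Fail loudly on names that were not found, rather than silently dropping them.
    missing = set(include) - set(models)
    if missing:
        raise ValueError(f"unknown models: {sorted(missing)}")
    return {name: models[name] for name in include}
```

Raising on unknown names is a design choice in this sketch: a typo in the selection would otherwise produce a contract that quietly lacks a model.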

emirkmo commented 1 month ago

> I think the latter is fine, as I presume most people with more than a few dbt models split them into one model per file; otherwise it gets unwieldy very quickly.

This does not match my experience with larger dbt projects. But one or several models can logically coexist and be part of a data contract, so it is fine anyway? (It's reasonable to ask or expect that models from different data products/contracts are not mixed.)

torbenkeller commented 3 weeks ago

I'm looking into this right now

simonharrer commented 3 weeks ago

Awesome! I assigned you the issue. :-)

jochenchrist commented 2 days ago

@torbenkeller any progress here?

teoria commented 15 hours ago

I've been working with dbt; maybe I can help.

torbenkeller commented 14 hours ago

@jochenchrist I was working on other things for the last few weeks, sorry. But I will continue on this.

@teoria Sounds good. If you want, we can pair program to get this ready.

torbenkeller commented 1 hour ago

@teoria You can contact me on the datacontract Slack server.