dbt-labs / dbt-docs

Auto-generated data documentation site for dbt projects
Apache License 2.0

Create ERD from relationships in schema.yml files #84

Open noahbruegmann opened 4 years ago

noahbruegmann commented 4 years ago

For complicated table relationships, it's nice to have an ERD (i.e. schema diagram) to help understand how everything fits together.

Schema.yml files already have relationship tests built in that show how tables relate, so those could be parsed to get the data needed for the "keys" (which may not be actual enforced keys in most warehouse systems).

A first implementation of this functionality might just look at the schema.yml file and get any columns listed there so that only one data source is needed. A fuller implementation would get all columns from the tables.
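As a sketch of that first implementation — assuming the schema.yml has already been parsed into a dict (e.g. with PyYAML); the function name and layout here are illustrative, not part of dbt-docs:

```python
import re

def extract_relationships(schema):
    """Return (model, column, referenced_model, referenced_field) tuples
    from a parsed schema.yml dict, using its relationships tests."""
    edges = []
    for model in schema.get("models", []):
        for column in model.get("columns", []):
            for test in column.get("tests", []):
                if isinstance(test, dict) and "relationships" in test:
                    rel = test["relationships"]
                    # 'to' is usually a ref() expression; strip it to a bare name
                    m = re.search(r"ref\(['\"]([^'\"]+)['\"]\)", rel.get("to", ""))
                    to_model = m.group(1) if m else rel.get("to")
                    edges.append((model["name"], column["name"], to_model, rel.get("field")))
    return edges

schema = {
    "models": [{
        "name": "orders",
        "columns": [{
            "name": "customer_id",
            "tests": [{"relationships": {"to": "ref('customers')", "field": "id"}}],
        }],
    }]
}
print(extract_relationships(schema))  # [('orders', 'customer_id', 'customers', 'id')]
```

Only columns that appear in the schema.yml are visible to this approach, which is exactly the "one data source" trade-off described above.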

Other thoughts:

tayloramurphy commented 4 years ago

We're thinking about this too. dbt docs are great, but a lot of the data in there isn't for the right downstream audience. If you have business users that want to know what to query and where, having a clear ERD can be more useful than saying "go read the dbt docs".

I agree that this should just be done in the schema.yml files, probably using the ref syntax with some specific variables for PKs, FKs, etc.

aroder commented 3 years ago

We've tried a lot of solutions for this, none of which quite fit and none of which supported the schema.yml spec directly.

https://github.com/BurntSushi/erd, https://gojs.net/latest/index.html, https://pypi.org/project/ERAlchemy/, http://www.nomnoml.com/, https://dbdocs.io/

Right now, we have a functional tool that parses the schema.yml, specifically the relationships tests and columns, to auto-generate an ERD:

[image: screenshot of the auto-generated ERD]

I've never worked in the dbt codebase, and there is a fair amount of work to make this something that will work generally. I'm interested in seeing this through, but would need some help

aroder commented 3 years ago

https://getdbt.slack.com/archives/C0VLNUUTZ/p1603828627482100

jtcohen6 commented 3 years ago

This is really, really cool!

Since dbt-docs is already built out of artifacts produced by dbt docs generate, namely manifest.json (which includes all tests and documentation) and catalog.json (which includes the set of all columns in the database), I think it would make sense to leverage the information available in those as inputs to the ERD.

E.g. grab the to/from components of all relationships tests in a v0.18.1-produced manifest:

jq '.nodes | .[] | select(.test_metadata.name=="relationships") | .test_metadata.kwargs' target/manifest.json
...
{
  "to": "ref('snowplow_web_page_context')",
  "field": "page_view_id",
  "column_name": "page_view_id",
  "model": "{{ ref('snowplow_web_events') }}"
}
...
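A rough Python equivalent of that jq filter, assuming a manifest of the same era already loaded with the standard json module:

```python
import json

def relationship_edges(manifest):
    """Yield the kwargs of every relationships test in a parsed manifest."""
    for node in manifest["nodes"].values():
        meta = node.get("test_metadata") or {}
        if meta.get("name") == "relationships":
            # kwargs carries 'model', 'column_name', 'to', and 'field'
            yield meta["kwargs"]

# Usage: manifest = json.load(open("target/manifest.json"))
manifest = {
    "nodes": {
        "test.snowplow.relationships_example": {
            "test_metadata": {
                "name": "relationships",
                "kwargs": {
                    "to": "ref('snowplow_web_page_context')",
                    "field": "page_view_id",
                    "column_name": "page_view_id",
                    "model": "{{ ref('snowplow_web_events') }}",
                },
            }
        },
        "model.snowplow.snowplow_web_events": {},
    }
}
for edge in relationship_edges(manifest):
    print(edge["model"], "->", edge["to"])
```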

So my questions are:

aroder commented 3 years ago

@jtcohen6 there is no benefit to parsing schema.yml directly. We don't have any experience working in the dbt docs code, so we just took the path of least resistance.

My preference would be to use the existing specification if possible. The column names, unique tests, and relationship tests provide the necessary metadata. If we are able to use the existing specification, then something like this would just work. Maybe we would need to add a toggle in dbt_project.yml to enable the feature, or something to that effect.

However, I can see a lot of edge cases that my initial assumptions don't cover. We stick to Kimball and star schemas pretty heavily, but what about people who use a normalized model and don't have a visually appealing star schema? What about the use case of seeing the entire data model in one visual?

tayloramurphy commented 3 years ago

> @tayloramurphy It sounds like you'd want ERD components to be an additional schema.yml specification, rather than relying on unique + relationships tests to define PKs + FKs, respectively. Could you say more about your thinking there?

My first ideas for solutions typically lean more explicit than implicit, but if we're able to parse out relevant joins just based on testing, I think it would make sense. Like @aroder mentioned, there are corner cases there.

Stuff like that. I think there are some clear easy wins / iterations that can happen on this which can certainly be improved over time. We're just doing things in LucidChart right now which makes me sad πŸ˜† https://about.gitlab.com/handbook/business-ops/data-team/platform/edw/#first-iteration-in-gitlab-dwh

RichardSharkey commented 3 years ago

This is a fantastic idea and would certainly decrease the fragmentation of documentation. If I were to add an opinion to the mix, I think defining the explicit configuration would be preferable, since you could then define the relationship style for the ERD connecting lines.

I'm wondering if using the exposures concept would be a good vehicle to start from, for this requirement. Since you are already configuring inferred relationships between a subset of objects, it would make sense to define how they are related. It would also mean you don't need to worry about the "limiting" factor straight away, since the expectation would be that you want to see all objects defined in the exposure.

With regards to limiting columns in the tables, perhaps you could have a max limit of 50 columns with a collapse/expand option to hide the excess columns interactively.

Also, imagine being able to hover over the column in the ERD and seeing a tooltip of the docs description or a link to docs page.

codigo-ergo-sum commented 3 years ago

I want this feature as much as anyone. Having said that, I'm not so sure that the data in the schema.yml files provides the full information needed to do this. A couple of specific items that you need in order to build an ERD are not present in schema.yml files:

  1. Primary keys. One might argue: "Well, at the least, uniqueness tests indicate the equivalent of an enforced unique index on a table, and all a primary key is is choosing which of the one or more unique columns present in a table is the actual PK if there is more than one. At the least we could indicate unique columns on the ERD even if we can't show a PK." Even if that's the case, the challenge is that in schema.yml files you can call out things that are not actual columns in the table for a uniqueness test (and for many things in general), e.g. "col1 || col2" or "SubStr(col1, 2, 4)". How would those be shown on the ERD, and how would relationships be drawn from another table to a table where one of these calculated expressions is what is unique?

  2. Column datatypes. We would need to interrogate INFORMATION_SCHEMA or the equivalent for these. That is doable, it just leads to another layer of complexity for the code. What if the columns defined in INFORMATION_SCHEMA don't match up to what is in the .yml files?
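For what it's worth, the expression problem in point 1 can be handled defensively. A hypothetical sketch that only treats a unique-tested "column" as a drawable PK when it is a plain identifier, and skips expressions:

```python
import re

# A bare SQL identifier: letters, digits, underscores, not starting with a digit.
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def pk_candidates(unique_tested_columns):
    """Split unique-tested 'columns' into drawable PKs and skipped expressions."""
    keys, skipped = [], []
    for col in unique_tested_columns:
        (keys if IDENTIFIER.match(col) else skipped).append(col)
    return keys, skipped

keys, skipped = pk_candidates(["order_id", "col1 || col2", "SubStr(col1, 2, 4)"])
print(keys)     # ['order_id']
print(skipped)  # ['col1 || col2', 'SubStr(col1, 2, 4)']
```

Skipped expressions could still be listed in a tooltip or footnote rather than drawn as key columns.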

noahbruegmann commented 3 years ago
> 1. Primary keys. One might argue: "Well, at the least, uniqueness tests indicate the equivalent of an enforced unique index on a table, and all a primary key is is choosing which of the one or more unique columns present in a table is the actual PK if there is more than one. At the least we could indicate unique columns on the ERD even if we can't show a PK." Even if that's the case, the challenge is that in schema.yml files you can call out things that are not actual columns in the table for a uniqueness test (and for many things in general), e.g. "col1 || col2" or "SubStr(col1, 2, 4)". How would those be shown on the ERD, and how would relationships be drawn from another table to a table where one of these calculated expressions is what is unique?

To get around this, you could create some custom schema tests in a package that are reserved for things that you want to include in the ERD. They'd be custom tests that didn't really do anything interesting, but they could serve as a hook for helping decide what to pick up. This feels like it would be a nice balance between not having to define everything twice and still getting to a more explicit definition for the ERD, to address @tayloramurphy 's comment.
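A hedged sketch of what those reserved hook tests might look like in a schema.yml — the test names erd_primary_key and erd_foreign_key are invented here for illustration, and would need trivially-passing implementations in a package:

```yaml
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - erd_primary_key        # hypothetical no-op test: marks the PK for the ERD
      - name: customer_id
        tests:
          - erd_foreign_key:       # hypothetical no-op test: marks an FK edge
              to: ref('customers')
              field: id
```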

jtcohen6 commented 3 years ago

> I'm wondering if using the exposures concept would be a good vehicle to start from, for this requirement. Since you are already configuring inferred relationships between a subset of objects, it would make sense to define how they are related. It would also mean you don't need to worry about the "limiting" factor straight away, since the expectation would be that you want to see all objects defined in the exposure.

I really like this. We could put an ERD right on the exposure's page in docs site.

> Column datatypes. We would need to interrogate INFORMATION_SCHEMA or the equivalent for these. That is doable, it just leads to another layer of complexity for the code. What if the columns defined in INFORMATION_SCHEMA don't match up to what is in the .yml files?

We have access to column datatypes in catalog.json, populated by the dbt docs generate command, and they ultimately come from information schema or show/describe queries. I think that should be the source of truth.
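A minimal sketch of reading types from that artifact, assuming the documented catalog.json layout (nodes keyed by unique_id, each with a columns map):

```python
import json

def column_types(catalog, unique_id):
    """Map column name -> database type for one node in a parsed catalog."""
    cols = catalog["nodes"][unique_id]["columns"]
    return {name: col["type"] for name, col in cols.items()}

# Usage: catalog = json.load(open("target/catalog.json"))
catalog = {
    "nodes": {
        "model.jaffle_shop.orders": {
            "columns": {
                "order_id": {"type": "integer", "index": 1},
                "ordered_at": {"type": "timestamp", "index": 2},
            }
        }
    }
}
print(column_types(catalog, "model.jaffle_shop.orders"))
# {'order_id': 'integer', 'ordered_at': 'timestamp'}
```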

eduard1987 commented 3 years ago

We wanted to know if you have any plans to release data-modeling capabilities with a GUI to reverse- and forward-engineer DDL.

RickRen7575 commented 3 years ago

Hi we are also looking for this functionality! Would love to hear if there is a timeline on this.

panasenco commented 3 years ago

Also very interested in this functionality. While waiting for an official implementation, I created this PowerShell solution that parses manifest.json and produces a (honestly kinda ugly but readable) PlantUML diagram: https://gist.github.com/panasenco/31d4c12f0cadce91f576a818018970a0
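In the same spirit (this is not the gist's actual code), a minimal Python sketch of emitting PlantUML entity syntax from relationship edges:

```python
def to_plantuml(edges):
    """Render (from_model, from_column, to_model, to_field) tuples as PlantUML."""
    lines = ["@startuml"]
    # Declare every model that appears on either side of an edge.
    models = sorted({m for e in edges for m in (e[0], e[2])})
    for m in models:
        lines.append(f"entity {m}")
    # Crow's-foot notation: many on the 'from' side, exactly one on the 'to' side.
    for frm, col, to, field in edges:
        lines.append(f"{frm} }}o--|| {to} : {col} = {field}")
    lines.append("@enduml")
    return "\n".join(lines)

print(to_plantuml([("orders", "customer_id", "customers", "id")]))
```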

blake-enyart commented 2 years ago

@jtcohen6, I'm interested in trying to develop the functionality here for injecting .png files into the markdown, having this performed by the dbt docs generate command with perhaps some type of flag to indicate ERD generation.

Right now, I'm starting some initial prototyping with @aroder, who has an early-stage POC that generates the .png files from the schema.yml files via a short Python script which identifies facts and dimensions based on the naming conventions fct_ and dim_. I know this isn't ideal as a more permanent solution.

With that, do you have recommendations on how I might get started? Either A) add this existing script to the dbt core system, or B) provide further guidance on what is needed to get started on this.

From there, I'm happy to carry on to the next phase of development that would leverage the manifest.json file like you recommended earlier up in this post.

Let me know what you think on next steps for this.

blake-enyart commented 2 years ago

@jtcohen6 @drewbanin (?) just wanted to follow-up on this and see if there are any insights on next steps for this.

joeyfezster commented 2 years ago

To the great dbt community - wondering if 2022 brings any update on this πŸ˜„

aroder commented 2 years ago

@joeyfezster I'm looking for collaborators to make this more robust. It's working for us but could use some work https://github.com/dbt-labs/dbt-docs/issues/84#issuecomment-717552239

noahbruegmann commented 2 years ago

> @joeyfezster I'm looking for collaborators to make this more robust. It's working for us but could use some work #84 (comment)

@aroder we have some code (written for other purposes) that lets us load up the network file and comprehend the compiled graph. If you have the visualization side, we might be able to back it with data in the way that Jeremy suggested above. Is that the part that you're looking for help with?

brandco commented 2 years ago

I like this open source package for the R programming language: https://krlmlr.github.io/dm/. Not sure how hard it would be to translate the dbt yml to what the dm package can parse (I don't know dbt well, myself).

obar1 commented 1 year ago

I would solve this by adding a yml for tuning the docs generation, with multiple options, like so

datnguye commented 1 year ago

Hey folks, I just published the first version of a CLI for generating ERD-as-code in DBML from manifest.json, which can easily be used as part of the dbt docs generation pipeline.

https://github.com/datnguye/dbterd

Check it out! Feel free to let me know if you have any questions or requests. Cheers

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

eitsupi commented 1 year ago

unstale

noahbruegmann commented 1 year ago

Thanks @eitsupi! Perhaps not surprising given that I opened the ticket initially, but I still think having a solution built into dbt would be great.

shawnchacko88 commented 7 months ago

This would be a great feature to update in dbt

Grism commented 3 months ago

This would be a fantastic addition to DBT

syedhanicf commented 2 weeks ago

A solution like this built into dbt would be awesome