Open MattTriano opened 1 year ago
Per the feature-set comparisons on awesome-data-catalogs, it looks like my assessment about this product space was accurate; DataHub and OpenMetadata are the most feature-rich and developed options, but there's one more comparably feature-rich project: OpenDataDiscovery. That project only has 680 stars at the moment, is a month older than OpenMetadata, and it is growing much less rapidly than OpenMetadata or DataHub.
Looks like I'll have to upgrade docker-compose to at least v2.0.0 to use OpenMetadata (which will involve updating makefile recipes to use docker compose instead of docker-compose), and per the compatibility docs, it looks like some commands I don't use have been removed.
Misc notes
DataHub Metadata Enrichment
After ingesting metadata into DataHub, you can enrich it in several ways:
Enrich at source (e.g., via comments in SQL table definitions, meta blocks in dbt schema.yml files, description fields in LookML dimension/metric definitions, etc.)
Useful when there are patterns in the source data (e.g., common terms, field names, or concepts). Example: any time a column's name matches some regex, apply a given tag.
If you have a Google Doc or something similar defining ownership and definitions, you can ingest that (unclear how much config you have to do to parse the sheet)
For programmatic metadata (e.g., outputs from CI/CD processes)
The initial one shown where you add info through the UI
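To make the pattern-based option above concrete, here's a rough sketch of the "column name matches a regex, so apply a tag" idea in plain Python. This is not DataHub's API or config format; the rule list, tag names, and function names (TAG_RULES, tags_for_column, tag_columns) are all made up for illustration.

```python
import re

# Hypothetical rules (assumption, not from DataHub): regex on column name -> tag.
TAG_RULES = [
    (re.compile(r"(^|_)email($|_)", re.IGNORECASE), "PII"),
    (re.compile(r"_at$|_date$", re.IGNORECASE), "Timestamp"),
]


def tags_for_column(column_name):
    """Return the tags whose pattern matches the column name, in rule order."""
    return [tag for pattern, tag in TAG_RULES if pattern.search(column_name)]


def tag_columns(column_names):
    """Map each column name to its matching tags, omitting untagged columns."""
    tagged = {}
    for name in column_names:
        tags = tags_for_column(name)
        if tags:
            tagged[name] = tags
    return tagged


if __name__ == "__main__":
    print(tag_columns(["user_email", "created_at", "amount"]))
```

In a real deployment the rules would presumably live in whatever configuration the catalog's ingestion pipeline expects; the point here is only that one rule can cover every matching column without anyone tagging them by hand.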
Did you consider Amundsen? https://github.com/amundsen-io/amundsen - I'm not entirely sure if it's considered a data catalog, but just a callout that it has 3,700 stars
@bbrewington I gave it a brief look but the relatively modest amount of activity on the amundsen repo put it below DataHub and OpenMetadata on my list of things to check. I will confess, I couldn't get a great sense of the feature-sets of either of those tools from their websites and decided I'd just spin up test deployments for both and scan through the features. Here's my test setup of OpenMetadata and I'll probably spin up a DataHub test run tomorrow.
Have you used it? If so, what did you think of it? I checked through your repos and commits to see if you were a contributor but I didn't check too far. By the way, it looks like we've been looking at a lot of the same docs and projects over the past few months, and I like the commit msgs on your dbt-BQ-info_schema repo.
@MattTriano haha sounds like the clickbait hooked you in (having some fun with that one) - here's link for reference: https://github.com/bbrewington/dbt-bigquery-information-schema
TBH I'm still pretty new to Metadata tools...actually the above linked repo might be a good use case to try some of these out. I assumed Amundsen was best in class, but now will consider the 3 against each other
A data catalog should have:
dbt's built-in doc server does include most of that functionality (even access control, apparently https://www.getdbt.com/blog/teaching-dbt-about-grants/), but it doesn't allow users to edit things through the portal, and I think it's intended more as a dev tool than a production option.
There are two options I want to evaluate:
I've looked at Amundsen, but its community is about 5% as active as OpenMetadata's community, and I don't think it will keep up.