Evaluate open source data catalog options for integration into this platform

MattTriano commented 1 year ago

A data catalog should have:

Documentation of core metadata (table name, table description, table grain, source, etc.),
Documentation of table schema (column names, descriptions, data types, etc.),
Lineage information,
Usage information,
Access control information,
Search functionality, and
the ability for users to enrich data with tags and further information.

dbt's built-in doc server does include most of that functionality (even access control, apparently https://www.getdbt.com/blog/teaching-dbt-about-grants/), but it doesn't allow users to edit things through the portal, and I think it's intended more as a dev tool than a production option.

There are two options I want to evaluate:

DataHub
- ~7k stars, project started in 2016, main open source option.
- features
OpenMetadata
- ~1.8k stars, project started in Aug 2021, growing faster than DataHub and has a slightly more active community.
- features

I've looked at Amundsen, but its community is about 5% as active as OpenMetadata's community, and I don't think it will keep up.

MattTriano commented 1 year ago

Per the feature-set comparisons on awesome-data-catalogs, it looks like my assessment about this product space was accurate; DataHub and OpenMetadata are the most feature-rich and developed options, but there's one more comparably feature-rich project: OpenDataDiscovery. That project only has 680 stars at the moment, is a month older than OpenMetadata, and it is growing much less rapidly than OpenMetadata or DataHub.

MattTriano commented 1 year ago

Looks like I'll have to upgrade docker-compose to at least v2.0.0 to use OpenMetadata, which will involve updating makefile recipes to use docker compose instead of docker-compose. Per the compatibility docs, it looks like some commands I don't use have been removed.

MattTriano commented 1 year ago

Misc notes

DataHub Metadata Enrichment

After ingesting metadata into DataHub, you can enrich metadata through the UI:

Describe a dataset, or even add descriptions of columns,
Set the owner(s) of the dataset,
Add tags for the dataset,
Add a glossary of terms,
Add a domain for the dataset.

Shift Left enrichment

Enrich at source (e.g., via comments in SQL table definitions, meta blocks in dbt schema.yml files, description fields in LookML dimension/metric definitions, etc.)

Transform Enrichment

Useful when there are patterns in the source data (e.g., common terms, field names, or concepts). For example: any time there's a column whose name matches some regex, apply a given tag.
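
To make the regex-to-tag idea concrete, here's a minimal sketch in plain Python. This is not DataHub's transformer API; the rule list, the tag names, and the `tags_for_column` / `tag_columns` helpers are all hypothetical and only illustrate the pattern-matching step.

```python
import re

# Hypothetical rules: (regex matched against the column name, tag to apply).
TAG_RULES = [
    (re.compile(r"(^|_)(email|phone|ssn)($|_)", re.IGNORECASE), "pii"),
    (re.compile(r"_id$", re.IGNORECASE), "identifier"),
    (re.compile(r"(_at|_date)$", re.IGNORECASE), "temporal"),
]


def tags_for_column(column_name):
    """Return the sorted tags whose pattern matches the column name."""
    return sorted(
        {tag for pattern, tag in TAG_RULES if pattern.search(column_name)}
    )


def tag_columns(column_names):
    """Map each column name to the tags its name qualifies for."""
    return {name: tags_for_column(name) for name in column_names}
```

Calling `tag_columns(["user_id", "email", "amount"])` would tag `user_id` as `identifier`, `email` as `pii`, and leave `amount` untagged. In a real setup the resulting tags would be handed to whatever the catalog's ingestion pipeline expects.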

CSV: Bulk Enrichment Import

If you have a Google doc or something defining ownership and definitions, you can ingest that (unclear how much config you have to do to parse the sheet).
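
As a rough idea of what parsing such a sheet might involve, here's a sketch that reads a CSV export into a per-table dict. The column names (`table`, `description`, `owners`) and the semicolon-separated owner format are assumptions for illustration, not the layout DataHub's CSV ingestion actually expects.

```python
import csv
import io


def load_enrichment_rows(csv_text):
    """Parse a CSV export into {table: {"description": ..., "owners": [...]}}.

    Assumes hypothetical columns named table, description, and owners,
    with multiple owners separated by semicolons.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    enrichment = {}
    for row in reader:
        table = row["table"].strip()
        owners = [o.strip() for o in row["owners"].split(";") if o.strip()]
        enrichment[table] = {
            "description": row["description"].strip(),
            "owners": owners,
        }
    return enrichment
```

The output is just a plain dict; mapping it onto the catalog's own ownership and description fields would be a separate step.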

API Enrichment

For programmatic metadata (e.g., outputs from CI/CD processes)

DataHub UI

The first option described above, where you add info through the UI.

bbrewington commented 1 year ago

did you consider Amundsen? https://github.com/amundsen-io/amundsen - not entirely sure if it's considered a data catalog, but just a callout it has 3,700 stars

MattTriano commented 1 year ago

@bbrewington I gave it a brief look but the relatively modest amount of activity on the amundsen repo put it below DataHub and OpenMetadata on my list of things to check. I will confess, I couldn't get a great sense of the feature-sets of either of those tools from their websites and decided I'd just spin up test deployments for both and scan through the features. Here's my test setup of OpenMetadata and I'll probably spin up a DataHub test run tomorrow.

Have you used it? If so, what did you think of it? I checked through your repos and commits to see if you were a contributor but I didn't check too far. By the way, it looks like we've been looking at a lot of the same docs and projects over the past few months, and I like the commit msgs on your dbt-BQ-info_schema repo.

bbrewington commented 1 year ago

@MattTriano haha sounds like the clickbait hooked you in (having some fun with that one) - here's link for reference: https://github.com/bbrewington/dbt-bigquery-information-schema

TBH I'm still pretty new to Metadata tools...actually the above linked repo might be a good use case to try some of these out. I assumed Amundsen was best in class, but now will consider the 3 against each other

MattTriano / analytics_data_where_house