datahub-project / datahub

The Metadata Platform for your Data Stack
https://datahubproject.io
Apache License 2.0

Duplicated database and table names #2896

Closed moravend closed 2 years ago

moravend commented 3 years ago

Hey guys, my company deploys databases in several countries, and I was wondering how to solve the problem of duplicated database/table names.

For example, the MySQL database crm and table customers exist in both countries: Singapore (SG) and Malaysia (MY).

Two methods I can think of to differentiate them:

  1. Prepend the country code in front of the dataset identifier

urn:li:dataset:(urn:li:dataPlatform:mysql,sg.cms.customers,PROD)
urn:li:dataset:(urn:li:dataPlatform:mysql,my.cms.customers,PROD)

However, this is not possible when using Metadata Ingestion MySQL helper to ingest metadata from MySQL.
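A minimal sketch of the prefixing approach when URNs are emitted by hand, using plain string formatting rather than any DataHub API. The helper name `make_prefixed_dataset_urn` and its parameters are hypothetical, chosen only for illustration:

```python
def make_prefixed_dataset_urn(platform, country_code, name, env="PROD"):
    """Build a dataset URN whose name carries a lowercase country-code prefix.

    Hypothetical helper: plain string formatting, not part of DataHub.
    """
    prefixed_name = f"{country_code.lower()}.{name}"
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{prefixed_name},{env})"


# The same database/table in two countries yields two distinct URNs.
sg_urn = make_prefixed_dataset_urn("mysql", "SG", "cms.customers")
my_urn = make_prefixed_dataset_urn("mysql", "MY", "cms.customers")
```

This only helps where the URN is built manually. It does not change what the MySQL ingestion helper emits.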

  2. For the dataset URN, add another field for country_code after origin

urn:li:dataset:(urn:li:dataPlatform:mysql,cms.customers,PROD,SG)
urn:li:dataset:(urn:li:dataPlatform:mysql,cms.customers,PROD,MY)

This requires alteration of the DatasetKey.
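To illustrate the shape of that change, here is a rough sketch of a key with a fourth field. The class `RegionalDatasetKey` and its `country_code` field are hypothetical and are not DataHub's actual model; the sketch only shows how the extra field would flow into the URN string:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionalDatasetKey:
    """Hypothetical key: platform, name and origin plus an extra country_code."""

    platform: str       # e.g. "urn:li:dataPlatform:mysql"
    name: str           # e.g. "cms.customers"
    origin: str         # e.g. "PROD"
    country_code: str   # hypothetical fourth field, e.g. "SG"

    def to_urn(self):
        # Serialize all four fields, in order, inside the URN tuple.
        return (
            f"urn:li:dataset:({self.platform},{self.name},"
            f"{self.origin},{self.country_code})"
        )


sg_key = RegionalDatasetKey("urn:li:dataPlatform:mysql", "cms.customers", "PROD", "SG")
my_key = RegionalDatasetKey("urn:li:dataPlatform:mysql", "cms.customers", "PROD", "MY")
```

Anything that parses or builds three-field dataset URNs would also need to understand the fourth field, which is why this option is more invasive than prefixing the name.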

I'd appreciate it if anyone could help me with this!

jjoyce0510 commented 3 years ago

Hi there! We totally agree we need a better way to model "domains" or bounded contexts of data. Ideally, we'd want to do this in a way that generalizes across all asset types (datasets, dashboards, data pipelines, etc). Explicitly modeling domains in DataHub is on our roadmap.

Let us see if we can do something quickly to unblock your integration, and we'll loop back with a more detailed estimate of when and how we will accomplish the more robust solution described above. cc @hsheth2

maggiehays commented 2 years ago

Hi folks, this is currently being addressed by @jjoyce0510's Domain & Container work in our Q1'22 roadmap; please refer to that source to stay up to date with progress. We'll close out this issue in the meantime!