datahub-project / datahub

The Metadata Platform for your Data Stack
https://datahubproject.io
Apache License 2.0
Vertica urns lack database name #10387

Open heyromnivan opened 2 months ago

heyromnivan commented 2 months ago

Vertica URNs don't contain the database name.

In my case, I'm trying to build joint lineage between Vertica and dbt, but the two don't connect. If I understand correctly, that's because tables described by dbt get URNs of the form urn:li:dataPlatform:vertica,dbname.schema.table, while tables ingested from Vertica get URNs of the form urn:li:dataPlatform:vertica,schema.table.
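To make the mismatch concrete, here is a minimal sketch that builds both URNs by hand. It assumes the usual full dataset URN layout, urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>), of which the strings above are the shortened middle part. The helper `dataset_urn` and the names `mydb`, `public`, and `orders` are hypothetical and only serve the illustration.

```python
def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Hand-rolled illustration of the dataset URN layout (not the DataHub builder)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Name as dbt would describe the table: database.schema.table (hypothetical names).
dbt_urn = dataset_urn("vertica", "mydb.public.orders")

# Name as the Vertica source emits it: schema.table only.
vertica_urn = dataset_urn("vertica", "public.orders")

print(dbt_urn)
print(vertica_urn)

# The strings differ, so DataHub sees two unrelated datasets and
# cannot stitch lineage between them.
print(dbt_urn == vertica_urn)
```

Because URNs are compared as plain strings, the only way to join the two is for both sources to emit exactly the same name part.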

Originally posted by @heyromnivan in https://github.com/datahub-project/datahub/issues/5483#issuecomment-2079250670

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

heyromnivan commented 1 month ago

Tested with 0.13.1.

It does look like an issue to me, because it makes Vertica effectively incompatible with any other metadata source. Even though Vertica itself doesn't allow multiple databases, it still has a database concept, and external tools (dbt, BI tools) are all designed to take the database name into account when constructing URNs.

The only workaround I found is to write a custom source that extends VerticaSource and overrides the get_identifier method.

from datahub.ingestion.api.decorators import config_class, platform_name
from datahub.ingestion.source.sql.vertica import VerticaSource, VerticaConfig
from vertica_sqlalchemy_dialect.base import VerticaInspector

@platform_name("Vertica")
@config_class(VerticaConfig)
# copy here all the remaining decorators from the latest version of VerticaSource
class MyVerticaSource(VerticaSource):
    def get_identifier(self, *, schema: str, entity: str, inspector: VerticaInspector, **kwargs) -> str:
        # Prefix the default schema.table identifier with the database name
        # so the resulting URNs match the ones produced by dbt and other tools.
        db_name = self.get_db_name(inspector)
        return f'{db_name}.{schema}.{entity}'

This can only be used with CLI ingestion, which cannot be scheduled or run through the DataHub UI, so it has to be automated with some external tool.
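For reference, a rough sketch of how such a run could be driven from Python so that an external scheduler (cron, Airflow, and so on) can call it. It assumes the custom class is importable as `my_vertica_source.MyVerticaSource` (a hypothetical module name) and that the recipe's source `type` accepts a fully qualified class path for custom sources. All connection values are placeholders, and the `build_recipe` and `run_ingestion` helpers are made up for this example.

```python
from typing import Any, Dict


def build_recipe(host_port: str, database: str, username: str,
                 password: str, gms_server: str) -> Dict[str, Any]:
    """Build an ingestion recipe that points at the custom Vertica source."""
    return {
        "source": {
            # Assumption: module my_vertica_source is on PYTHONPATH and
            # contains the MyVerticaSource class from the comment above.
            "type": "my_vertica_source.MyVerticaSource",
            "config": {
                "host_port": host_port,
                "database": database,
                "username": username,
                "password": password,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": gms_server},
        },
    }


def run_ingestion(recipe: Dict[str, Any]) -> None:
    """Run the recipe; imported lazily so build_recipe works without DataHub installed."""
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()


# Placeholder values only; replace them with real connection details.
recipe = build_recipe(
    host_port="vertica.example.com:5433",
    database="mydb",
    username="datahub",
    password="change-me",
    gms_server="http://localhost:8080",
)
# run_ingestion(recipe)  # uncomment in an environment with DataHub and the Vertica plugin installed
```

The same recipe could be written as a YAML file and passed to the CLI instead; the Python form is shown only because it is easy to call from a scheduler task.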
