datahub-project / datahub

The Metadata Platform for your Data and AI Stack
https://datahubproject.io
Apache License 2.0

[metadata-ingestion][dbt] DBTColumn dataclass does not ensure data types #11825

Open igorvoltaic opened 1 week ago

igorvoltaic commented 1 week ago

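The `DBTColumn` dataclass at https://github.com/datahub-project/datahub/blob/a559c7ee5f5d8f66dca3154621b7a42c881d5281/metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_common.py#L467 does not enforce its field types: `@dataclass` uses the annotations only to declare fields and performs no runtime type checking.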
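To illustrate the point, here is a minimal sketch of how a plain dataclass accepts a value of the wrong type. `LooseColumn` and `StrictColumn` are hypothetical stand-ins, not the real `DBTColumn`, and the `__post_init__` check is only one possible way to enforce types, not necessarily the fix the maintainers would choose.

```python
from dataclasses import dataclass


@dataclass
class LooseColumn:
    """Hypothetical stand-in for a dataclass with annotated string fields."""

    name: str
    data_type: str


@dataclass
class StrictColumn:
    """Same fields, but validated at construction time (an assumed approach)."""

    name: str
    data_type: str

    def __post_init__(self) -> None:
        # Annotations are not checked by @dataclass, so check them by hand.
        for field_name in ("name", "data_type"):
            value = getattr(self, field_name)
            if not isinstance(value, str):
                raise TypeError(
                    f"{field_name} must be str, got {type(value).__name__}"
                )


# The annotation says str, but None is accepted without complaint.
loose = LooseColumn(name="id", data_type=None)

# The validated variant rejects the same input up front.
try:
    StrictColumn(name="id", data_type=None)
    strict_rejected = False
except TypeError:
    strict_rejected = True
```

A static type checker would flag the `data_type=None` call, but nothing stops it at runtime when the value comes from parsed JSON or YAML.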

As a result, a column type that is not a string can reach `resolve_trino_modified_type` (https://github.com/datahub-project/datahub/blob/a559c7ee5f5d8f66dca3154621b7a42c881d5281/metadata-ingestion/src/datahub/ingestion/source/sql/sql_types.py#L231), where the `re.match` call raises a `TypeError`:

[2024-10-26 08:00:09,013] ERROR    {datahub.entrypoints:205} - Command failed: expected string or bytes-like object
Traceback (most recent call last):
  File "/tmp/site-packages/datahub/entrypoints.py", line 192, in main
    sys.exit(datahub(standalone_mode=False, **kwargs))
  File "/tmp/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/tmp/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/tmp/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/tmp/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/tmp/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/tmp/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/tmp/site-packages/datahub/telemetry/telemetry.py", line 454, in wrapper
    raise e
  File "/tmp/site-packages/datahub/telemetry/telemetry.py", line 403, in wrapper
    res = func(*args, **kwargs)
  File "/tmp/site-packages/datahub/cli/ingest_cli.py", line 201, in run
    ret = loop.run_until_complete(run_ingestion_and_check_upgrade())
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/tmp/site-packages/datahub/cli/ingest_cli.py", line 185, in run_ingestion_and_check_upgrade
    ret = await ingestion_future
  File "/tmp/site-packages/datahub/cli/ingest_cli.py", line 139, in run_pipeline_to_completion
    raise e
  File "/tmp/site-packages/datahub/cli/ingest_cli.py", line 131, in run_pipeline_to_completion
    pipeline.run()
  File "/tmp/site-packages/datahub/ingestion/run/pipeline.py", line 407, in run
    for wu in itertools.islice(
  File "/tmp/site-packages/datahub/ingestion/api/source_helpers.py", line 160, in auto_stale_entity_removal
    for wu in stream:
  File "/tmp/site-packages/datahub/ingestion/api/source_helpers.py", line 184, in auto_workunit_reporter
    for wu in stream:
  File "/tmp/site-packages/datahub/ingestion/api/source_helpers.py", line 277, in auto_browse_path_v2
    for urn, batch in _batch_workunits_by_urn(stream):
  File "/tmp/site-packages/datahub/ingestion/api/source_helpers.py", line 444, in _batch_workunits_by_urn
    for wu in stream:
  File "/tmp/site-packages/datahub/ingestion/api/source_helpers.py", line 201, in auto_materialize_referenced_tags_terms
    for wu in stream:
  File "/tmp/site-packages/datahub/ingestion/api/source_helpers.py", line 104, in auto_status_aspect
    for wu in stream:
  File "/tmp/site-packages/datahub/ingestion/source/dbt/dbt_common.py", line 970, in get_workunits_internal
    yield from self.create_dbt_platform_mces(
  File "/tmp/site-packages/datahub/ingestion/source/dbt/dbt_common.py", line 1254, in create_dbt_platform_mces
    aspects = self._generate_base_dbt_aspects(
  File "/tmp/site-packages/datahub/ingestion/source/dbt/dbt_common.py", line 1563, in _generate_base_dbt_aspects
    schema_metadata = self.get_schema_metadata(self.report, node, mce_platform)
  File "/tmp/site-packages/datahub/ingestion/source/dbt/dbt_common.py", line 1626, in get_schema_metadata
    or get_column_type(
  File "/tmp/site-packages/datahub/ingestion/source/dbt/dbt_common.py", line 808, in get_column_type
    TypeClass = resolve_trino_modified_type(column_type)
  File "/tmp/site-packages/datahub/ingestion/source/sql/sql_types.py", line 232, in resolve_trino_modified_type
    match = re.match(r"([a-zA-Z]+)\(.+\)", type_string)
  File "/usr/local/lib/python3.10/re.py", line 190, in match
    return _compile(pattern, flags).match(string)
TypeError: expected string or bytes-like object
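The failure at the bottom of the traceback can be reproduced in isolation, and a defensive guard is one way to avoid it. The sketch below reuses the regex shown in the traceback; `base_type_name` is a hypothetical helper written for illustration, not the actual `resolve_trino_modified_type`, and returning `None` for non-string input is an assumption about the desired behavior.

```python
import re
from typing import Any, Optional

# Same pattern as in the traceback: a type name followed by a
# parenthesized modifier, e.g. "varchar(255)".
MODIFIED_TYPE_PATTERN = re.compile(r"([a-zA-Z]+)\(.+\)")


def base_type_name(type_string: Any) -> Optional[str]:
    """Return the base name of a modified type, or None if it cannot be parsed.

    Hypothetical helper: non-string input is treated as "unknown type"
    instead of being passed to the regex engine.
    """
    if not isinstance(type_string, str):
        return None
    match = MODIFIED_TYPE_PATTERN.match(type_string)
    return match.group(1) if match else None


# Without a guard, a non-string argument makes re.match raise TypeError.
try:
    re.match(r"([a-zA-Z]+)\(.+\)", None)
    unguarded_raised = False
except TypeError:
    unguarded_raised = True

# With the guard, the same input is handled without an exception.
guarded_result = base_type_name(None)
parsed_result = base_type_name("varchar(255)")
```

Guarding at the call site keeps ingestion running when a single column has a malformed type, whereas validating in the dataclass surfaces the bad input earlier; which trade-off fits DataHub is for the maintainers to decide.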