datahub-project / datahub

The Metadata Platform for your Data Stack
https://datahubproject.io
Apache License 2.0

Ingesting a delta table fails with pyo3_runtime.PanicException #9180

Closed blaze225 closed 5 months ago

blaze225 commented 8 months ago

Describe the bug

Ingesting a delta table using the ingestion template results in a pyo3_runtime.PanicException:

    ....
    pipeline.run()
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 367, in run
    for wu in itertools.islice(
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 143, in auto_workunit_reporter
    for wu in stream:
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/delta_lake/source.py", line 353, in get_workunits_internal
    yield from self.process_folder(self.source_config.complete_path)
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/delta_lake/source.py", line 312, in process_folder
    delta_table = read_delta_table(path, self.storage_options, self.source_config)
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/delta_lake/delta_lake_utils.py", line 28, in read_delta_table
    return DeltaTable(
  File "/usr/local/lib/python3.10/site-packages/deltalake/table.py", line 250, in __init__
    self._table = RawDeltaTable(
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: InvalidHeaderValue

To Reproduce

Steps to reproduce the behavior:

  1. Populate the delta lake ingestion template:

      source:
        type: "delta-lake"
        config:
          base_path: ""
          env: "TEST"
          version_history_lookback: -1
          s3:
            aws_config:
              aws_access_key_id: ""
              aws_endpoint_url: ""
              aws_region: ""
              aws_proxy:
                http: ""
                https: ""
              aws_secret_access_key: ""
      sink:
        type: "datahub-rest"
        config:
          server: ""

  2. Ingest using the SDK pipeline:

      from datahub.ingestion.run.pipeline import Pipeline
      ...
      pipeline = Pipeline.create(<ingestion_template>)
      pipeline.run()
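
A note on the error itself: InvalidHeaderValue matches the name of the error the Rust http crate returns when a string cannot be used as an HTTP header value, for example because it contains a newline or another control character. Assuming (and this is only an assumption, not a confirmed cause) that one of the recipe values ends up in a request header, a quick pre-flight scan of the recipe could rule that out. The sketch below uses a hypothetical helper, find_suspect_values, which is not part of DataHub or deltalake, and the recipe values are made up. It reports only where a suspect value sits, never the value itself.

```python
from typing import Any, Iterator, Tuple


def find_suspect_values(node: Any, path: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (dotted_path, reason) for string leaves that look unsafe or unset.

    Hypothetical helper: the checks are a heuristic, not DataHub's validation.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            yield from find_suspect_values(value, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_suspect_values(value, f"{path}[{index}]")
    elif isinstance(node, str):
        if node == "":
            # Flagged because an unset credential or proxy is worth double-checking.
            yield path, "empty string"
        elif any(not (32 <= ord(ch) < 127) for ch in node):
            # Anything outside visible ASCII (e.g. a trailing newline) is flagged.
            yield path, "contains a non-printable or non-ASCII character"


# Made-up recipe fragment with a trailing newline in the access key.
recipe = {
    "source": {
        "type": "delta-lake",
        "config": {
            "base_path": "s3://example-bucket/table",
            "s3": {
                "aws_config": {
                    "aws_access_key_id": "AKIAEXAMPLE\n",
                    "aws_region": "",
                }
            },
        },
    }
}

for location, reason in find_suspect_values(recipe):
    print(f"{location}: {reason}")
```

Running the scan on the recipe dict before passing it to Pipeline.create would at least show whether a stray newline or an unset value is in play.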

Expected behavior

The delta table is ingested without any errors.

hsheth2 commented 7 months ago

@blaze225 it looks like this exception is coming from the deltalake library.

While I agree that we can do a better job with error handling in datahub, would you mind also filing a bug report against their repo, https://github.com/delta-io/delta-rs?
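
On the error-handling side, one detail is relevant: PyO3 defines PanicException as a subclass of BaseException rather than Exception, so a plain `except Exception` around the deltalake call will not catch it. A minimal sketch of a guard follows. call_guarding_panics and NativePanicError are hypothetical names, not existing DataHub code, and the stand-in PanicException class only simulates the real one so the example runs without deltalake installed.

```python
from typing import Any, Callable


class NativePanicError(RuntimeError):
    """Hypothetical wrapper that turns a native panic into a regular exception."""


def call_guarding_panics(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Call fn and convert a PyO3-style panic into NativePanicError.

    Matching on the class name is a heuristic chosen for this sketch; it avoids
    importing anything from the native extension.
    """
    try:
        return fn(*args, **kwargs)
    except (KeyboardInterrupt, SystemExit, GeneratorExit):
        # Never swallow interpreter-level control flow.
        raise
    except BaseException as exc:
        if type(exc).__name__ == "PanicException":
            raise NativePanicError(f"native library panicked: {exc}") from exc
        # Ordinary exceptions pass through unchanged.
        raise


class PanicException(BaseException):
    """Stand-in for pyo3_runtime.PanicException, used only for this demo."""


def fake_read_delta_table(path: str) -> None:
    """Simulates the failing call from the traceback; not the real function."""
    raise PanicException(
        "called `Result::unwrap()` on an `Err` value: InvalidHeaderValue"
    )


try:
    call_guarding_panics(fake_read_delta_table, "s3://example-bucket/table")
except NativePanicError as err:
    print(f"ingestion failed cleanly: {err}")
```

With a guard of this kind, the source could report the failure for the affected table and continue, instead of the whole pipeline run aborting on an exception that bypasses the usual handlers.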

github-actions[bot] commented 6 months ago

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 30 days since being marked as stale.