datahub-project / datahub

The Metadata Platform for your Data and AI Stack
https://datahubproject.io
Apache License 2.0

Ingestion failed to complete, or completed with errors. postgres #11596

Open pricingblock-project opened 1 month ago

pricingblock-project commented 1 month ago

Describe the bug

Ingestion failed to complete, or completed with errors.

To Reproduce

Steps to reproduce the behavior:

  1. Go to Manage Data Sources -> create new source
  2. Add a postgres recipe
  3. Configure the parameters
  4. Click the finish and run button
  5. See the error

Expected behavior

The ingestion runs successfully.

Screenshots

[image]

Desktop

OS: Ubuntu 20.04.1 LTS
Browser: Chrome 120.0.6099.216
Docker: Docker version 24.0.7, build afdd53b

Additional context


**Logs**

```
Execution finished with errors.
{'exec_id': 'c2584159-d0ad-483f-bceb-d2f664ea1fc5',
 'infos': ['2024-10-11 09:46:32.792952 INFO: Starting execution for task with name=RUN_INGEST',
           "2024-10-11 09:46:38.929017 INFO: Failed to execute 'datahub ingest', exit code 1",
           '2024-10-11 09:46:38.929279 INFO: Caught exception EXECUTING task_id=c2584159-d0ad-483f-bceb-d2f664ea1fc5, name=RUN_INGEST, '
           'stacktrace=Traceback (most recent call last):\n'
           '  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 139, in execute_task\n'
           '    task_event_loop.run_until_complete(task_future)\n'
           '  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
           '    return future.result()\n'
           '  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 400, in '
           'execute\n'
           '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
           "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
 'errors': []}

~~~~ Ingestion Logs ~~~~
Obtaining venv creation lock...
Acquired venv creation lock
venv is already set up
venv setup time = 0 sec
This version of datahub supports report-to functionality
+ exec datahub ingest run -c /tmp/datahub/ingest/c2584159-d0ad-483f-bceb-d2f664ea1fc5/recipe.yml --report-to /tmp/datahub/logs/c2584159-d0ad-483f-bceb-d2f664ea1fc5/artifacts/ingestion_report.json
[2024-10-11 09:46:37,997] INFO     {datahub.cli.ingest_cli:149} - DataHub CLI version: 0.14.0.4
[2024-10-11 09:46:38,007] INFO     {datahub.ingestion.run.pipeline:255} - No sink configured, attempting to use the default datahub-rest sink.
[2024-10-11 09:46:38,032] INFO     {datahub.ingestion.run.pipeline:272} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://datahub-gms:8080
[2024-10-11 09:46:38,424] ERROR    {datahub.entrypoints:218} - Command failed: Failed to find a registered source for type postgres: postgres is disabled; try running: pip install 'acryl-datahub[postgres]'
Traceback (most recent call last):
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/api/registry.py", line 126, in _ensure_not_lazy
    plugin_class = import_path(path)
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/api/registry.py", line 56, in import_path
    item = importlib.import_module(module_name)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/source/sql/postgres.py", line 6, in <module>
    import psycopg2  # noqa: F401
ModuleNotFoundError: No module named 'psycopg2'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 136, in _add_init_error_context
    yield
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 287, in __init__
    source_class = source_registry.get(self.source_type)
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/api/registry.py", line 176, in get
    raise ConfigurationError(
datahub.configuration.common.ConfigurationError: postgres is disabled; try running: pip install 'acryl-datahub[postgres]'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/entrypoints.py", line 205, in main
    sys.exit(datahub(standalone_mode=False, **kwargs))
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 462, in wrapper
    raise e
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 411, in wrapper
    res = func(*args, **kwargs)
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 203, in run
    ret = loop.run_until_complete(run_ingestion_and_check_upgrade())
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 172, in run_ingestion_and_check_upgrade
    pipeline = Pipeline.create(
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 417, in create
    return cls(
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 284, in __init__
    with _add_init_error_context(
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 140, in _add_init_error_context
    raise PipelineInitError(f"Failed to {step}: {e}") from e
datahub.ingestion.run.pipeline.PipelineInitError: Failed to find a registered source for type postgres: postgres is disabled; try running: pip install 'acryl-datahub[postgres]'
```

**Recipe**
```
run_id: 'urn:li:dataHubExecutionRequest:c2584159-d0ad-483f-bceb-d2f664ea1fc5'
source:
  type: postgres
  config:
    include_tables: true
    database: timeseries
    password: '${timeseries_188}'
    profiling:
      enabled: true
      profile_table_level_only: true
    host_port: '192.168.50.188:5432'
    include_views: true
    stateful_ingestion:
      enabled: true
    username: postgres
pipeline_name: 'urn:li:dataHubIngestionSource:89372b09-06a3-482f-8724-b0188479d56b'
```
pricingblock-project commented 1 month ago

Another log:

~~~~ Execution Summary - RUN_INGEST ~~~~
Execution finished with errors.
{'exec_id': 'cbfb6c8f-082b-49c5-9235-0487a0b7075d',
 'infos': ['2024-10-11 10:13:43.339194 INFO: Starting execution for task with name=RUN_INGEST',
           "2024-10-11 10:21:58.047478 INFO: Failed to execute 'datahub ingest', exit code 2",
           '2024-10-11 10:21:58.048656 INFO: Caught exception EXECUTING task_id=cbfb6c8f-082b-49c5-9235-0487a0b7075d, name=RUN_INGEST, '
           'stacktrace=Traceback (most recent call last):\n'
           '  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 139, in execute_task\n'
           '    task_event_loop.run_until_complete(task_future)\n'
           '  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
           '    return future.result()\n'
           '  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 400, in '
           'execute\n'
           '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
           "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
 'errors': []}

~~~~ Ingestion Logs ~~~~
Obtaining venv creation lock...
Acquired venv creation lock
venv doesn't exist.. minting..
Using Python 3.10.12 interpreter at: /usr/bin/python3
Creating virtualenv at: /tmp/datahub/ingest/venv-postgres-3cbb1ad0ed8a0388
Resolved 3 packages in 3.87s
Prepared 3 packages in 42.25s
Installed 3 packages in 774ms
 + pip==24.2
 + setuptools==75.1.0
 + wheel==0.44.0
+ uv pip install 'acryl-datahub[datahub-rest,datahub-kafka,postgres]==0.14.1'
Resolved 170 packages in 1m 14s
error: Failed to prepare distributions
  Caused by: Failed to fetch wheel: pandas==2.2.3
  Caused by: Failed to extract archive
  Caused by: Failed to download distribution due to network timeout. Try increasing UV_HTTP_TIMEOUT (current value: 30s).

docker logs -f datahub-datahub-gms-1 | grep postgres

2024-10-11 10:52:15,221 [ForkJoinPool.commonPool-worker-5] INFO  c.l.m.entity.EntityServiceImpl:947 - Ingesting aspects batch to database: AspectsBatchImpl{items=[ChangeMCP{changeType=UPSERT, urn=urn:li:dataHubIngestionSource:ebb4e10e-6006-4790-9998-2ee55ff30a62, aspectName='dataHubIngestionSourceInfo', recordTemplate={name=timeseries_188, schedule={timezone=Asia/Shanghai, interval=0 0 * * *}, type=postgres, config={recipe={"source":{"type":"postgres","config":{"host_port":"192.168.50.188:5432","database":"timeseries","username":"postgres","include_tables":true,"incl..., systemMetadata={lastObserved=1728643935210, version=1, properties={appSource=ui}}}, ChangeMCP{changeType=CREATE, urn=urn:li:dataHubIngestionSource:ebb4e10e-6006-4790-9998-2ee55ff30a62, aspectName='dataHubIngestionSourceKey', recordTemplate={id=ebb4e10e-6006-4790-9998-2ee55ff30a62}, systemMetadata={lastObserved=1728643935210, version=1, properties={appSource=ui}}}]}
2024-10-11 10:52:17,397 [ForkJoinPool.commonPool-worker-13] INFO  c.l.m.entity.EntityServiceImpl:947 - Ingesting aspects batch to database: AspectsBatchImpl{items=[ChangeMCP{changeType=UPSERT, urn=urn:li:dataHubExecutionRequest:1c634b5b-0111-4328-a7fc-4f74dfe62ff7, aspectName='dataHubExecutionRequestInput', recordTemplate={args={recipe={"run_id":"urn:li:dataHubExecutionRequest:1c634b5b-0111-4328-a7fc-4f74dfe62ff7","source":{"type":"postgres","config":{"include_tables":true,"database":"timeseries","password":"${timeseries_188}","profiling":{"enabled":true,"profile_table_l..., systemMetadata={lastObserved=1728643937386, version=1, properties={appSource=ui}}}, ChangeMCP{changeType=CREATE, urn=urn:li:dataHubExecutionRequest:1c634b5b-0111-4328-a7fc-4f74dfe62ff7, aspectName='dataHubExecutionRequestKey', recordTemplate={id=1c634b5b-0111-4328-a7fc-4f74dfe62ff7}, systemMetadata={lastObserved=1728643937386, version=1, properties={appSource=ui}}}]}
Creating virtualenv at: /tmp/datahub/ingest/venv-postgres-3cbb1ad0ed8a0388
Creating virtualenv at: /tmp/datahub/ingest/venv-postgres-3cbb1ad0ed8a0388
Creating virtualenv at: /tmp/datahub/ingest/venv-postgres-3cbb1ad0ed8a0388
Creating virtualenv at: /tmp/datahub/ingest/venv-postgres-3cbb1ad0ed8a0388
Creating virtualenv at: /tmp/datahub/ingest/venv-postgres-3cbb1ad0ed8a0388
Creating virtualenv at: /tmp/datahub/ingest/venv-postgres-3cbb1ad0ed8a0388
Creating virtualenv at: /tmp/datahub/ingest/venv-postgres-3cbb1ad0ed8a0388
david-leifker commented 1 month ago

Please check that your network can reach the PyPI repository, then restart the actions pod and retry. The relevant line in the log is: Caused by: Failed to download distribution due to network timeout.
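
A minimal sketch of the kind of network check suggested here, meant to be run with Python from inside the actions container. The `can_reach` helper and its `opener` parameter are illustrative assumptions, not part of DataHub. The 30-second default only mirrors the `UV_HTTP_TIMEOUT` value reported in the log.

```python
import urllib.request


def can_reach(url, timeout=30.0, opener=urllib.request.urlopen):
    """Return True if an HTTP GET to `url` succeeds within `timeout` seconds.

    `opener` is injectable so the function can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # urllib.error.URLError, HTTPError and socket timeouts all derive
        # from OSError, so DNS failures, refused connections, HTTP errors
        # and timeouts all end up here.
        return False


# Example call (performs a real request, so it is left commented out):
# print(can_reach("https://pypi.org/simple/pandas/"))
```

If the call returns `False` for PyPI but `True` for a mirror such as the one mentioned later in this thread, the timeout is more likely a connectivity problem than a DataHub problem.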

sprybee commented 1 month ago

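This is caused by the Docker container being unable to reach the PyPI repository. It can be worked around as follows:

  1. Enter the container: docker exec -it (container ID of datahub-actions:head) /bin/bash
  2. Manually run: pip3 install psycopg2-binary -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple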
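
After a manual install like this, it may be worth confirming that the driver is importable from the same interpreter that runs `datahub ingest` (the first traceback shows the venv at `/datahub-ingestion/.venv`). A small sketch, assuming it is run with that interpreter. `missing_modules` is a hypothetical helper name, not a DataHub function.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# postgres.py imports psycopg2 at module load time (see the traceback),
# so an empty list here means that import should no longer fail.
print(missing_modules(["psycopg2"]))
```

If `psycopg2` is still reported as missing, the package was probably installed into a different Python environment than the one used for ingestion.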

jjoyce0510 commented 3 weeks ago

Hi folks - did this end up working? If yes I'll go ahead and close the ticket.