datahub-project / datahub

The Metadata Platform for your Data Stack
https://datahubproject.io
Apache License 2.0

Redshift Ingestion broken in 0.13.2 #10435

Open joshua-pgatour opened 6 months ago

joshua-pgatour commented 6 months ago

Describe the bug


Execution finished with errors.
{'exec_id': '2236cd44-90eb-4781-9563-05ea51f4bbd4',
 'infos': ['2024-05-03 19:27:34.726045 INFO: Starting execution for task with name=RUN_INGEST',
           '2024-05-03 19:55:40.923833 INFO: Caught exception EXECUTING task_id=2236cd44-90eb-4781-9563-05ea51f4bbd4, name=RUN_INGEST, '
           'stacktrace=Traceback (most recent call last):\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 140, in execute_task\n'
           '    task_event_loop.run_until_complete(task_future)\n'
           '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
           '    return future.result()\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 282, in execute\n'
           '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
           "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
 'errors': ['2024-05-03 19:55:40.923635 ERROR: The ingestion process was killed by signal SIGKILL likely because it ran out of memory. You can '
            'resolve this issue by allocating more memory to the datahub-actions container.']}

I have tried increasing the memory up to 32 GB and I still get this error. I've turned off lineage and profiling, and have even tried ingesting just tables, no views.
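
For reference, a minimal sketch of how that memory ceiling is typically raised, assuming a docker-compose deployment; the service name and limit here are illustrative, not taken from my actual setup:

```yml
# Illustrative docker-compose override for the actions container;
# the service name and memory value are assumptions, not this deployment's.
services:
  datahub-actions:
    deploy:
      resources:
        limits:
          memory: 32g
```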

Occasionally I will get this error instead of the memory one:

  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 391, in _batch_workunits_by_urn
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 184, in auto_materialize_referenced_tags
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 91, in auto_status_aspect
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/redshift.py", line 468, in get_workunits_internal
    yield from self.extract_lineage(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/redshift.py", line 987, in extract_lineage
    lineage_extractor.populate_lineage(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/lineage.py", line 659, in populate_lineage
    table_renames, all_tables_set = self._process_table_renames(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/lineage.py", line 872, in _process_table_renames
    all_tables[database][schema].add(prev_name)
KeyError: 'pgat_competitions_x'

In this case it seems to be trying to reference a schema name that I have filtered out in the ingest recipe.
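
To illustrate what appears to be happening, here is a simplified sketch (not the actual DataHub code) of how a table rename pointing at a filtered-out schema could blow up, and how a guarded lookup would avoid it:

```python
# Simplified illustration; names and structure are approximations of
# _process_table_renames, not the real implementation.
all_tables = {"pgat": {"public": set()}}  # only schemas that passed the filter

# A rename whose previous name lives in a schema the recipe filtered out:
database, schema, prev_name = "pgat", "pgat_competitions_x", "old_table"

# all_tables[database][schema].add(prev_name)  # KeyError: 'pgat_competitions_x'

# Guarding the lookup (roughly what a fix might do) avoids the crash:
all_tables.setdefault(database, {}).setdefault(schema, set()).add(prev_name)
```
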
hsheth2 commented 6 months ago

@joshua-pgatour what CLI version is this with?

I merged a fix related to this in https://github.com/datahub-project/datahub/pull/9967

joshua-pgatour commented 6 months ago

Thank you for the reply. I have tried going back one version at a time through the datahub-actions Docker Hub releases, and the KeyError seems to stop happening around v10. However, I still have a memory issue: I have pretty much maxed out the memory my pod can be given, and ingestion still fails with SIGKILL. Any suggestions on getting around this? Here is my current recipe:

```yml
source:
  type: redshift
  config:
    host_port: ''
    database: pgat
    username:
    table_lineage_mode: mixed
    include_table_lineage: false
    include_tables: true
    include_views: false
    profiling:
      enabled: false
      profile_table_level_only: false
    stateful_ingestion:
      enabled: true
    password: '${redshift_secret2}'
    schema_pattern:
      allow:
```

joshua-pgatour commented 6 months ago

So I figured out how to change the CLI version in the ingest recipe. Sorry, I thought it was controlled by the datahub-actions container version; I didn't realize it was set in the recipe. The 0.10.5.1 CLI works fine on 16 GB of memory and there is no KeyError. I will experiment to find the point at which this breaks, but I have to believe there's a memory leak in newer versions.

joshua-pgatour commented 6 months ago

I can confirm that v0.13.2 has the memory problem. v0.13.1 works, but in my testing the ingest process has slowed significantly since 0.12.

github-actions[bot] commented 4 months ago

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

hsheth2 commented 4 months ago

@joshua-pgatour regarding memory utilization - this should help https://github.com/datahub-project/datahub/pull/10691

If the issue persists, it'd be helpful to have a memory profile generated as per https://datahubproject.io/docs/metadata-ingestion/docs/dev_guides/profiling_ingestions/
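
If I remember that guide correctly, profiling is enabled from the recipe itself via a top-level `flags` section; a sketch, assuming the flag name from the docs and a placeholder output path:

```yml
# Sketch per the linked profiling guide; verify the flag name against
# the docs. The output path is a placeholder.
source:
  type: redshift
  config:
    # ... existing source config ...
flags:
  generate_memory_profiles: "/tmp/datahub-memory-profiles"
```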