datahub-project / datahub

The Metadata Platform for your Data Stack
https://datahubproject.io
Apache License 2.0

OOMKilled for BigQuery ingestion from UI #11597

Open edulodgify opened 1 week ago

edulodgify commented 1 week ago

Describe the bug

After running some tests with the "default" DataHub setup for k8s, everything installed with the helm chart, we have observed issues with BigQuery ingestion from the UI. All of these tests were done with a small dataset. We also tried running the ingestion from the CLI, since we will have to run it from third-party tools like Mage; in that case we have had no problems at all: sometimes it is a little slow, but it has always finished correctly. Note that we observed warnings with the message:

Cannot traverse scope _u_12.data_source with type '<class 'sqlglot.expressions.Column'>'

but this has never affected the ingestion and it has always finished. When we run the ingestion from the UI, we have had no problems with Tableau and dbt, but with BigQuery it gets "stuck" shortly after starting and never finishes. Moreover, no matter how we try to kill the process manually, we cannot get it to die; memory usage keeps increasing until it reaches the container limit and the container restarts.

[Screenshot: container memory usage]

The process is effectively stuck almost from the beginning: in the screenshot above, the container got stuck at 8:16, and the last execution log line we see is the following:

[eef09359-6c70-4abf-942a-3131df168b88 logs] [2024-10-10 08:16:02,521] WARNING {sqlglot.optimizer.scope:548} - Cannot traverse scope _u_12.data_source with type '<class 'sqlglot.expressions.Column'>'
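For anyone trying to reproduce the observation, this is roughly how the memory growth can be watched (a minimal sketch, assuming the chart's default datahub namespace and that metrics-server is installed; the pod name is the actions pod from our install, shown further down):

# List pods and find the actions pod that executes UI-triggered ingestion
kubectl -n datahub get pods | grep actions

# Sample the pod's memory every 30 seconds; during the UI-triggered BigQuery run
# the value keeps climbing until the container limit is hit and the pod restarts
watch -n 30 'kubectl -n datahub top pod datahub-acryl-datahub-actions-6bc87bfd9b-d78vl'

# After the restart, confirm that the reason was an OOM kill
kubectl -n datahub describe pod datahub-acryl-datahub-actions-6bc87bfd9b-d78vl | grep -A 3 'Last State'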

It looks like there is a memory leak or a misconfigured task. We do not think it is a lack of resources, because when we executed the same recipe manually from inside the container with the command

datahub ingest -c ./bigquery.yaml

whereas the DataHub UI runs the command

datahub ingest run -c /tmp/datahub/ingest/513534ba-0e6a-4d1c-a71a-84efd17d50a1/recipe.yml --report-to /tmp/datahub/ingest/513534ba-0e6a-4d1c-a71a-84efd17d50a1/ingestion_report.json

the manual pipeline finished without problems; in this screenshot you can see the resources it consumed:

[Screenshot: resources consumed by the manual CLI run]

Sink (datahub-rest) report:
{'total_records_written': 43,
'records_written_per_second': 0,
'warnings': [],
'failures': [],
'start_time': '2024-10-10 08:04:46.719552 (6 minutes and 5.01 seconds ago)',
'current_time': '2024-10-10 08:10:51.729373 (now)',
'total_duration_in_seconds': 365.01,
'max_threads': 15,
'gms_version': 'v0.14.0.2',
'pending_requests': 0,
'main_thread_blocking_timer': '0.063 seconds'}
Pipeline finished successfully; produced 43 events in 5 minutes and 59.12 seconds.
datahub@datahub-acryl-datahub-actions-6bc87bfd9b-d78vl:~$ exit

[Screenshot: manual ingest vs. UI ingest]

To Reproduce

Steps to reproduce the behavior:

  1. Go to ingestion
  2. Click on create new source
  3. Use the following YAML:

     source:
       type: bigquery
       config:
         include_table_lineage: true
         include_usage_statistics: true
         include_tables: true
         include_views: true
         profiling:
           enabled: true
           profile_table_level_only: true
         stateful_ingestion:
           enabled: false
         credential:
           project_id: lodgify-datalab-1
           private_key: "-----BEGIN PRIVATE KEY-----\nmysupersecurekey\n-----END PRIVATE KEY-----\n"
           private_key_id: privatekey
           client_email: datahub@random-project.iam.gserviceaccount.com
           client_id: '1111111111111111111111111'
         dataset_pattern:
           allow:
             - ^personio
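For comparison, this is roughly how we ran the same recipe manually from inside the actions container (a sketch: the namespace and the recipe path are assumptions from our setup; the pod name is the one shown in the terminal output above):

# Open a shell in the actions container
kubectl -n datahub exec -it datahub-acryl-datahub-actions-6bc87bfd9b-d78vl -- bash

# Inside the container, run the recipe with the CLI; this run finished in about
# six minutes and stayed well below the container memory limit
datahub ingest -c ./bigquery.yaml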

Expected behavior

The task does not get stuck, and memory usage does not keep increasing until it reaches the container limit and forces a restart.

Environment:

david-leifker commented 1 week ago

One thing to try: please run the ingestion via the CLI in the docker container, from the same venv that the UI process uses. This venv would be in /tmp/datahub/ingest/venv-<name of the source>-<other stuff>. Does the memory leak occur in this case?
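Something along these lines (a sketch; adjust the namespace and pod name to your install, and the venv directory name will differ per source):

# Shell into the actions container
kubectl -n datahub exec -it <actions-pod> -- bash

# Find the venv that the UI-triggered run created
ls /tmp/datahub/ingest/ | grep venv

# Activate it and re-run the recipe the UI generated for that run
source /tmp/datahub/ingest/venv-<name of the source>-<other stuff>/bin/activate
datahub ingest run -c /tmp/datahub/ingest/<run-id>/recipe.yml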

david-leifker commented 1 week ago

Possibly related #11147

edulodgify commented 4 days ago

Hi, I've read issue https://github.com/datahub-project/datahub/issues/11147 and yes, it is probably the same issue. I'll add my anonymized logs to both tickets, in case they help identify the memory issue: acryl.log