datahub-project / datahub

The Metadata Platform for your Data Stack
https://datahubproject.io
Apache License 2.0

OOM (Out of Memory Errors) in DataHub 0.13+ when ingesting Bigquery metadata #11147

Open vejeta opened 1 month ago

vejeta commented 1 month ago

Describe the bug

We have an ingestion job running periodically in Kubernetes; it runs fine with DataHub 0.12.x versions.

[Screenshot: memory usage during the job on DataHub 0.12.0]

You can see the memory stays stable under 1GiB during the execution of the job.

However, with DataHub 0.13.x versions the job always fails with error 137 (out of memory).

[Screenshot: job terminated with error 137 (OOM)]

We have tried increasing the memory limit to 20GiB, but there must be a memory leak, because the job always runs out of memory.
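To make the growth visible from inside the container, something like the following minimal sketch (plain Python standard library; the 60-second interval and the log format are arbitrary placeholders, not part of our actual job) can log peak RSS alongside the ingestion:

import resource
import threading
import time

def log_peak_rss(interval_seconds: int = 60) -> None:
    # ru_maxrss is reported in kilobytes on Linux.
    while True:
        peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print(f"[mem-watch] peak RSS: {peak_kb / 1024:.1f} MiB", flush=True)
        time.sleep(interval_seconds)

# Start the watcher as a daemon thread before kicking off the ingestion,
# so the job logs show whether memory keeps climbing until the 137 kill.
threading.Thread(target=log_peak_rss, daemon=True).start()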

To Reproduce

Steps to reproduce the behavior:

Note: we have also tried a build from the latest commit on master, and the issue is still present.
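For reference, the job is essentially a BigQuery source feeding a datahub-rest sink. A minimal sketch of that kind of pipeline via DataHub's programmatic ingestion API is below; the project ID, GMS URL, and the lineage/usage flags are placeholders standing in for our real recipe, which is not included here:

from datahub.ingestion.run.pipeline import Pipeline

# Placeholder recipe approximating the setup: BigQuery source with lineage and
# usage extraction, writing to a datahub-rest sink.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "bigquery",
            "config": {
                "project_ids": ["my-gcp-project"],  # placeholder
                "include_table_lineage": True,
                "include_usage_statistics": True,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms:8080"},  # placeholder
        },
    }
)
pipeline.run()
pipeline.raise_from_status()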

Expected behavior

No out-of-memory error in DataHub 0.13.3.



Additional context

Log summary after a successful execution with DataHub 0.12.x:

 'num_total_lineage_entries': {'host': 108696},
 'num_skipped_lineage_entries_missing_data': {'host': 14177},
 'num_skipped_lineage_entries_not_allowed': {'host': 90366},
 'num_lineage_entries_sql_parser_failure': {'host': 147},
 'num_skipped_lineage_entries_other': {},
 'num_lineage_total_log_entries': {'*': 108696},
 'num_lineage_parsed_log_entries': {'*': 108696},
 'lineage_metadata_entries': {'*': 99},
 'num_usage_total_log_entries': {'*': 175145},
 'num_usage_parsed_log_entries': {'*': 122613},
 'sampled': '10 sampled of at most 2880 entries.'},
 'total_query_log_entries': 0,
 'audit_log_api_perf': {'get_exported_log_entries': '397.101 seconds', 'list_log_entries': None},
 'num_total_lineage_entries': {'host': 108696},
 'num_skipped_lineage_entries_missing_data': {'host': 14177},
 'num_skipped_lineage_entries_not_allowed': {'host': 90366},
 'num_lineage_entries_sql_parser_failure': {'host': 147},
 'num_skipped_lineage_entries_other': {},
 'num_lineage_total_log_entries': {'*': 108696},
 'num_lineage_parsed_log_entries': {'*': 108696},
 'lineage_metadata_entries': {'*': 99},
 'num_usage_total_log_entries': {'*': 175145},
 'num_usage_parsed_log_entries': {'*': 122613},
-----
 'sampled': '10 sampled of at most 3059 entries.'},
 'partition_info': {},
----
 'lineage_start_time': '2024-08-10 00:00:00+00:00 (1 day, 3 hours and 26 minutes ago)',
 'lineage_end_time': '2024-08-11 00:00:08.942642+00:00 (3 hours, 26 minutes and 30.95 seconds ago)',
 'stateful_lineage_ingestion_enabled': "default=True description='Enable stateful lineage ingestion. This will store lineage window timestamps after "
                                       "successful lineage ingestion. and will not run lineage ingestion for same timestamps in subsequent run. '"
                                       'extra={}',
 'usage_start_time': '2024-08-10 00:00:00+00:00 (1 day, 3 hours and 26 minutes ago)',
 'usage_end_time': '2024-08-11 00:00:08.942642+00:00 (3 hours, 26 minutes and 30.95 seconds ago)',
 'stateful_usage_ingestion_enabled': True,
 'start_time': '2024-08-11 00:00:08.984329 (3 hours, 26 minutes and 30.91 seconds ago)',
 'running_time': '3 hours, 26 minutes and 30.91 seconds'}
Sink (datahub-rest) report:
{'total_records_written': 33546,
 'records_written_per_second': 2,
 'warnings': [],
 'failures': [],
 'start_time': '2024-08-11 00:00:03.217296 (3 hours, 26 minutes and 36.68 seconds ago)',
 'current_time': '2024-08-11 03:26:39.897462 (now)',
 'total_duration_in_seconds': 12396.68,
 'gms_version': 'v0.13.0',
 'pending_requests': 0}
{}
Pipeline finished with at least 5 warnings; produced 33546 events in 3 hours, 26 minutes and 30.91 seconds.
vejeta commented 3 weeks ago

Same behaviour in 0.14.x (tested in 0.14.0.1, 0.14.0.2)