datahub-project / datahub

The Metadata Platform for your Data Stack
https://datahubproject.io
Apache License 2.0
9.67k stars 2.86k forks source link

BigQuery ingestion suceed but datasets not shown in UI #10872

Closed Haebuk closed 2 months ago

Haebuk commented 2 months ago

Hello,

I am encountering an issue with DataHub where BigQuery ingestion is reported as successful, but the BigQuery platform and datasets are not visible in the DataHub UI. Other platforms like MongoDB are ingested successfully and visible in the UI.

{ "cli": { "cli_version": "0.13.1.2", "py_version": "3.10.13", "os_details": "Linux-5.10.213-201.855.amzn2.x86_64-x86_64-with-glibc2.36", "mem_info": "396.41 MB", "peak_memory_usage": "396.41 MB", "disk_info": { "total": "53.67 GB", "used": "21.77 GB", "free": "31.9 GB" }, "peak_disk_usage": "21.86 GB", "thread_count": 1, "peak_thread_count": 16 }, "source": { "type": "bigquery", "report": { "num_tables_fetch_sample_values_failed": 0, "num_tables_classification_attempted": 0, "num_tables_classification_failed": 0, "num_tables_classification_found": 0, "window_end_time": "2024-07-09 08:22:39.181089+00:00", "window_start_time": "2024-07-08 00:00:00+00:00", "events_produced": 14952, "events_produced_per_sec": 20, "entities": { "container": [ "sampled of 96 total elements"], "dataset": [ "sampled of 1559 total elements"], "tag": ["urn:li:tag:CLUSTERING_COLUMN_1", "urn:li:tag:CLUSTERING_COLUMN_2", "urn:li:tag:CLUSTERING_COLUMN_3", "urn:li:tag:CLUSTERING_COLUMN_4", "urn:li:tag:PARTITION_KEY"] }, "aspects": { "container": { "containerProperties": 96, "status": 96, "dataPlatformInstance": 96, "subTypes": 96, "browsePathsV2": 96, "container": 95 }, "dataset": { "status": 1559, "schemaMetadata": 1545, "datasetProperties": 1545, "container": 1545, "dataPlatformInstance": 1545, "subTypes": 1545, "browsePathsV2": 1545, "viewProperties": 19, "datasetProfile": 1433, "operation": 1637, "datasetUsageStatistics": 275, "upstreamLineage": 179 }, "tag": { "tagKey": 5 } }, "warnings": { "profile skipped as partitioned table is empty or partition id or type was invalid": ["sampled of 93 total elements"] }, "failures": {}, "soft_deleted_stale_entities": ["sampled of 14 total elements"], "tables_scanned": 1526, "views_scanned": 19, "entities_profiled": 1433, "filtered": ["sampled of 370 total elements"], "num_view_definitions_parsed": 19, "num_view_definitions_failed_parsing": 0, "num_view_definitions_failed_column_parsing": 0, "view_definitions_parsing_failures": [], "profiling_skipped_not_updated": {}, "profiling_skipped_size_limit": { "top 10 of total 30 entries": 97 }, "profiling_skipped_row_limit": { "top 10 of total 22 entries": 40 }, "profiling_skipped_other": {}, "num_tables_not_eligible_profiling": {}, "num_total_lineage_entries": {"": 240}, "num_skipped_lineage_entries_missing_data": {}, "num_skipped_lineage_entries_not_allowed": {"": 24}, "num_lineage_entries_sql_parser_failure": {}, "num_skipped_lineage_entries_other": {}, "num_lineage_total_log_entries": {"": 240}, "num_lineage_parsed_log_entries": {"": 240}, "num_lineage_log_parse_failures": {}, "lineage_failed_extraction": [], "lineage_metadata_entries": {"": 163}, "lineage_mem_size": {"": "72.4 KB"}, "lineage_extraction_sec": {"": 24.77}, "usage_extraction_sec": {"": 43.97}, "num_usage_total_log_entries": {"": 21738}, "num_usage_parsed_log_entries": {"": 21721}, "usage_error_count": {}, "num_usage_resources_dropped": 368, "num_usage_operations_dropped": 149, "operation_dropped": [], "usage_failed_extraction": [], "num_project_datasets_to_scan": {"": 97}, "metadata_extraction_sec": { "top 10 of total 95 entries": 129.98999999999998 }, "include_table_lineage": true, "use_date_sharded_audit_log_tables": false, "log_page_size": 1000, "use_exported_bigquery_audit_metadata": false, "log_entry_start_time": "2024-07-09 07:08:59.489000+00:00", "log_entry_end_time": "2024-07-09 08:37:39.181089+00:00", "audit_start_time": "2024-07-08 23:45:00+00:00", "audit_end_time": "2024-07-09 08:37:39.181089+00:00", "upstream_lineage": {}, "partition_info": {}, "profile_table_selection_criteria": {}, "selected_profile_tables": {}, "profiling_skipped_invalid_partition_ids": {}, "profiling_skipped_invalid_partition_type": {}, "profiling_skipped_partition_profiling_disabled": [], "num_usage_workunits_emitted": 275, "total_query_log_entries": 0, "num_read_events": 10732, "num_query_events": 10989, "num_view_query_events": 9, "num_view_query_events_failed_sql_parsing": 0, "num_view_query_events_failed_table_identification": 0, "num_filtered_read_events": 17, "num_filtered_query_events": 0, "num_usage_query_hash_collisions": 0, "num_operational_stats_workunits_emitted": 1637, "snapshots_scanned": 0, "read_reasons_stat": {"JOB": 10732}, "operation_types_stat": { "SELECT": 7823, "CREATE_TABLE_AS_SELECT": 1297, "MERGE": 2815, "DELETE": 28, "INSERT": 24, "ALTER_TABLE": 2, "CREATE_TABLE": 4, "TRUNCATE_TABLE": 2 }, "schema_api_perf": { "list_projects": "0.600 seconds", "list_datasets": "0.821 seconds", "get_columns_for_dataset": "70.436 seconds", "get_tables_for_dataset": "298.142 seconds", "list_tables": "10.798 seconds", "get_views_for_dataset": "105.928 seconds", "get_snapshots_for_dataset": "103.109 seconds" }, "audit_log_api_perf": { "list_log_entries": "33.518 seconds" }, "processing_perf": { "sql_parsing_sec": "0.552 seconds", "store_usage_event_sec": "6.489 seconds", "usage_state_size": "{'main': '44.1 MB', 'queries': '13.28 MB'}" }, "lineage_start_time": "2024-07-09 07:23:59.489000+00:00", "lineage_end_time": "2024-07-09 08:22:39.181089+00:00", "stateful_lineage_ingestion_enabled": true, "usage_start_time": "2024-07-09 00:00:00+00:00", "usage_end_time": "2024-07-09 08:22:39.181089+ "usage_end_time": "2024-07-09 08:22:39.181089+00:00", "stateful_usage_ingestion_enabled": true, "start_time": "2024-07-09 08:22:39.198077", "running_time": "12 minutes and 13.31 seconds" } }, "sink": { "type": "datahub-rest", "report": { "total_records_written": 14952, "records_written_per_second": 20, "warnings": [], "failures": [], "start_time": "2024-07-09 08:22:32.441719", "current_time": "2024-07-09 08:34:52.516074", "total_duration_in_seconds": 740.07, "max_threads": 15, "gms_version": "v0.10.4", "pending_requests": 0 } } }

[earlier logs truncated...] [2024-07-09 08:32:55,675] INFO {datahub.ingestion.source.bigquery_v2.bigquery:711} - Starting profiling project [2024-07-09 08:32:55,676] INFO {datahub.ingestion.source.bigquery_v2.bigquery_report:160} - Time spent in stage <: Metadata Extraction at 2024-07-09 08:22:39.951665+00:00>: 615.72 seconds

[2024-07-09 08:33:03,712] INFO {datahub.ingestion.source.bigquery_v2.bigquery_report:160} - Time spent in stage <: Profiling at 2024-07-09 08:32:55.676658+00:00>: 8.04 seconds [2024-07-09 08:33:03,774] INFO {datahub.ingestion.source.bigquery_v2.usage:880} - Start loading log entries from BigQuery for with start_time=2024-07-08 23:45:00+00:00 and end_time=2024-07-09 08:37:39.181089+00:00 [2024-07-09 08:33:06,900] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 0 log entries from GCP Log for [2024-07-09 08:33:07,189] WARNING {py.warnings:109} - /tmp/datahub/ingest/venv-bigquery-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/bigquery_audit.py:185: DeprecationWarning: Call to deprecated class method create_from_string. (prefer .from_string) dataset_urn = DatasetUrn.create_from_string(urn)

[2024-07-09 08:33:07,191] WARNING {py.warnings:109} - /tmp/datahub/ingest/venv-bigquery-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/bigquery_audit.py:186: DeprecationWarning: Call to deprecated method get_dataset_name. (Use .name instead) -- Deprecated since version 0.12.0.2. split = dataset_urn.get_dataset_name().rsplit(".", 3)

[2024-07-09 08:33:08,961] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 1000 log entries from GCP Log for [2024-07-09 08:33:10,684] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 2000 log entries from GCP Log for [2024-07-09 08:33:12,679] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 3000 log entries from GCP Log for [2024-07-09 08:33:14,632] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 4000 log entries from GCP Log for [2024-07-09 08:33:16,618] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 5000 log entries from GCP Log for [2024-07-09 08:33:18,881] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 6000 log entries from GCP Log for [2024-07-09 08:33:20,909] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 7000 log entries from GCP Log for [2024-07-09 08:33:22,892] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 8000 log entries from GCP Log for [2024-07-09 08:33:24,381] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 9000 log entries from GCP Log for [2024-07-09 08:33:25,898] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 10000 log entries from GCP Log for [2024-07-09 08:33:27,540] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 11000 log entries from GCP Log for [2024-07-09 08:33:29,629] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 12000 log entries from GCP Log for [2024-07-09 08:33:31,660] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 13000 log entries from GCP Log for [2024-07-09 08:33:33,618] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 14000 log entries from GCP Log for [2024-07-09 08:33:35,748] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 15000 log entries from GCP Log for [2024-07-09 08:33:37,759] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 16000 log entries from GCP Log for [2024-07-09 08:33:39,464] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 17000 log entries from GCP Log for [2024-07-09 08:33:41,468] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 18000 log entries from GCP Log for [2024-07-09 08:33:44,196] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 19000 log entries from GCP Log for [2024-07-09 08:33:45,851] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 20000 log entries from GCP Log for [2024-07-09 08:33:47,391] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 21000 log entries from GCP Log for [2024-07-09 08:33:47,678] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:137} - Finished loading log entries from GCP Log for [2024-07-09 08:33:47,679] INFO {datahub.ingestion.source.bigquery_v2.usage:483} - Total number of events aggregated = 21345. [2024-07-09 08:33:47,679] INFO {datahub.ingestion.source.bigquery_v2.usage:486} - Total number of read events generated = 0. [2024-07-09 08:33:48,300] INFO {datahub.ingestion.source.bigquery_v2.bigquery_report:160} - Time spent in stage <: Usage Extraction Ingestion at 2024-07-09 08:33:03.713020+00:00>: 44.59 seconds

[2024-07-09 08:34:04,271] INFO {datahub.ingestion.source.bigquery_v2.bigquery_report:160} - Time spent in stage <: Usage Extraction Operational Stats at 2024-07-09 08:33:48.300941+00:00>: 15.97 seconds [2024-07-09 08:34:11,073] INFO {datahub.ingestion.source.bigquery_v2.bigquery_report:160} - Time spent in stage <: Usage Extraction Usage Aggregation at 2024-07-09 08:34:04.271522+00:00>: 6.8 seconds [2024-07-09 08:34:11,074] INFO {datahub.ingestion.source.bigquery_v2.lineage:337} - Generate lineage for [2024-07-09 08:34:11,074] INFO {datahub.ingestion.source.bigquery_v2.lineage:713} - Entering create lineage map function [2024-07-09 08:34:11,075] INFO {datahub.ingestion.source.bigquery_v2.lineage:640} - Populating lineage info via exported GCP audit logs [2024-07-09 08:34:11,140] INFO {datahub.ingestion.source.bigquery_v2.lineage:643} - Start loading log entries from BigQuery for with start_time=2024-07-09 07:08:59.489000+00:00 and end_time=2024-07-09 08:37:39.181089+00:00 [2024-07-09 08:34:13,271] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:126} - Loaded 0 log entries from GCP Log for [2024-07-09 08:34:35,839] INFO {datahub.ingestion.source.bigquery_v2.bigquery_audit_log_api:137} - Finished loading log entries from GCP Log for [2024-07-09 08:34:35,839] INFO {datahub.ingestion.source.bigquery_v2.lineage:798} - Exiting create lineage map function [2024-07-09 08:34:35,839] INFO {datahub.ingestion.source.bigquery_v2.lineage:358} - Built lineage map containing 163 entries. [2024-07-09 08:34:36,174] INFO {datahub.ingestion.source.state.checkpoint:145} - Successfully constructed last checkpoint state for job ingest_from_bigquery_source with timestamp 2024-07-09 07:46:57.329000+00:00 [2024-07-09 08:34:36,175] INFO {datahub.ingestion.source.state.stale_entity_removal_handler:180} - Soft-deleting stale entity - urn:li:dataset:(urn:li:dataPlatform:bigquery,....,PROD) [2024-07-09 08:34:36,176] INFO {datahub.ingestion.source.state.stale_entity_removal_handler:180} - Soft-deleting stale entity - urn:li:dataset:(urn:li:dataPlatform:bigquery,....,PROD) [2024-07-09 08:34:36,184] INFO {datahub.ingestion.source.state.stale_entity_removal_handler:180} - Soft-deleting stale entity - urn:li:dataset:(urn:li:dataPlatform:bigquery,....,PROD) [2024-07-09 08:34:36,207] INFO {datahub.ingestion.source.state.stale_entity_removal_handler:180} - Soft-deleting stale entity - urn:li:dataset:(urn:li:dataPlatform:bigquery,....,PROD) [2024-07-09 08:34:36,216] INFO {datahub.ingestion.source.state.stale_entity_removal_handler:180} - Soft-deleting stale entity - urn:li:dataset:(urn:li:dataPlatform:bigquery,....,PROD) [2024-07-09 08:34:36,226] INFO {datahub.ingestion.source.state.stale_entity_removal_handler:180} - Soft-deleting stale entity - urn:li:dataset:(urn:li:dataPlatform:bigquery,....,PROD) [2024-07-09 08:34:36,231] INFO {datahub.ingestion.source.state.stale_entity_removal_handler:180} - Soft-deleting stale entity - urn:li:dataset:(urn:li:dataPlatform:bigquery,....,PROD) [2024-07-09 08:34:36,231] INFO {datahub.ingestion.source.state.stale_entity_removal_handler:180} - Soft-deleting stale entity - urn:li:dataset:(urn:li:dataPlatform:bigquery,....,PROD) [2024-07-09 08:34:36,231] INFO {datahub.ingestion.source.state.stale_entity_removal_handler:180} - Soft-deleting stale entity - urn:li:dataset:(urn:li:dataPlatform:bigquery,....,PROD) [2024-07-09 08:34:36,232] INFO {datahub.ingestion.source.state.stale_entity_removal_handler:180} - Soft-deleting stale entity - urn:li:dataset:(urn:li:dataPlatform:bigquery,....,PROD) [2024-07-09 08:34:36,232] INFO {datahub.ingestion.source.state.stale_entity_removal_handler:180} - Soft-deleting stale entity - urn:li:dataset:(urn:li:dataPlatform:bigquery,....,PROD) [2024-07-09 08:34:36,232] INFO {datahub.ingestion.source.state.stale_entity_removal_handler:180} - Soft-deleting stale entity - urn:li:dataset:(urn:li:dataPlatform:bigquery,....,PROD) [2024-07-09 08:34:36,233] INFO {datahub.ingestion.source.state.stale_entity_removal_handler:180} - Soft-deleting stale entity - urn:li:dataset:(urn:li:dataPlatform:bigquery,....,PROD) [2024-07-09 08:34:36,233] INFO {datahub.ingestion.source.state.stale_entity_removal_handler:180} - Soft-deleting stale entity - urn:li:dataset:(urn:li:dataPlatform:bigquery,....,PROD) [2024-07-09 08:34:49,132] INFO {datahub.ingestion.run.pipeline:504} - Processing commit request for DatahubIngestionCheckpointingProvider. Commit policy = CommitPolicy.ON_NO_ERRORS, has_errors=False, has_warnings=True [2024-07-09 08:34:51,911] INFO {datahub.ingestion.run.pipeline:524} - Successfully committed changes for DatahubIngestionCheckpointingProvider. [2024-07-09 08:34:51,916] INFO {datahub.ingestion.reporting.file_reporter:52} - Wrote SUCCESS report successfully to <_io.TextIOWrapper name='/tmp/datahub/ingest/09f6853a-d4a2-4e39-a93e-6879ec25f7ca/ingestion_report.json' mode='w' encoding='UTF-8'> [2024-07-09 08:34:51,917] INFO {datahub.cli.ingest_cli:141} - Finished metadata ingestion



-**ScreenShots**:
- Although it is currently running during the re-execution attempt, the logs indicate that it has successfully completed
![image](https://github.com/datahub-project/datahub/assets/68543150/25ec0358-19a2-4c23-86f9-1f1eef4e8daa)

- Only Mongodb Platform Shown
![image](https://github.com/datahub-project/datahub/assets/68543150/a749531f-3e34-45f5-86cb-ce49653f86c9)

I would appreciate any advice on how to resolve this issue.

Thank you.
treff7es commented 2 months ago

Please, can you check the ingestion summary in the logs? It should show you how many dataset were ingested and also it should show you some sample dataset which was ingested.

Haebuk commented 2 months ago

I resolved this issue (but I don't know why) so closed it.