datahub-project / datahub

The Metadata Platform for your Data Stack
https://datahubproject.io
Apache License 2.0

Ingestion via UI of BigQuery metadata #10571

Open ivanaorefice opened 1 month ago

ivanaorefice commented 1 month ago

Describe the bug I'm trying to ingest all metadata stored in BigQuery. Data in BigQuery is organized as follows:

I have a project, which I'll call my-test-project (just because of business constraints). Inside BigQuery I have 26 datasets, each with lots of tables, and I want to ingest metadata from every table belonging to those datasets. The dataset names are of the form published_dataset1, published_dataset2, and so on. I'm ingesting via the UI, with a recipe like this one:

      project_id: my_test_project
      dataset_pattern:
        allow:
          - my_test_project.published_dataset1
        deny: []
        ignoreCase: true
      table_pattern:
        allow:
          - '.*'
        deny: []
        ignoreCase: true

I wrote one recipe for every single dataset and it works every single time, except for one dataset, named published_kpi. I have absolutely no clue why it doesn't work. I tried to play with the table_pattern field, ingesting one single table at a time, and it doesn't work. I tried different recipe configurations, for example substituting dataset_pattern with schema_pattern, and it still doesn't work. Then I changed the dataset name in the recipe to published_kp (wrong, but intentional) and it doesn't work. Finally, I changed it to published_kpii (again wrong, but intentional) and it "works": the ingestion returns only the project because, of course, there is no published_kpii in my BigQuery to ingest.

This is why I think this is a bug, but I have no idea how to resolve it. The bug is, basically, that the ingestion from the UI starts but runs indefinitely. I launched it and let it go overnight, thinking the problem might be related to the amount of metadata in the dataset's tables, but the following morning the ingestion was still in status Running.
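For reference, dataset_pattern and table_pattern in the recipe are regex allow/deny filters. A minimal sketch of how such filtering behaves (a simplification for illustration, not DataHub's actual AllowDenyPattern implementation):

```python
import re

def allowed(name, allow, deny, ignore_case=True):
    """A name passes if it matches at least one allow pattern and no
    deny pattern; patterns are regexes anchored at the start of the name."""
    flags = re.IGNORECASE if ignore_case else 0
    if any(re.match(p, name, flags) for p in deny):
        return False
    return any(re.match(p, name, flags) for p in allow)

# With a recipe like the one above, only the listed dataset passes:
allow = ["my_test_project.published_dataset1"]
print(allowed("my_test_project.published_dataset1", allow, []))  # True
print(allowed("my_test_project.published_kpi", allow, []))       # False
```

Since these are regexes, a pattern like `my_test_project.published_kpi` also matches any dataset name that merely starts with that string, which is worth keeping in mind when debugging which datasets a recipe actually selects.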

To Reproduce Sorry, I can't give steps to reproduce this error, because it is very specific to how our data platform is set up.

Expected behavior The ingestion completes correctly in a finite amount of time. I would expect no more than 5 minutes for the amount of tables and metadata to ingest, but really I just want it to be ingested.

Screenshots I can't post any screenshots here because I can't show the project names, but the ingestion log stops when trying to ingest the dataset, here:

    [START LOG]
    [2024-05-22 13:51:04,526] INFO {datahub.ingestion.source.sql.sql_config:106} - Applying table_pattern {'allow': ['.*'], 'deny': [], 'ignoreCase': False} to view_pattern.
    [2024-05-22 13:51:04,527] WARNING {datahub.ingestion.source.bigquery_v2.bigquery_config:315} - project_id_pattern is not set but project_id is set, source will only ingest the project_id project. project_id will be deprecated, please use project_id_pattern instead.
    [2024-05-22 13:51:04,674] INFO {datahub.ingestion.source.state.checkpoint:145} - Successfully constructed last checkpoint state for job BigQuery_skip_redundant_run_usage with timestamp 2024-05-22 13:48:26.572000+00:00
    [2024-05-22 13:51:04,674] INFO {datahub.ingestion.source.state.redundant_run_skip_handler:221} - BigQuery_skip_redundant_run_usage : Last run start, end times:TimeWindow(start_time=datetime.datetime(2024, 5, 21, 0, 0, tzinfo=datetime.timezone.utc), end_time=datetime.datetime(2024, 5, 22, 13, 47, 32, 270000, tzinfo=datetime.timezone.utc))
    [2024-05-22 13:51:04,674] INFO {datahub.ingestion.source.state.redundant_run_skip_handler:221} - BigQuery_skip_redundant_run_usage : Reducing time window. Updating start time to 2024-05-22 13:47:32.270000+00:00.
    [2024-05-22 13:51:04,674] INFO {datahub.ingestion.source.state.redundant_run_skip_handler:221} - BigQuery_skip_redundant_run_usage : Adjusted start, end times: (2024-05-22 00:00:00+00:00, 2024-05-22 13:51:04.526892+00:00)
    [2024-05-22 13:51:04,676] INFO {datahub.ingestion.run.pipeline:255} - Source configured successfully.
    [2024-05-22 13:51:04,677] INFO {datahub.cli.ingest_cli:128} - Starting metadata ingestion
    [2024-05-22 13:51:04,678] INFO {datahub.ingestion.source.bigquery_v2.bigquery:590} - Getting projects
    [2024-05-22 13:51:04,678] INFO {datahub.ingestion.source.bigquery_v2.bigquery:570} - Processing project: my_test_project
    [END LOG]

Once again, I specify that the recipe works with every other dataset.

Desktop (please complete the following information):

OS: Windows 11
Browser: Chrome
Version: DataHub 0.13

ivanaorefice commented 1 month ago

Update, still running from yesterday:

[Screenshot: ingestion still in status Running]

ioreficedatareply commented 3 weeks ago

Update: I'll post some screenshots here where I tried to ingest one table belonging to published_kpi; after I moved that table to another dataset, the ingestion completes successfully.

This is the script we run to move one table from one dataset to another:

    create table `tlp-dataplatform-test.published_tolL.CB_TREND_BUSINESS` as
    select * from `tlp-dataplatform-test.published_kpi.CB_TREND_BUSINESS`

[Image 2024-06-13 121124]

[Image 2024-06-13 121206]

[Image 2024-06-13 121225]

And, finally, the table ingested in Datahub:

[Image 2024-06-13 121527]

ioreficedatareply commented 2 weeks ago

Does anybody have any suggestions?