MarquezProject / marquez

Collect, aggregate, and visualize a data ecosystem's metadata
https://marquezproject.ai
Apache License 2.0

No datasets found and only one job is shown #2951

Closed DonghaLim closed 3 weeks ago

DonghaLim commented 3 weeks ago

Hello, this is Dongha. I have a question about connecting Marquez to Airflow. I use Airflow 2.7.3, and Marquez runs on EKS. The Marquez web UI works well and shows the list of DAGs that Airflow runs.

However, I can't see any datasets, and only one job is shown. Is there anything I need to check?

Please let me know

thanks.

I can't upload any screenshots, but here is an event payload from the Events menu:

    {
      "eventType": "COMPLETE",
      "eventTime": "2024-10-25T09:05:44.803117Z",
      "run": {
        "runId": "f3cc61c0-7c19-36d0-9e22-3a2864...",
        "facets": {
          "nominalTime": null,
          "parent": null
        }
      },
      "job": {
        "namespace": "airflow",
        "name": "biz_daily",
        "facets": {
          "documentation": null,
          "sourceCodeLocation": null,
          "sql": null,
          "jobType": null
        }
      },
      "inputs": [],
      "outputs": [],
      "producer": "https://github.com/apache/airf...",
      "schemaURL": "https://openlineage.io/spec/1-..."
    }

boring-cyborg[bot] commented 3 weeks ago

Thanks for opening your first issue in the Marquez project! Please be sure to follow the issue template!

phixMe commented 3 weeks ago

What Airflow operators are you using to run your task? The datasets, if available, would be in either the inputs or outputs properties of the OpenLineage event... I would ping the OpenLineage Slack community. @mobuchowski is the leading expert on the Airflow/OpenLineage integration side.
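To make the check described above concrete, here is a minimal Python sketch that reports whether an event carries any datasets. It assumes only that datasets appear under the top-level `inputs` and `outputs` keys and that each one has a `namespace` and a `name`. The helper names `summarize_datasets` and `has_lineage`, and the sample dataset `public.orders`, are hypothetical and not part of Marquez or OpenLineage.

```python
import json


def summarize_datasets(event: dict) -> dict:
    """Return the dataset identifiers found in an OpenLineage run event."""
    def names(key: str) -> list:
        # A missing or null key is treated the same as an empty list.
        return [f"{d.get('namespace')}/{d.get('name')}" for d in event.get(key) or []]

    return {"inputs": names("inputs"), "outputs": names("outputs")}


def has_lineage(event: dict) -> bool:
    """True if the event lists at least one input or output dataset."""
    summary = summarize_datasets(event)
    return bool(summary["inputs"] or summary["outputs"])


# Shaped like the payload posted in this issue: no datasets at all.
empty_event = json.loads(
    '{"eventType": "COMPLETE",'
    ' "job": {"namespace": "airflow", "name": "biz_daily"},'
    ' "inputs": [], "outputs": []}'
)

# Hypothetical event from an operator that does report an output dataset.
populated_event = {
    "eventType": "COMPLETE",
    "job": {"namespace": "airflow", "name": "biz_daily"},
    "inputs": [],
    "outputs": [{"namespace": "postgres://db:5432", "name": "public.orders"}],
}

print(has_lineage(empty_event))
print(summarize_datasets(populated_event))
```

If `has_lineage` is false for every event a task emits, the gap is on the producer side (the Airflow integration), not in how Marquez renders the data.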

DonghaLim commented 3 weeks ago

We are using several operators, listed below.

  1. PythonOperator
  2. BashOperator
  3. PostgresOperator
  4. SparkSubmitOperator
  5. EmptyOperator
  6. SQLExecuteQueryOperator
  7. AthenaOperator
  8. BaseOperator

How can I add inputs or outputs? From what I saw in the documentation, it seems that we can add "inlets" or "outlets". Additionally, we are using apache-airflow-providers-openlineage 1.2.0.

DonghaLim commented 3 weeks ago

According to the list of supported operators in the documentation, PythonOperator and BashOperator are supported from 1.4.0. However, even after I upgraded apache-airflow-providers-openlineage to 1.4.0, it still doesn't work.

DonghaLim commented 3 weeks ago

[2024-10-28, 08:49:55 UTC] {configuration.py:1050} WARNING - section/key [openlineage/disabled_for_operators] not found in config
[2024-10-28, 08:49:55 UTC] {manager.py:105} WARNING - Failed to extract metadata using found extractor <airflow.providers.openlineage.extractors.bash.BashExtractor object at 0x7f19c454f550> - section/key [openlineage/disabled_for_operators] not found in config task_type=BashOperator airflow_dag_id=biz_energy_service_daily_0.0.1 task_id=clean_output_path_conformed_daily_dim_energy_service_installed_apps_sparse airflow_run_id=scheduled__2024-10-24T00:00:00+00:00
[2024-10-28, 08:49:55 UTC] {configuration.py:1050} WARNING - section/key [openlineage/config_path] not found in config
[2024-10-28, 08:49:55 UTC] {utils.py:408} WARNING - section/key [openlineage/config_path] not found in config

DonghaLim commented 3 weeks ago

[2024-10-28, 08:51:48 UTC] {base.py:152} WARNING - OpenLineage provider method failed to extract data from provider.
[2024-10-28, 08:51:48 UTC] {configuration.py:1050} WARNING - section/key [openlineage/config_path] not found in config
[2024-10-28, 08:51:48 UTC] {utils.py:408} WARNING - section/key [openlineage/config_path] not found in config
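The warnings in the log excerpts above repeat the same message for a small number of configuration keys. As a rough aid for reading longer task logs, the Python sketch below collects the distinct `section/key` pairs that are reported as missing. The function name `missing_config_keys` is hypothetical, and the regular expression assumes the exact wording seen in these logs.

```python
import re

# Matches e.g. "section/key [openlineage/config_path] not found in config"
MISSING_KEY = re.compile(r"section/key \[([^\]/]+)/([^\]]+)\] not found in config")


def missing_config_keys(log_text: str) -> list[tuple[str, str]]:
    """Return distinct (section, key) pairs, in order of first appearance."""
    seen: list[tuple[str, str]] = []
    for section, key in MISSING_KEY.findall(log_text):
        if (section, key) not in seen:
            seen.append((section, key))
    return seen


# Log lines copied from this issue (extractor details shortened out).
sample_log = """\
[2024-10-28, 08:49:55 UTC] {configuration.py:1050} WARNING - section/key [openlineage/disabled_for_operators] not found in config
[2024-10-28, 08:49:55 UTC] {configuration.py:1050} WARNING - section/key [openlineage/config_path] not found in config
[2024-10-28, 08:49:55 UTC] {utils.py:408} WARNING - section/key [openlineage/config_path] not found in config
"""

for section, key in missing_config_keys(sample_log):
    print(f"[{section}] {key}")
```

This only summarizes what the logs already say; it does not indicate which keys, if any, need to be set.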

mobuchowski commented 3 weeks ago

@DonghaLim This page https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/supported_classes.html shows the supported Operators.

Since you're using apache-airflow-providers-openlineage, the right repo for future issues is https://github.com/apache/airflow/. Marquez visualizes data based on the events it receives, so it can't make up for data that is missing from those events.

Starting with Airflow 2.10, we added a feature that gathers lineage from Airflow Hooks, even if you use operators that are not supported directly. This is a feature we'll develop more, and more hooks will be supported over time. It won't work on 2.7.3, so you won't get dataset data from PythonOperator on that version.

Additionally, I'd recommend using the latest released version of the OpenLineage provider that's compatible with your Airflow version. Version 1.7.0 or later fixes the warning logs you posted recently.

If you have any questions, feel free to ask them on the OpenLineage Slack, in OpenLineage issues, or in Airflow issues/discussions. I'll close this issue, as it's not the right place to discuss the Airflow integration.