apache / gravitino

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
https://gravitino.apache.org
Apache License 2.0

[FEATURE] [Lineage] Support data lineage manage #4711

Open zixi0825 opened 3 weeks ago

zixi0825 commented 3 weeks ago

Describe the feature

Gravitino manages metadata from many types of data sources, which provides a solid foundation for data lineage. Lineage plays a crucial role in data development and governance, and it is essential for tasks such as asset popularity calculation, impact analysis, and attribution analysis.

The implementation of data lineage management primarily involves two aspects:

  1. Lineage Relationship Storage and Querying
    There are two main approaches to storing and querying lineage relationships:
    1) Custom-built storage and query engine: This approach offers greater control and fewer dependencies, but it requires a significant investment of engineering effort.
    2) Integration with existing open-source systems: In the current open-source ecosystem, OpenMetadata, DataHub, and Apache Atlas all offer lineage storage and querying. OpenMetadata and DataHub have more active communities than Atlas, so those projects iterate faster. Gravitino could integrate with one of these metadata management systems and leverage its storage and query capabilities.
    This module can be made pluggable, so that users can select different implementations. That also leaves room for Gravitino to develop its own storage and query components in the future.

  2. Lineage Relationship Acquisition
    Currently, lineage relationships are acquired mainly through the following methods:
    1) Parsing SQL to obtain lineage relationships between tables and columns: This is commonly done with tools such as ANTLR or Calcite. This method is more general-purpose, but it requires adaptation for each type of data source (for example, each SQL dialect).
    2) Using the engine's hook mechanism to output lineage relationships: The engine emits lineage data, and the platform receives and stores it. This method provides more accurate lineage, but it is somewhat intrusive.
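
To make the two aspects above more concrete, here is a minimal sketch in Python of how a pluggable store and a SQL-based acquisition step could fit together. Everything in it is an assumption for illustration: `LineageStore`, `InMemoryLineageStore`, `extract_table_lineage`, and `record_sql_lineage` are hypothetical names, not Gravitino APIs, and the regex is a toy stand-in for a real parser such as ANTLR or Calcite (it only handles simple `INSERT INTO ... SELECT ... FROM ... JOIN ...` statements with plain table names).

```python
from __future__ import annotations

import re
from abc import ABC, abstractmethod
from collections import defaultdict


class LineageStore(ABC):
    """Hypothetical plug-in interface for lineage storage and querying."""

    @abstractmethod
    def add_edge(self, source: str, target: str) -> None:
        """Record that `target` is derived from `source`."""

    @abstractmethod
    def downstream(self, source: str) -> set[str]:
        """Return the assets directly derived from `source`."""


class InMemoryLineageStore(LineageStore):
    """Stand-in backend; an adapter for an external metadata system
    would implement the same interface."""

    def __init__(self) -> None:
        self._edges: dict[str, set[str]] = defaultdict(set)

    def add_edge(self, source: str, target: str) -> None:
        self._edges[source].add(target)

    def downstream(self, source: str) -> set[str]:
        return set(self._edges.get(source, set()))


# Toy patterns, not a SQL parser: no subqueries, CTEs, or quoted identifiers.
_TARGET = re.compile(r"\binsert\s+into\s+([\w.]+)", re.IGNORECASE)
_SOURCES = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def extract_table_lineage(sql: str) -> list[tuple[str, str]]:
    """Return (source_table, target_table) pairs found in one statement."""
    target = _TARGET.search(sql)
    if target is None:
        return []
    return [(source, target.group(1)) for source in _SOURCES.findall(sql)]


def record_sql_lineage(sql: str, store: LineageStore) -> None:
    """Acquire table-level lineage from SQL and persist it in the chosen store."""
    for source, target in extract_table_lineage(sql):
        store.add_edge(source, target)


store = InMemoryLineageStore()
record_sql_lineage(
    "INSERT INTO dw.orders_daily "
    "SELECT o.id, u.name FROM ods.orders o JOIN ods.users u ON o.uid = u.id",
    store,
)
print(store.downstream("ods.orders"))
```

Because the acquisition code only depends on the `LineageStore` interface, the in-memory backend could later be swapped for an adapter to an external system or for a store built into Gravitino, without changing how lineage is collected.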

In my view, data lineage is an indispensable feature for a data asset management platform. Gravitino can implement this functionality gradually.

Motivation

No response

Describe the solution

No response

Additional context

No response

Jiayi-Liao commented 1 week ago

Another interesting idea is to integrate the lineage capability with an LLM such as GPT-4o, which would be extremely helpful when building a complex system that involves developers with different backgrounds.

For example, a data engineer may not fully understand the long training pipeline and the online model inference pipeline, but with an LLM they can simply ask a question like "What business will be affected if I drop this column?", and the LLM can trace the whole lineage pipeline to produce the answer.
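
As a rough sketch of what could sit behind such a question, the lineage side is a plain graph traversal whose result can be handed to the LLM as context. The edge data and the `affected_by_drop` helper below are made up for illustration and assume column-level lineage is already available as upstream-to-downstream edges.

```python
from collections import deque

# Hypothetical lineage edges: upstream asset -> directly derived assets.
EDGES = {
    "ods.users.age": ["features.user_profile.age_bucket"],
    "features.user_profile.age_bucket": [
        "training.ctr_dataset",
        "serving.ctr_model_input",
    ],
    "training.ctr_dataset": ["models.ctr_v3"],
}


def affected_by_drop(column, edges):
    """Breadth-first walk that lists every asset downstream of `column`."""
    seen = {column}  # guards against cycles in the lineage graph
    affected = []
    queue = deque([column])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


impact = affected_by_drop("ods.users.age", EDGES)
# The list could be placed in the LLM prompt so the answer is grounded
# in recorded lineage rather than guessed.
print(impact)
```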