datagero opened this issue 5 days ago (status: Open)
Tool | Created by | Description | Unique Features | Integration | OpenLineage Integration |
---|---|---|---|---|---|
Amundsen | Lyft | An open-source data discovery and metadata engine for data-driven companies. | Automated metadata collection, integration with Apache Airflow, user-friendly UI, support for data lineage. | Integrates well with databases, data warehouses (e.g., BigQuery, Snowflake), Airflow, and other orchestration tools. | Can be integrated with OpenLineage via metadata APIs or custom connectors. |
DataHub | LinkedIn | A metadata platform for data discovery, governance, and data lineage, designed to be scalable and robust. | Real-time data ingestion, metadata versioning, and strong integration with various data ecosystems. | Works with popular data platforms, orchestration tools (e.g., Airflow), and supports metadata standards. | Supports OpenLineage for lineage tracking and metadata collection, and integrates directly with its ecosystem. |
Apache Atlas | Apache Software Foundation | An open-source metadata management and data governance framework. | Rich data classification, data governance, security, and integration with Hadoop ecosystem tools. | Strong integration with Hadoop, Hive, Kafka, and other Apache tools, as well as customizable via REST APIs. | Limited direct support; may require custom connectors or integration efforts for OpenLineage compatibility. |
OpenLineage | OpenLineage Community | An open standard for metadata and lineage collection across various data tools. | Standardized data lineage tracking, flexible integration with multiple platforms, and broad community support. | Natively integrates with dbt, Apache Airflow, Apache Spark, and other data processing tools. | N/A (It is the lineage standard itself, meant to be integrated by other tools like DataHub). |
Metacat | Netflix | A unified data catalog that provides data discovery, lineage, and metadata management. | Supports federated metadata management, lineage, and unified API for different data sources. | Integrates with various databases, data warehouses (e.g., Hive, Presto, RDS), and data lake storage solutions. | No direct OpenLineage integration; may require custom connectors for compatibility with OpenLineage. |
OpenMetadata | OpenMetadata Community | An open-source metadata management tool for data discovery, governance, and quality. | Advanced metadata management, schema versioning, flexible access control, and strong lineage capabilities. | Integrates with modern data stacks, including databases, data lakes, orchestration tools (e.g., Airflow), and BI tools. | Direct integration with OpenLineage for lineage tracking and metadata management, supports seamless data discovery. |
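
To make the "OpenLineage Integration" column more concrete, here is a rough sketch of the kind of run event a catalogue would receive from an OpenLineage producer. It only assembles a plain dictionary with the core fields of an OpenLineage run event. The helper `build_run_event`, the namespaces, the job and dataset names, and the `producer` and `schemaURL` values are all illustrative assumptions, not taken from any of the tools above, and the schema version should be checked against the current spec.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_namespace, job_name, inputs, outputs,
                    producer, schema_url, run_id=None):
    """Assemble a dict shaped like an OpenLineage run event (hypothetical helper)."""
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        # Datasets are identified by a (namespace, name) pair.
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        "producer": producer,
        "schemaURL": schema_url,
    }


# All values below are placeholders chosen for illustration.
event = build_run_event(
    event_type="COMPLETE",
    job_namespace="example-ingestion",
    job_name="load_orders",
    inputs=[("example-source-db", "public.orders_raw")],
    outputs=[("example-warehouse", "analytics.orders")],
    producer="https://example.com/our-ingestion-job",
    # Assumed spec version; verify against the OpenLineage spec before use.
    schema_url="https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
)
print(json.dumps(event, indent=2))
```

A tool with direct OpenLineage support would consume events of this shape over its own ingestion endpoint; for tools listed as needing custom connectors, the connector would have to translate these fields into the tool's own metadata model.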
Key Features Needed for Managing Schemas, Metadata, and Data Quality Metrics:
Open-Source Tools that Align Best with Our Needs:
Ensuring Efficient Integration with Data Ingestion and Quality Control:
Given our requirements and priorities, we should start with OpenMetadata as our initial integration.
Why OpenMetadata?
We aim to keep the initial setup simple and efficient while maintaining flexibility for future growth and experimentation.
Objective:
Develop a centralized Data Catalogue to manage simple, non-evolving schemas, along with other metadata and data quality metrics. The solution should provide basic versioning, efficient retrieval, and seamless integration with existing tools, while maintaining open-source compatibility and usability.
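
As a rough illustration of the objective, the sketch below models the minimum behaviour described: registering a schema, basic versioning, retrieval by dataset name, and storing data quality metrics alongside. It is an in-memory toy under stated assumptions, not a proposal for the actual implementation and not how OpenMetadata works internally. The names `DataCatalogue`, `SchemaVersion`, `register`, `get`, `record_metric`, and `metrics` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SchemaVersion:
    version: int
    columns: Dict[str, str]  # column name -> type name


class DataCatalogue:
    """Toy in-memory catalogue: versioned schemas plus quality metrics."""

    def __init__(self) -> None:
        self._schemas: Dict[str, List[SchemaVersion]] = {}
        self._metrics: Dict[str, Dict[str, float]] = {}

    def register(self, dataset: str, columns: Dict[str, str]) -> int:
        """Store a schema and return its version number.

        Re-registering an unchanged schema returns the existing version,
        which suits simple, non-evolving schemas.
        """
        versions = self._schemas.setdefault(dataset, [])
        if versions and versions[-1].columns == columns:
            return versions[-1].version
        new_version = SchemaVersion(len(versions) + 1, dict(columns))
        versions.append(new_version)
        return new_version.version

    def get(self, dataset: str, version: Optional[int] = None) -> SchemaVersion:
        """Return the requested version, or the latest one if none is given."""
        versions = self._schemas[dataset]  # KeyError for unknown datasets
        return versions[-1] if version is None else versions[version - 1]

    def record_metric(self, dataset: str, name: str, value: float) -> None:
        """Keep the most recent value of a data quality metric."""
        self._metrics.setdefault(dataset, {})[name] = value

    def metrics(self, dataset: str) -> Dict[str, float]:
        return dict(self._metrics.get(dataset, {}))


catalogue = DataCatalogue()
v1 = catalogue.register("orders", {"id": "int", "amount": "float"})
same = catalogue.register("orders", {"id": "int", "amount": "float"})
v2 = catalogue.register("orders", {"id": "int", "amount": "float", "currency": "str"})
catalogue.record_metric("orders", "null_rate", 0.02)
```

Keeping every version in an append-only list is the simplest form of versioning; a real catalogue would persist these records and expose them over an API.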
Questions to Explore: