amundsen-io / amundsen

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.
https://www.amundsen.io/amundsen/
Apache License 2.0
4.44k stars 960 forks source link

amundsendatabuilder -> HiveTableMetadataExtractor only works with mysql innodb #552

Closed pPanda-beta closed 4 years ago

pPanda-beta commented 4 years ago

This sql query only works with mysql innodb

https://github.com/lyft/amundsendatabuilder/blob/21a763add3c00c34b4f4c2d9809f59e50fb264c8/databuilder/extractor/hive_table_metadata_extractor.py#L28-L52

Expected Behavior

It extracts table info from metastore db

Current Behavior

It breaks when the syntax is not supported by the database engine

Possible Solution

First of all, HiveTableMetadataExtractor.SQL_STATEMENT should be configurable and customisable (metastore db type, hive version,...) Currently its hardocded https://github.com/lyft/amundsendatabuilder/blob/21a763add3c00c34b4f4c2d9809f59e50fb264c8/databuilder/extractor/hive_table_metadata_extractor.py#L66-L67

Here is a sample for postgres:

    SELECT source.*
    FROM (SELECT t."TBL_ID",
             d."NAME"         as "schema",
             t."TBL_NAME"     as name,
             t."TBL_TYPE",
             tp."PARAM_VALUE" as description,
             p."PKEY_NAME"    as col_name,
             p."INTEGER_IDX"  as col_sort_order,
             p."PKEY_TYPE"    as col_type,
             p."PKEY_COMMENT" as col_description,
             1              as is_partition_col,
             CASE WHEN t."TBL_TYPE" = 'VIRTUAL_VIEW' THEN 1 ELSE 0 END AS is_view
      FROM "TBLS" t
               JOIN "DBS" d ON t."DB_ID" = d."DB_ID"
               JOIN "PARTITION_KEYS" p ON t."TBL_ID" = p."TBL_ID"
               LEFT JOIN "TABLE_PARAMS" tp ON (t."TBL_ID" = tp."TBL_ID" AND tp."PARAM_KEY" = 'comment')

      {where_clause_suffix}
      UNION
      SELECT t."TBL_ID",
             d."NAME"         as "schema",
             t."TBL_NAME"     as name,
             t."TBL_TYPE",
             tp."PARAM_VALUE" as description,
             c."COLUMN_NAME"  as col_name,
             c."INTEGER_IDX"  as col_sort_order,
             c."TYPE_NAME"    as col_type,
             c."COMMENT"      as col_description,
             0              as is_partition_col,
             CASE WHEN t."TBL_TYPE" = 'VIRTUAL_VIEW' THEN 1 ELSE 0 END AS is_view
      FROM "TBLS" t
               JOIN "DBS" d ON t."DB_ID" = d."DB_ID"
               JOIN "SDS" s ON t."SD_ID" = s."SD_ID"
               JOIN "COLUMNS_V2" c ON s."CD_ID" = c."CD_ID"
               LEFT JOIN "TABLE_PARAMS" tp ON (t."TBL_ID" = tp."TBL_ID" AND tp."PARAM_KEY" = 'comment')

      {where_clause_suffix}
     ) source
    ORDER by "TBL_ID", is_partition_col desc;

Unfortunately there is no option to override it so I had to rewrite entire HiveTableMetadataExtractor class

(a few hacks are there, `HiveTableMetadataExtractor.SQL_STATEMENT = "My own sql as in python nothing is immutable " )

Steps to Reproduce

Run hive_sample_dag with hive metastore powered by postgres

Screenshots (if appropriate)

Context

Your Environment

feng-tao commented 4 years ago

sorry for the delay, here is the pr (https://github.com/lyft/amundsendatabuilder/pull/312) /

feng-tao commented 4 years ago

^^ @pPanda-beta you could pull the change to your test env to unblock.

pPanda-beta commented 4 years ago

@feng-tao : Due to huge amount of initial bootstrap works, we moved to amundsen with atlas instead of native. But I can say this will definitely allow people, to work with their own type of hive metastore database. Thanks for the PR.