datahub-project / datahub

The Metadata Platform for your Data Stack
https://datahubproject.io
Apache License 2.0
9.68k stars 2.86k forks source link

Incorrect input dataset platform for Redash Presto charts #3843

Open Eric-Xu opened 2 years ago

Eric-Xu commented 2 years ago

Describe the bug When using the Redash Connector (Datahub v0.8.21) with parse_table_names_from_sql: true, the platform value used to construct the upstream dataset URN does not take into account the data catalog type (e.g. Hive) when a dashboard or chart is pulling from a Presto data source.

To Reproduce Steps to reproduce the behavior:

  1. Run the ingestion Redash recipe with parse_table_names_from_sql: true. The Redash cluster should be configured where Presto is the query engine and Hive is the underlying data catalog.
  2. Once ingestion has finished, go to one of the ingested Redash charts from the Datahub UI and click on "Inputs". The platform for the input datasets is labeled as "Presto", but should actually be "Hive".

Expected behavior When ingesting dashboards and charts from Redash with parse_table_names_from_sql: true, if the data source type is Presto, then check what is the underlying data catalog (e.g. Hive) to be used as the platform value when constructing the upstream dataset URN.

Sample JSON response from Redash's /api/data_sources/ API endpoint being called here.

{
    "scheduled_queue_name": "scheduled_queries",
    "name": "presto-prod",
    "pause_reason": null, "queue_name": "queries",
    "syntax": "sql",
    "paused": 0,
    "options": {
        "catalog": **"hive"**,
        "schema": "default",
        "protocol": "http",
        "host": "presto-prod.abc.com",
        "port": 8889
    }, 
    "groups": {"2": false}, 
    "type": **"presto"**,
    "id": 1
}

In the example above, the upstream dataset platform should be set to hive and not presto.

maggiehays commented 2 years ago

Hi @taufiqibrahim! Would you mind taking a look at this one?

taufiqibrahim commented 2 years ago

Hi @maggiehays , okay let me take a look on this one.

taufiqibrahim commented 2 years ago

Hi @Eric-Xu @maggiehays

I've done some checks with Presto, and learn some concepts about Presto Catalog.

Basically, a catalog references a data source via a connector as defined here. So, it's just basically a name.

For Hive catalog/connector, the common default connector name is hive and stored in /etc/catalog/hive.properties configuration file inside Presto server. It's also allow us to have multiple Hive clusters as explained here.

So, if we have Hive connector/catalog using name other than hive, for example sales like describe on Presto doc above, it will be impossible to guess the platform correctly.

Any ideas about that?

maggiehays commented 2 years ago

Hi @taufiqibrahim! We recently saw this with our Metabase connector as well - take a look at how it was resolved in this PR

maggiehays commented 2 years ago

Hi @Eric-Xu & @taufiqibrahim - let me know if this is a fix you're able to push!

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io