datahub-project / datahub

The Metadata Platform for your Data and AI Stack
https://datahubproject.io
Apache License 2.0

Column tags displayed twice for column in dbt and postgresql dataset #10886

Open k-popov opened 4 months ago

k-popov commented 4 months ago

Describe the bug
When viewing a dataset that combines DBT and PostgreSQL (ingested separately and linked together via the platform_instance recipe option), the same tag is displayed twice.

Clicking on either of the No PDN tags causes DataHub to request the same tag URN (checked with the browser inspector tool). The response to the getDataset GraphQL request issued by the page also shows that only one tag is assigned to the column:

{
  "data": {
    "dataset": {
      "schemaMetadata": {"aspectVersion":…},
      "editableSchemaMetadata": {
        "editableSchemaFieldInfo": [
          {"fieldPath":…},
          {"fieldPath":…},
          {"fieldPath":…},
          {"fieldPath":…},
          {"fieldPath":…},
          {"fieldPath":…},
          {"fieldPath":…},
          {"fieldPath":…},
          {"fieldPath":…},
          {"fieldPath":…},
          {"fieldPath":…},
          {"fieldPath":…},
          {"fieldPath":…},
          {"fieldPath":…},
          {
            "fieldPath": "call_tries_count",
            "description": null,
            "globalTags": {
              "tags": [
                {
                  "tag": {
                    "urn": "urn:li:tag:No PDN",
                    "type": "TAG",
                    "name": "No PDN",
                    "description": "Description",
                    "properties": {"name":…},
                    "__typename": "Tag"
                  },
                  "associatedUrn": "VALID_DATASET_URN_HERE",
                  "__typename": "TagAssociation"
                }
              ],
              "__typename": "GlobalTags"
            },
            "glossaryTerms": null,
            "__typename": "EditableSchemaFieldInfo"
          },
          {"fieldPath":…},
          {"fieldPath":…},
          {"fieldPath":…},
          {"fieldPath":…},
          {"fieldPath":…}
        ],
        "__typename": "EditableSchemaMetadata"
      },
      "__typename": "Dataset",
      "siblings": {"isPrimary":…}
    }
  },
  "extensions": {}
}
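A quick way to confirm the same thing without reading the JSON by eye is to list the tag URNs per field and look for repeats. The following is only a sketch: the helper names `tag_urns_by_field` and `duplicated_urns` are made up for illustration, and `sample_response` is a trimmed copy of the response above that keeps just the one field shown in full.

```python
from collections import Counter

# Trimmed copy of the getDataset response above (only one field kept).
sample_response = {
    "data": {
        "dataset": {
            "editableSchemaMetadata": {
                "editableSchemaFieldInfo": [
                    {
                        "fieldPath": "call_tries_count",
                        "globalTags": {
                            "tags": [
                                {"tag": {"urn": "urn:li:tag:No PDN", "name": "No PDN"}}
                            ]
                        },
                    }
                ]
            }
        }
    }
}


def tag_urns_by_field(response: dict) -> dict:
    """Map each fieldPath to the list of tag URNs attached to it."""
    dataset = response["data"]["dataset"]
    editable = dataset.get("editableSchemaMetadata") or {}
    result = {}
    for field in editable.get("editableSchemaFieldInfo") or []:
        global_tags = field.get("globalTags") or {}  # may be null in the response
        result[field["fieldPath"]] = [
            assoc["tag"]["urn"] for assoc in global_tags.get("tags") or []
        ]
    return result


def duplicated_urns(urns: list) -> list:
    """Return the URNs that occur more than once in the given list."""
    return sorted(urn for urn, count in Counter(urns).items() if count > 1)


urns = tag_urns_by_field(sample_response)
print(urns)
print(duplicated_urns(urns["call_tries_count"]))
```

For the response above this reports a single URN for call_tries_count and no duplicates, which matches the observation that the duplication is not present in this aspect of the response.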

DataHub 0.13.2 running in Kubernetes.

To Reproduce
Steps to reproduce the behavior:

  1. Set up two ingestion sources, PostgreSQL and DBT, with the same platform_instance (PostgreSQL set up in the UI, DBT via the CLI) and ingest the corresponding data.
  2. Assign a tag to a column via the UI.
  3. Browse the datasets.
  4. See the behavior in the screenshot: the same tag is displayed twice, both in the column list and in the column detail on the right.

Expected behavior
The tag is displayed only once.

Screenshots image


k-popov commented 4 months ago

Found more evidence suggesting this happens because the dataset is a combination of DBT and PostgreSQL. Below is a screenshot of the "combined" view:

dh_double_tags_combined

Meanwhile this is the same table in DBT:

dh_double_tags_dbt

And the same in PostgreSQL:

dh_double_tags_psql
So for some of the fields the tags are as follows (columns: field, DBT, PostgreSQL, Combined):
id PDN PDN
call_break PDN NoPDN PDN, NoPDN
call_tries_count NoPDN NoPDN NoPDN, NoPDN

Notice that tags in the "combined" view (not sure what the correct name for it is) are concatenated rather than merged. If the same tag (same URN, not just the same text) is set in both sources, the combined view contains it twice. For different tags (the call_break column) concatenation seems to be the correct behaviour (the conflicting tags are semantically wrong, but that is a data markup mistake). But if the tags are the same, only one should be kept.
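To make the difference between "concatenated" and "merged" concrete, here is a minimal sketch. It is not DataHub's actual sibling-merging code and not the change proposed in any PR; the function names are hypothetical, and the URN `urn:li:tag:PDN` is assumed for illustration (only `urn:li:tag:No PDN` appears in the response above).

```python
def concat_tags(dbt_tags: list, postgres_tags: list) -> list:
    """What the combined view appears to do: append one list to the other."""
    return list(dbt_tags) + list(postgres_tags)


def merge_tags_by_urn(dbt_tags: list, postgres_tags: list) -> list:
    """Expected behaviour: keep the first association seen for each tag URN."""
    seen = set()
    merged = []
    for assoc in [*dbt_tags, *postgres_tags]:
        urn = assoc["tag"]["urn"]
        if urn not in seen:
            seen.add(urn)
            merged.append(assoc)
    return merged


# Tag associations shaped like the GraphQL response (trimmed).
NO_PDN = {"tag": {"urn": "urn:li:tag:No PDN"}}
PDN = {"tag": {"urn": "urn:li:tag:PDN"}}  # assumed URN, for illustration only


def urns(tags: list) -> list:
    return [assoc["tag"]["urn"] for assoc in tags]


# call_tries_count: the same tag is set on both sides.
print(urns(concat_tags([NO_PDN], [NO_PDN])))
print(urns(merge_tags_by_urn([NO_PDN], [NO_PDN])))

# call_break: different tags on each side, so both are kept.
print(urns(merge_tags_by_urn([PDN], [NO_PDN])))
```

Deduplicating on the URN rather than on the display name keeps two distinct tags that happen to share a name, while collapsing the case described above where both sources carry the very same tag.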

k-popov commented 4 months ago

Posted a suggested workaround for the issue: #10964. It does the job (no duplicate tags are shown) but may break something else. It needs a review, or maybe even a rewrite, from someone more familiar with this part of DataHub.