datahub-project / datahub

The Metadata Platform for your Data and AI Stack
https://datahubproject.io
Apache License 2.0

Refresh of editableSchemaMetadata based on schemaMetadata #11642

Open kartikey-visa opened 1 month ago

kartikey-visa commented 1 month ago

Describe the bug When ingestion refreshes a dataset's schemaMetadata (for example, by removing columns), the change is not reflected in editableSchemaMetadata: the links attached to the removed columns still exist in editableSchemaMetadata.

To Reproduce Steps to reproduce the behaviour:

  1. Attach glossary terms or business attributes to a few columns of a dataset in DataHub.
  2. Delete those columns at the source and re-ingest the dataset into DataHub.
  3. Open the Related Entities tab of a glossary term that was attached to one of the deleted columns.
  4. Find or scroll down to that dataset; it still shows up in the Related Entities tab.
  5. Since the column the glossary term was attached to has been deleted, the dataset should not show up in Related Entities. It still appears in the list because the column, along with its links (glossary terms etc.), still exists in editableSchemaMetadata.

Expected behaviour When the schemaMetadata of a dataset is updated (for example, columns are removed), editableSchemaMetadata should also be updated as a side effect.
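To make the mismatch concrete, here is a minimal sketch of how the orphaned entries could be detected. It assumes both aspects are available as plain dicts in their JSON form, with `fields[].fieldPath` in schemaMetadata and `editableSchemaFieldInfo[].fieldPath` in editableSchemaMetadata. The function name `orphaned_field_paths` is hypothetical and not part of DataHub.

```python
def orphaned_field_paths(schema_metadata: dict, editable_schema_metadata: dict) -> list[str]:
    """Return field paths that exist only in editableSchemaMetadata.

    Both arguments are assumed to be the JSON (dict) form of the aspects.
    """
    # Field paths that are still present after the latest ingestion.
    current = {field["fieldPath"] for field in schema_metadata.get("fields", [])}

    # Editable entries whose column no longer exists in the ingested schema.
    return [
        info["fieldPath"]
        for info in editable_schema_metadata.get("editableSchemaFieldInfo", [])
        if info["fieldPath"] not in current
    ]
```

Any path returned by this check corresponds to a column whose glossary terms would still surface in Related Entities even though the column itself is gone.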

Additional context This problem is bound to occur in all editable aspects of DataHub, not only editableSchemaMetadata.

jjoyce0510 commented 2 weeks ago

Thanks for reporting the issue. It sounds like we are not properly cleaning up once a column is removed from SchemaMetadata.

We are processing this internally, but as of today this is not an immediate priority.

In terms of the solution: the best way to address this would be to use a MetadataChangeProposal side effect that updates the editableSchemaMetadata aspect whenever the schemaMetadata aspect changes.
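As a rough illustration of what such a side effect would need to compute, the sketch below prunes editable entries for columns that no longer exist. It is not DataHub's side-effect API: it works on plain dicts in the aspects' JSON form, and `prune_editable_schema_metadata` is a hypothetical helper name. A real implementation would run inside the metadata service when a schemaMetadata change is processed and would emit the pruned aspect as a follow-up change.

```python
import copy


def prune_editable_schema_metadata(schema_metadata: dict, editable_schema_metadata: dict) -> dict:
    """Return a copy of editableSchemaMetadata limited to columns still in schemaMetadata.

    Assumption: both arguments are the JSON (dict) form of the aspects.
    """
    # Columns that survive in the freshly ingested schema.
    live_paths = {field["fieldPath"] for field in schema_metadata.get("fields", [])}

    # Work on a copy so the caller's aspect is left untouched.
    pruned = copy.deepcopy(editable_schema_metadata)
    pruned["editableSchemaFieldInfo"] = [
        info
        for info in pruned.get("editableSchemaFieldInfo", [])
        if info["fieldPath"] in live_paths
    ]
    return pruned
```

Whether stale entries should be dropped outright or retained in case the column is later restored is a design choice this sketch does not settle.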