MarquezProject / marquez

Collect, aggregate, and visualize a data ecosystem's metadata
https://marquezproject.ai
Apache License 2.0

Support improved transformation metadata from column lineage #2851

Open davidjgoss opened 4 months ago

davidjgoss commented 4 months ago

The OpenLineage standard column lineage facet was extended in 1.17.1: each entry in inputFields can now carry a transformations array describing how that specific input field is transformed in the context of the output field. See https://github.com/OpenLineage/OpenLineage/pull/2756.

Ideally Marquez should support storing and serving this data if present in OpenLineage events.

Note that the existing transformationType and transformationDescription fields at the output field level still exist but have been deprecated.
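For illustration, here is a rough sketch of what such a facet fragment might look like and how a consumer could read the per-input transformations. The dataset and field names are made up, and the keys inside each transformation (type, subtype, description, masking) reflect my reading of the linked PR, so they should be checked against the published spec. The helper transformations_by_input is hypothetical.

```python
# Hypothetical columnLineage facet fragment. Names and values are illustrative;
# the transformation keys are an assumption based on the linked OpenLineage PR.
facet = {
    "fields": {
        "total_amount": {
            "inputFields": [
                {
                    "namespace": "db://example",
                    "name": "public.orders",
                    "field": "amount",
                    "transformations": [
                        {"type": "DIRECT", "subtype": "AGGREGATION",
                         "description": "", "masking": False}
                    ],
                },
                {
                    "namespace": "db://example",
                    "name": "public.orders",
                    "field": "customer_id",
                    "transformations": [
                        {"type": "INDIRECT", "subtype": "GROUP_BY",
                         "description": "", "masking": False}
                    ],
                },
                # An older producer may omit "transformations" entirely.
                {"namespace": "db://example", "name": "public.orders",
                 "field": "legacy_col"},
            ]
        }
    }
}


def transformations_by_input(facet):
    """Map (output field, input field) to its transformations list.

    Input fields without a "transformations" key map to an empty list.
    """
    result = {}
    for output_field, spec in facet.get("fields", {}).items():
        for inp in spec.get("inputFields", []):
            result[(output_field, inp["field"])] = inp.get("transformations", [])
    return result


by_input = transformations_by_input(facet)
```

The point of the sketch is that transformations are keyed per input field, not per output field, which is what the deprecated output-level fields could not express.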

Database

The corresponding table in Marquez would be column_lineage, where each row effectively represents one entry in inputFields. We could add a second table that joins to it (e.g. column_lineage_transformations) or, perhaps more pragmatically, use a JSON column on the existing table to hold the transformations.
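A minimal sketch of the JSON-column option, using an in-memory SQLite database purely as a stand-in for the real database. The column names here are simplified and hypothetical (they are not the actual column_lineage schema), and insert_row and read_transformations are illustrative helpers, not existing Marquez code.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified, hypothetical version of column_lineage with one extra column.
conn.execute(
    """
    CREATE TABLE column_lineage (
        output_field    TEXT NOT NULL,
        input_field     TEXT NOT NULL,
        transformations TEXT  -- JSON array; NULL when the event carries none
    )
    """
)


def insert_row(conn, output_field, input_field, transformations=None):
    """Store one inputFields entry, serializing transformations as JSON."""
    payload = json.dumps(transformations) if transformations is not None else None
    conn.execute(
        "INSERT INTO column_lineage VALUES (?, ?, ?)",
        (output_field, input_field, payload),
    )


def read_transformations(conn, output_field, input_field):
    """Return the stored transformations, or an empty list if none were stored."""
    row = conn.execute(
        "SELECT transformations FROM column_lineage "
        "WHERE output_field = ? AND input_field = ?",
        (output_field, input_field),
    ).fetchone()
    if row is None or row[0] is None:
        return []
    return json.loads(row[0])


insert_row(conn, "total_amount", "amount",
           [{"type": "DIRECT", "subtype": "AGGREGATION"}])
insert_row(conn, "total_amount", "legacy_col")  # event without transformations
```

Keeping the column nullable means existing rows and events from older producers need no backfill, which is part of why this option seems more pragmatic than a join table.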

API

The transformations array could be added to the ColumnLineageInputField model which is included in the column lineage response and the dataset response.
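As a sketch of the response shape, here is the idea modelled with Python dataclasses. This is not the actual Marquez model: only the transformations attribute comes from the proposal above, while the other attribute names and the Transformation class are assumptions for illustration.

```python
import dataclasses
from typing import List, Optional


@dataclasses.dataclass
class Transformation:
    # Keys assumed from the OpenLineage change; verify against the spec.
    type: str
    subtype: Optional[str] = None
    description: Optional[str] = None
    masking: Optional[bool] = None


@dataclasses.dataclass
class ColumnLineageInputField:
    # namespace/dataset/field are assumed names for the existing attributes.
    namespace: str
    dataset: str
    field: str
    # New: defaults to an empty list so older events still serialize cleanly.
    transformations: List[Transformation] = dataclasses.field(default_factory=list)


def to_response(input_field):
    """Convert the model, including nested transformations, to a plain dict."""
    return dataclasses.asdict(input_field)


with_transformations = to_response(
    ColumnLineageInputField(
        namespace="db://example",
        dataset="public.orders",
        field="amount",
        transformations=[Transformation(type="DIRECT", subtype="AGGREGATION")],
    )
)
without_transformations = to_response(
    ColumnLineageInputField("db://example", "public.orders", "legacy_col")
)
```

Defaulting to an empty array rather than omitting the key would keep the response shape stable for clients, though either choice could work.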

mattwparas commented 3 months ago

I'd be happy to contribute this change (I would also like to see the feature implemented), but would probably need a little guidance on how to get started.

wslulciuc commented 4 weeks ago

Great suggestion @davidjgoss; there's a similar discussion in issue https://github.com/MarquezProject/marquez/issues/2874