Open collado-mike opened 3 years ago
For option 1, I proposed a server model in OpenLineage: https://github.com/OpenLineage/OpenLineage/pull/300
Thanks for the great write-up, @collado-mike. I think whatever approach we go with, Marquez should eventually use the OpenLineage server-specific models defined for consumption; see https://github.com/OpenLineage/OpenLineage/issues/67. That said, I'd favor option 1. Using `Map<String, Object>` to capture any additional properties that are not part of the core OpenLineage `RunEvent` class gives us enough flexibility to access facets. For option 2, I'd like to avoid hand-written methods or classes in favor of the classes generated by OpenLineage, and similarly for the remaining options.
> There is some maintainability concern, as we need to update the Marquez model alongside the OL one

Given PR OpenLineage/OpenLineage#300 opened by @julienledem, maintainability only becomes a concern when the core OpenLineage `RunEvent` changes?
The change in https://github.com/MarquezProject/marquez/pull/1593 made the `marquez-api` jar incompatible with code that had depended on the `LineageEvent` class and its related classes. Any code that depended on those models must now be rewritten to rely on the `OpenLineage.*` models, which have a very different construction model and thus require a major effort to rewrite.

Moreover, the current OpenLineage API has introduced new fields in the `InputDataset` and `OutputDataset` models, which were never present in the Marquez implementation of the OpenLineage models. The `LineageEvent` model is annotated with `@JsonIgnoreProperties`, so any new fields in the JSON are simply dropped during deserialization. Therefore, simply reverting the `LineageEvent` models would make the Marquez backend incompatible with the new OpenLineage models, as new facets would be dropped from the model before storing.

I think we should revert #1593 and alter the models to support unknown fields. Some options for this are:
1. `Map<String, Object>` field annotated with `@JsonAnySetter` so that any unknown fields are added to the map, rather than dropped.
2. (`@JsonUnwrapped`) Jackson `ObjectNode` so that objects are automatically deserialized into `JsonNode`s and setters/getters are written to work with expected properties in a compatible API
3. `OpenLineage` model classes with existing Marquez models

`OpenLineage` models (so we can receive and store the data even if the Marquez model is never updated). However, we still need to maintain the compatibility layer (the accessor methods) and we are still limited to the fields defined in the version of the OL library deployed with Marquez. Moreover, the OL API for constructing events is a bit cumbersome to use in a case like this. Each model class must be instantiated by an instance of the `OpenLineage` class, which is instantiated with the appropriate `producer` field. Thus, we can't simply instantiate a new `Job` or `JobFacet` and expect the accompanying `OpenLineage.Job` or `OpenLineage.JobFacets` class to be instantiated, as there needs to be a shared `OpenLineage` instance to actually create the instances. This is easy enough to accomplish for model instances that are created purely from Marquez (e.g., a static utility instance), but makes it very difficult to build a processing workflow, such as one that clones a model and adds a new facet (and maintains the original models' `producer` fields) before handing off to another processor.

`lineage_events` table is incomplete. However, it makes processing objects that have unknown fields impossible; e.g., a workflow that copies a `LineageEvent` and adds another facet to the `Run` before passing on to storage or another processor would immediately lose information. It also does not offer any additional maintainability support, as the Marquez models must always be updated to synchronize with the OL models.

Of the four options, the first offers the most compatibility and flexibility while maintaining forward/backward compatibility and a relatively low maintainability concern.
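
A minimal, dependency-free sketch of the idea behind option 1 may help here. It assumes a model with two known fields plus an overflow map, and shows that a copy-and-extend step keeps fields the model does not know about. `EventSketch`, `fromMap`, `withProperty`, and `toMap` are hypothetical names for illustration only, not Marquez or OpenLineage classes. In Marquez the routing of unknown keys would presumably be done by Jackson (`@JsonAnySetter` on the way in, `@JsonAnyGetter` on the way out) rather than by the hand-written methods below, and the example adds a top-level property rather than a nested run facet to stay short.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for an event model; NOT the real LineageEvent class.
final class EventSketch {
  private final String eventType; // known field
  private final String producer;  // known field
  private final Map<String, Object> additionalProperties; // everything else

  private EventSketch(
      String eventType, String producer, Map<String, Object> additionalProperties) {
    this.eventType = eventType;
    this.producer = producer;
    this.additionalProperties = additionalProperties;
  }

  // Routes known keys to fields and keeps every other key in the overflow map.
  // (Assumption: with Jackson, an @JsonAnySetter method would play this role.)
  static EventSketch fromMap(Map<String, Object> json) {
    Map<String, Object> rest = new LinkedHashMap<>(json);
    String eventType = (String) rest.remove("eventType");
    String producer = (String) rest.remove("producer");
    return new EventSketch(eventType, producer, rest);
  }

  // Returns a copy with one extra property; the original instance is unchanged.
  EventSketch withProperty(String key, Object value) {
    Map<String, Object> copy = new LinkedHashMap<>(additionalProperties);
    copy.put(key, value);
    return new EventSketch(eventType, producer, copy);
  }

  // Writes known and unknown fields back out, so nothing is lost before storage.
  Map<String, Object> toMap() {
    Map<String, Object> out = new LinkedHashMap<>();
    out.put("eventType", eventType);
    out.put("producer", producer);
    out.putAll(additionalProperties);
    return out;
  }

  Map<String, Object> getAdditionalProperties() {
    return Collections.unmodifiableMap(additionalProperties);
  }
}

final class EventSketchDemo {
  public static void main(String[] args) {
    Map<String, Object> incoming = new LinkedHashMap<>();
    incoming.put("eventType", "COMPLETE");
    incoming.put("producer", "https://example.com/producer"); // placeholder value
    incoming.put("fieldAddedByNewerSpec", "unknown to this model");

    EventSketch original = EventSketch.fromMap(incoming);
    EventSketch enriched = original.withProperty("addedByProcessor", true);

    // The unknown field survives both deserialization and the copy step.
    System.out.println(enriched.toMap());
  }
}
```

With a model that ignores unknown properties instead, `fieldAddedByNewerSpec` would already be gone after the first step, which is the information loss described above.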