MarquezProject / marquez

Collect, aggregate, and visualize a data ecosystem's metadata
https://marquezproject.ai
Apache License 2.0

Binary incompatibility after deleting LineageEvent classes #1650

Open collado-mike opened 3 years ago

collado-mike commented 3 years ago

The change in https://github.com/MarquezProject/marquez/pull/1593 made the marquez-api jar binary-incompatible with code that depended on the LineageEvent class and its related classes. Any code that depended on those models must now be rewritten to rely on the OpenLineage.* models, which have a very different construction model and thus require a major rewrite effort.

Moreover, the current OpenLineage API has introduced new fields in the InputDataset and OutputDataset models, which were never present in the Marquez implementation of the OpenLineage models. The LineageEvent model is annotated with @JsonIgnoreProperties so any new fields in the JSON are simply dropped during deserialization. Therefore, simply reverting the LineageEvent models would make the Marquez backend incompatible with the new OpenLineage models as new facets would be dropped from the model before storing.
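
To illustrate the dropped-field problem, here is a minimal sketch. `StrictDataset` is a hypothetical stand-in, not the real Marquez model, and it assumes the annotation is used as `@JsonIgnoreProperties(ignoreUnknown = true)`; the extra `inputFacets` property stands in for any field the model does not declare.

```java
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical stand-in for a Marquez dataset model (assumption, not the real class).
@JsonIgnoreProperties(ignoreUnknown = true)
class StrictDataset {
  public String namespace;
  public String name;
}

class DroppedFieldsDemo {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  // Deserialize into the strict model, then serialize it back to JSON.
  static String roundTrip(String json) {
    try {
      StrictDataset dataset = MAPPER.readValue(json, StrictDataset.class);
      return MAPPER.writeValueAsString(dataset);
    } catch (JsonProcessingException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String json =
        "{\"namespace\":\"ns\",\"name\":\"orders\",\"inputFacets\":{\"rows\":10}}";
    // The undeclared "inputFacets" property is silently discarded on the way in,
    // so it cannot appear in the JSON written back out.
    System.out.println(roundTrip(json));
  }
}
```

Anything stored after such a round trip has already lost the undeclared fields, which is why a plain revert would not be enough.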

I think we should revert #1593 and alter the models to support unknown fields. Some options for this are:

  1. Add a Map<String, Object> field annotated with @JsonAnySetter so that any unknown fields are added to the map, rather than dropped.
    • This requires little work up front and offers backward and forward compatibility, as any unknown fields are automatically supported. There is some maintainability concern, as we need to update the Marquez model alongside the OL one.
  2. Extend or wrap (using @JsonUnwrapped) Jackson ObjectNode so that objects are automatically deserialized into JsonNodes, with setters/getters written to work with the expected properties in a compatible API.
    • This is the most up-front work, but offers the most compatibility and least maintenance. Each model is backward and future compatible with any event POSTed and will always be serialized back into an exact replica of the original event. Accessor methods must be hand-written to replace the lombok-generated ones in order to maintain API compatibility.
  3. Wrap new OpenLineage model classes with existing Marquez models
    • This provides the binary compatibility we need, while avoiding the maintenance issue of synchronizing the Marquez models with the OpenLineage ones. The payload would always be deserialized into OpenLineage models (so we can receive and store the data even if the Marquez model is never updated). However, we still need to maintain the compatibility layer (the accessor methods) and we are still limited to the fields defined in the version of the OL library deployed with Marquez. Moreover, the OL API for constructing events is a bit cumbersome to use in a case like this. Each model class must be instantiated by an instance of the OpenLineage class, which is instantiated with the appropriate producer field. Thus, we can't simply instantiate a new Job or JobFacet and expect the accompanying OpenLineage.Job or OpenLineage.JobFacets class to be instantiated, as there needs to be a shared OpenLineage instance to actually create the instances. This is easy enough to accomplish for model instances that are created purely from Marquez (e.g., a static utility instance), but makes it very difficult to build a processing workflow, such as one that clones a model and adds a new facet (and maintains the original models' producer fields) before handing off to another processor.
  4. Write a custom deserializer to automatically add the raw JSON string to the LineageEvent object
    • This is the least work and solves the most immediate problem: that data serialized and stored in the lineage_events table is incomplete. However, it makes processing objects that have unknown fields impossible. For example, a workflow that copies a LineageEvent and adds another facet to the Run before passing it on to storage or another processor would immediately lose information. It also does not offer any additional maintainability support, as the Marquez models must always be updated to synchronize with the OL models.
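
As a rough illustration of option 2, the sketch below keeps the whole payload in a Jackson `ObjectNode` and exposes hand-written accessors on top of it. `NodeBackedDataset` and its accessors are hypothetical names. Note that it uses a delegating `@JsonCreator` plus `@JsonValue` rather than `@JsonUnwrapped`; that is a substitution made for this sketch, not necessarily the wiring intended above.

```java
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonValue;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

// Hypothetical model that stores the full JSON object, known and unknown fields alike.
class NodeBackedDataset {
  private final ObjectNode node;

  // Jackson binds the entire JSON object to an ObjectNode and passes it in.
  @JsonCreator(mode = JsonCreator.Mode.DELEGATING)
  NodeBackedDataset(ObjectNode node) {
    this.node = node;
  }

  // Hand-written accessors replace the generated ones for expected properties.
  public String getName() {
    JsonNode value = node.get("name");
    return (value == null || value.isNull()) ? null : value.asText();
  }

  public void setName(String name) {
    node.put("name", name);
  }

  // Serialization writes the backing node, so unknown fields are written back too.
  @JsonValue
  ObjectNode toJson() {
    return node;
  }
}

class NodeBackedDatasetDemo {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  static NodeBackedDataset parse(String json) {
    try {
      return MAPPER.readValue(json, NodeBackedDataset.class);
    } catch (JsonProcessingException e) {
      throw new IllegalStateException(e);
    }
  }

  static String write(NodeBackedDataset dataset) {
    try {
      return MAPPER.writeValueAsString(dataset);
    } catch (JsonProcessingException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String json =
        "{\"namespace\":\"ns\",\"name\":\"orders\",\"inputFacets\":{\"rows\":10}}";
    NodeBackedDataset dataset = parse(json);
    dataset.setName("orders_v2");
    System.out.println(write(dataset));
  }
}
```

The cost mentioned above is visible here: every expected property needs its own hand-written getter and setter.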

Of the four options, the first offers the most flexibility while maintaining forward/backward compatibility, with a relatively low maintenance burden.
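
A minimal sketch of what the first option could look like, assuming a hypothetical `FlexibleDataset` model (not the real Marquez class). Unknown properties are collected through `@JsonAnySetter` and written back out through `@JsonAnyGetter`; the `additionalProperties` name is also an assumption.

```java
import com.fasterxml.jackson.annotation.JsonAnyGetter;
import com.fasterxml.jackson.annotation.JsonAnySetter;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model: declared fields are bound normally, everything else goes to the map.
class FlexibleDataset {
  public String namespace;
  public String name;

  private final Map<String, Object> additionalProperties = new LinkedHashMap<>();

  // Called by Jackson for every JSON property that has no matching field.
  @JsonAnySetter
  public void setAdditionalProperty(String key, Object value) {
    additionalProperties.put(key, value);
  }

  // Flattens the map entries back into the top-level JSON object on serialization.
  @JsonAnyGetter
  public Map<String, Object> getAdditionalProperties() {
    return additionalProperties;
  }
}

class FlexibleDatasetDemo {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  // Deserialize into the flexible model, then serialize it back to JSON.
  static String roundTrip(String json) {
    try {
      FlexibleDataset dataset = MAPPER.readValue(json, FlexibleDataset.class);
      return MAPPER.writeValueAsString(dataset);
    } catch (JsonProcessingException e) {
      throw new IllegalStateException(e);
    }
  }

  // True when the round-tripped JSON is structurally equal to the input.
  static boolean preserves(String json) {
    try {
      return MAPPER.readTree(roundTrip(json)).equals(MAPPER.readTree(json));
    } catch (JsonProcessingException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String json =
        "{\"namespace\":\"ns\",\"name\":\"orders\",\"inputFacets\":{\"rows\":10}}";
    System.out.println(roundTrip(json));
    System.out.println(preserves(json));
  }
}
```

The existing typed accessors stay as they are, so callers of the old models would not need to change; only code that wants the new fields has to read them from the map.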

julienledem commented 3 years ago

For option 1, I proposed a server model in OpenLineage: https://github.com/OpenLineage/OpenLineage/pull/300

wslulciuc commented 3 years ago

Thanks for the great write-up, @collado-mike. I think whatever approach we go with, Marquez should eventually use the OpenLineage server-specific models defined for consumption; see https://github.com/OpenLineage/OpenLineage/issues/67. That said, I'd favor option 1. Using Map<String, Object> to capture any additional properties that are not part of the core OpenLineage RunEvent class gives us enough flexibility to access facets. For option 2, I'd like to avoid hand-written methods or classes in favor of classes generated by OpenLineage; the same goes for the remaining options.

"There is some maintainability concern, as we need to update the Marquez model alongside the OL one"

Given PR OpenLineage/OpenLineage#300 opened by @julienledem, wouldn't maintainability only become a concern when the core OpenLineage RunEvent changes?