
[PROPOSAL] Make spark & flink integration classes more extendable #2273

athityakumar opened this issue 1 year ago (status: Open)

athityakumar commented 1 year ago

Purpose:

Creating this issue as per the discussion on this Slack thread.

For teams trying to adopt OpenLineage at scale in their workplace, it is typically preferable to have one place/team that owns and manages all OpenLineage-related configurations, integrations, etc. centrally. For this central management to work, users need to be able to extend the integration classes as much as possible, reusing the existing OSS code while only overriding the minimal set of methods needed for their workplace-specific customisations.

However, the Spark (io.openlineage.spark.agent.OpenLineageSparkListener) and Flink (io.openlineage.flink.OpenLineageFlinkJobListener) listener implementation classes are currently not directly extendable out of the box. Most of the fields/attributes are private without any getters/setters, and the default constructors generated by @Builder have protected scope.
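
To illustrate the pattern being described (a minimal sketch with hypothetical class and field names, not copied from the OpenLineage sources): a class whose collaborators sit in private fields with no accessors, and whose only constructor is the non-public one generated by Lombok's @Builder, leaves a subclass with nothing to hook into.

```java
import lombok.Builder;

// Illustrative sketch only: names are hypothetical, not the actual listener code.
@Builder // generates a package-private all-args constructor by default
public class ExampleListener {
  private final String transportUrl;   // private, no getter/setter
  private final int timeoutSeconds;    // private, no getter/setter

  public void onJobStart(String jobName) {
    // A subclass outside this package cannot invoke the @Builder constructor
    // with its own values, and cannot read transportUrl/timeoutSeconds to
    // adjust behavior, so it has to copy this method wholesale to customise it.
    System.out.printf("emitting start event for %s to %s%n", jobName, transportUrl);
  }
}
```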

Proposed implementation

athityakumar commented 1 year ago

@mobuchowski - Please let me know whether we'd prefer automatic getters/setters via Lombok, or explicit getters/setters.

If the OpenLineage listeners already have a dependency on Lombok, I think we can use Lombok to add the getters/setters. Otherwise, I'd prefer to add the getters/setters manually rather than introduce Lombok as a new dependency just for this, especially because new dependencies could cause version/dependency conflicts with the actual Spark/Flink job code.
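
The two options under discussion look roughly like this (a sketch with hypothetical class and field names, not taken from the integration code):

```java
import lombok.Getter;
import lombok.Setter;

// Option 1: Lombok generates the accessors at compile time.
@Getter
@Setter
class LombokStyleListener {
  private String transportUrl;
}

// Option 2: hand-written accessors, no extra annotation processing involved.
class ExplicitStyleListener {
  private String transportUrl;

  public String getTransportUrl() {
    return transportUrl;
  }

  public void setTransportUrl(String transportUrl) {
    this.transportUrl = transportUrl;
  }
}
```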

mobuchowski commented 1 year ago

@athityakumar I agree with the proposed changes. We already use Lombok a lot (e.g. https://github.com/OpenLineage/OpenLineage/blob/02a01a1ed60916424bff7755942e0f2e10ae8c67/integration/spark/shared/src/main/java/io/openlineage/spark/api/AbstractDatasetFacetBuilder.java#L20), so I think using it there would be a natural choice.

Also, Lombok is a compile-time-only dependency, so it won't cause conflicts with Spark or Flink.
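
For context, the end goal for a central platform team would be something along these lines (a hypothetical sketch, assuming the constructor becomes accessible and the relevant fields get getters; onJobStart is inherited from Spark's org.apache.spark.scheduler.SparkListener):

```java
import org.apache.spark.scheduler.SparkListenerJobStart;

// Hypothetical downstream subclass: override only the piece that needs
// workplace-specific behavior and delegate the rest to the OSS listener.
public class CompanySparkListener extends io.openlineage.spark.agent.OpenLineageSparkListener {

  @Override
  public void onJobStart(SparkListenerJobStart jobStart) {
    // Workplace-specific customisation, e.g. tagging or filtering jobs,
    // before handing off to the stock OpenLineage behavior.
    super.onJobStart(jobStart);
  }
}
```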