MarquezProject / marquez

Collect, aggregate, and visualize a data ecosystem's metadata
https://marquezproject.ai
Apache License 2.0

support for OpenLineage's RUNNING eventType #2054

Open mobuchowski opened 2 years ago

mobuchowski commented 2 years ago

OpenLineage introduces a RUNNING event type, which models a continuous streaming job that is currently running, to differentiate it from the generic OTHER event type. Related issue: https://github.com/OpenLineage/OpenLineage/issues/946, with discussion here: https://github.com/OpenLineage/OpenLineage/issues/599

Are there any possible problems within Marquez with receiving these types of events? I know LineageEvent has a String eventType, but there could be something else dependent on the existing event types.
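For concreteness, here is a minimal sketch of what such an event might look like when built by a client. The top-level field names (`eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`, `producer`) follow the OpenLineage run event shape; the helper `build_run_event`, the namespace, the job name, and the producer URL are hypothetical and only there for illustration.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, namespace, job_name, run_id=None):
    """Build a bare-bones OpenLineage-style run event as a dict.

    Hypothetical helper: only the top-level keys mirror the OpenLineage
    run event; the values passed in below are made up for illustration.
    """
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [],
        "outputs": [],
        # Assumed placeholder; a real producer would put its own URI here.
        "producer": "https://example.com/hypothetical-producer",
    }


# A streaming job that is still running reports RUNNING rather than OTHER.
running_event = build_run_event(
    "RUNNING", "streaming-namespace", "clickstream-enrichment"
)
payload = json.dumps(running_event)
```

Since `eventType` is carried as a plain string, a payload like this should deserialize without trouble; the open question is whether anything downstream branches on the specific value.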

mzareba382 commented 2 years ago

Related PRs: https://github.com/OpenLineage/OpenLineage/pull/972 https://github.com/OpenLineage/OpenLineage/pull/985

collado-mike commented 2 years ago

While Marquez will support an event of type RUNNING, when considering this in the context of a streaming job, we may need to consider the impact of this event on job versions and dataset versions. Currently, Marquez sets the current version of a job and a dataset only when receiving a COMPLETE event. Dataset versions are created before then, but the dataset record itself isn't updated until COMPLETE. Job versions aren't created at all until a COMPLETE event is received. Most importantly, lineage only considers the current_version_uuid column of the jobs table. This means that a streaming job won't show any lineage at all until the job terminates with a COMPLETE event. We can update the logic here, but we need to know it's a streaming job. Perhaps a facet to report that it's a streaming job, not a batch job?
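The behavior described above can be sketched roughly as follows. This is a toy model, not Marquez's actual code: `JobState`, `apply_event`, `has_lineage`, and the `is_streaming` flag are all hypothetical names, and the flag stands in for whatever signal (for example, a facet) would mark a job as streaming.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JobState:
    """Toy stand-in for a row in the jobs table (hypothetical)."""
    current_version: Optional[str] = None
    versions: List[str] = field(default_factory=list)


def apply_event(job, event_type, version_id, is_streaming=False):
    """Update the job's current version according to the event type.

    Mirrors the rule described in the comment: only COMPLETE sets the
    current version. The is_streaming branch is the proposed change,
    not existing behavior.
    """
    sets_version = event_type == "COMPLETE" or (
        is_streaming and event_type == "RUNNING"
    )
    if sets_version:
        if version_id not in job.versions:
            job.versions.append(version_id)
        job.current_version = version_id
    return job


def has_lineage(job):
    """Lineage is only visible once a current version exists."""
    return job.current_version is not None


# Batch job: RUNNING alone leaves it without a current version.
batch = apply_event(JobState(), "RUNNING", "v1")

# Streaming job: with the hypothetical flag, RUNNING is enough.
streaming = apply_event(JobState(), "RUNNING", "v1", is_streaming=True)
```

In this model `has_lineage(batch)` stays false until a COMPLETE event arrives, while `has_lineage(streaming)` becomes true on the first RUNNING event, which is the difference a streaming indicator would need to unlock.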