MarquezProject / marquez

Collect, aggregate, and visualize a data ecosystem's metadata
https://marquezproject.ai
Apache License 2.0

Lineage graph of output dataset from multiple jobs #2640

Open ehbussell opened 1 year ago

ehbussell commented 1 year ago

I'm trying to understand how the lineage graph is built for a dataset that was created by one job and is later written by a different job. For example, if I run job foo, which takes A as input and outputs dataset B, I get A->foo->B as expected.

If I then run a different job, bar, that also reads A and writes to B, both foo and bar are visible in the lineage. I would expect only bar to be included, since the most recent version of dataset B is derived only through job bar. Is this expected behavior?
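To make the scenario concrete, here is a minimal sketch of the two runs expressed as OpenLineage-style run events. It assumes the jobs report lineage as OpenLineage events; the namespace, producer URI, the `run_event` helper, and the spec version in `schemaURL` are illustrative assumptions, not values taken from my setup.

```python
import json
import uuid
from datetime import datetime, timezone

NAMESPACE = "example"  # hypothetical namespace for jobs and datasets
PRODUCER = "https://example.com/lineage-repro"  # hypothetical producer URI
# Assumed spec version; adjust to whatever your client actually emits.
SCHEMA_URL = "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent"


def run_event(job, inputs, outputs, event_type="COMPLETE"):
    """Build one OpenLineage-style run event as a plain dict (hypothetical helper)."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": NAMESPACE, "name": job},
        "inputs": [{"namespace": NAMESPACE, "name": name} for name in inputs],
        "outputs": [{"namespace": NAMESPACE, "name": name} for name in outputs],
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
    }


# First foo writes B from A, then a different job bar writes B from A.
events = [
    run_event("foo", inputs=["A"], outputs=["B"]),
    run_event("bar", inputs=["A"], outputs=["B"]),
]

for event in events:
    print(json.dumps(event, indent=2))
```

Each printed payload could then be sent, in order, to the `/api/v1/lineage` endpoint of a Marquez instance to reproduce the graph described above.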


This seems to differ from the lineage logic applied when the same job is run with different input datasets. For example, if I run a job with input A and then run it again with inputs B and C, only B and C are visible in the lineage graph as inputs, because the graph reflects the most recent run of the job.
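The difference between the two behaviors can be sketched with a toy model. This is only an illustration of the question, not Marquez's implementation; the function names and the job name `baz` are hypothetical.

```python
# Each run is (job, inputs, outputs), listed in the order the runs happened.
dataset_runs = [
    ("foo", ["A"], ["B"]),
    ("bar", ["A"], ["B"]),
]


def producers_all_time(runs, dataset):
    """Every job that has ever written the dataset (what the graph shows)."""
    seen = []
    for job, _inputs, outputs in runs:
        if dataset in outputs and job not in seen:
            seen.append(job)
    return seen


def producers_latest_version(runs, dataset):
    """Only the job that wrote the most recent version (what I expected)."""
    latest = None
    for job, _inputs, outputs in runs:
        if dataset in outputs:
            latest = job
    return [latest] if latest is not None else []


def inputs_latest_run(runs, job):
    """Inputs of the job's most recent run (what the graph shows for job inputs)."""
    latest = []
    for name, inputs, _outputs in runs:
        if name == job:
            latest = list(inputs)
    return latest


print(producers_all_time(dataset_runs, "B"))        # ['foo', 'bar']
print(producers_latest_version(dataset_runs, "B"))  # ['bar']

# Same job run twice with different inputs: only the latest inputs remain.
job_runs = [
    ("baz", ["A"], ["D"]),
    ("baz", ["B", "C"], ["D"]),
]
print(inputs_latest_run(job_runs, "baz"))           # ['B', 'C']
```

In this model, job inputs follow "latest run wins", while dataset producers accumulate across every job that has written the dataset. That asymmetry is what I'm asking about.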

Thanks for your help in understanding this.

boring-cyborg[bot] commented 1 year ago

Thanks for opening your first issue in the Marquez project! Please be sure to follow the issue template!