I'm trying to understand how the lineage graph is visualised for a dataset that was created by one job and then written again by a different job. For example, if I run job foo, which takes A as input and outputs dataset B, I get A->foo->B as expected.
If I then run a different job bar that also reads A and outputs to B, both foo and bar are visible in the lineage. I would expect only bar to be included, since the most recent version of dataset B is derived only through job bar. Is this expected behavior?
This seems to differ from the lineage logic applied when the same job is run with different input datasets. For example, if I run a job with input A and then run it again with inputs B and C, only B and C are visible in the lineage graph as inputs, because the graph reflects the most recent run of the job.
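To make the two cases concrete, here is a minimal toy model in Python. It is only a sketch of the behavior I observe versus the behavior I expected, not the tool's actual implementation; the `Run` class, the helper functions, and the job name `job` in the second scenario are all hypothetical names I made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One job run (hypothetical model): which job ran, what it read and wrote."""
    job: str
    inputs: tuple
    outputs: tuple


def producers_all_runs(runs, dataset):
    """Every job that has ever written `dataset` (matches what I observe)."""
    return {r.job for r in runs if dataset in r.outputs}


def producers_latest_version(runs, dataset):
    """Only the job whose run wrote the newest version of `dataset` (what I expected)."""
    for r in reversed(runs):  # runs are assumed to be in chronological order
        if dataset in r.outputs:
            return {r.job}
    return set()


def inputs_latest_run(runs, job):
    """Inputs of the most recent run of `job` (matches what I observe for inputs)."""
    for r in reversed(runs):
        if r.job == job:
            return set(r.inputs)
    return set()


# Scenario 1: foo writes B from A, then bar writes B from A.
scenario_1 = [
    Run("foo", ("A",), ("B",)),
    Run("bar", ("A",), ("B",)),
]
print(sorted(producers_all_runs(scenario_1, "B")))        # observed: foo and bar
print(sorted(producers_latest_version(scenario_1, "B")))  # expected: bar only

# Scenario 2: the same job runs with input A, then again with inputs B and C.
scenario_2 = [
    Run("job", ("A",), ("D",)),
    Run("job", ("B", "C"), ("D",)),
]
print(sorted(inputs_latest_run(scenario_2, "job")))       # observed: B and C only
```

In other words, job inputs appear to follow the "latest run" rule, while dataset producers appear to follow the "all runs" rule, and that asymmetry is what I'm asking about.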
Thanks for your help in understanding this.