Open dkt-sophie-ly opened 5 months ago
@dkt-sophie-ly, with PR #2641, we should expect the lineage graph to begin using the symlink in the DatasetEvent. But, after looking into how the lineage graph is built (LineageDao.getLineage() to get the job node data), I don't think we've invested heavily in building out symlink support for our lineage graph. @pawel-big-lebowski let me know if that's not the case.
Hi @wslulciuc! Thanks for your reply :)
With this PR https://github.com/MarquezProject/marquez/pull/2736 the lineage graph should be able to see the lineage of a symlinked dataset, but only if the symlink is created beforehand, like this:
{"input1": {
  "symlinks": {
    "identifiers": [{"namespace": "ns2", "name": "input2"}]
  }
}}
Here input1 and input2 can be linked together because they share the same dataset uuid in datasets_view.
If the symlink is created afterwards (both datasets are created separately by 2 different runs, and then a dataset event adds a symlink between these 2 lineages), the lineages won't be linked because the datasets already have different dataset uuids.
I don't know if it's possible, but it would be great if, when a symlink is created afterwards with a DatasetEvent, the dataset uuid changed accordingly (e.g. input2's dataset uuid is changed to match input1's).
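To make the behavior above concrete, here is a toy in-memory model in Python. It is not Marquez code: `ToyStore`, `upsert_dataset`, and `remap_symlink` are hypothetical names, and the model only mimics the idea that datasets are joined on a uuid resolved from (namespace, name).

```python
import uuid


class ToyStore:
    """Hypothetical stand-in for the (namespace, name) -> dataset uuid mapping."""

    def __init__(self):
        self.uuid_by_name = {}

    def upsert_dataset(self, namespace, name, symlinks=()):
        """Register a dataset; symlink names not seen before share its uuid."""
        key = (namespace, name)
        dataset_uuid = self.uuid_by_name.setdefault(key, uuid.uuid4())
        for link in symlinks:
            # A name that already exists keeps its own uuid (current behavior).
            self.uuid_by_name.setdefault(link, dataset_uuid)
        return dataset_uuid

    def remap_symlink(self, target, link):
        """Proposed behavior: point an existing name at the target's uuid."""
        self.uuid_by_name[link] = self.uuid_by_name[target]

    def same_dataset(self, a, b):
        return self.uuid_by_name[a] == self.uuid_by_name[b]


INPUT1 = ("ns1", "input1")
INPUT2 = ("ns2", "input2")

# Case 1: the symlink is declared when the dataset is first created.
early = ToyStore()
early.upsert_dataset(*INPUT1, symlinks=[INPUT2])
linked_up_front = early.same_dataset(INPUT1, INPUT2)

# Case 2: both datasets already exist, then the symlink arrives.
late = ToyStore()
late.upsert_dataset(*INPUT1)
late.upsert_dataset(*INPUT2)
late.upsert_dataset(*INPUT1, symlinks=[INPUT2])
linked_afterwards = late.same_dataset(INPUT1, INPUT2)

# The requested change: re-point input2 at input1's uuid.
late.remap_symlink(INPUT1, INPUT2)
linked_after_remap = late.same_dataset(INPUT1, INPUT2)

print(linked_up_front, linked_afterwards, linked_after_remap)  # True False True
```

In this sketch the late symlink is ignored because input2 already owns a uuid, which matches what I observe; `remap_symlink` is the extra step I am asking about.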
Hi @wslulciuc, just a kind reminder on this issue :)
If I create 2 run events that create 2 separate lineages like the following:
ns1:input1 ----- job1 -----> ns2:output1
and
ns1:input2 ----- job2 -----> ns2:output2
Then I sent a DatasetEvent to create a symlink, specifying that input1 and input2 are in fact the same dataset.
I expected the 2 lineages to merge into one, like the following:
ns1:input1 ----- job1 -----> ns2:output1
     |
     |---------- job2 -----> ns2:output2
But currently the two lineages are not merged and stay separate.
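For reference, a rough Python sketch of the kind of DatasetEvent payload described above. The field layout follows my understanding of the OpenLineage DatasetEvent and symlinks facet; the producer URL is a placeholder, the two schema URLs and the "TABLE" type are assumptions to check against the spec version you run, and `build_symlink_event` is a hypothetical helper.

```python
import json
from datetime import datetime, timezone

# Placeholder producer; replace with whatever identifies your emitter.
PRODUCER = "https://example.com/my-producer"
# Assumed schema URLs; verify against the OpenLineage spec version in use.
EVENT_SCHEMA = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/DatasetEvent"
FACET_SCHEMA = (
    "https://openlineage.io/spec/facets/1-0-0/"
    "SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet"
)


def build_symlink_event(namespace, name, link_namespace, link_name):
    """Build a DatasetEvent dict declaring link_* as a symlink of the dataset."""
    return {
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": EVENT_SCHEMA,
        "dataset": {
            "namespace": namespace,
            "name": name,
            "facets": {
                "symlinks": {
                    "_producer": PRODUCER,
                    "_schemaURL": FACET_SCHEMA,
                    "identifiers": [
                        {
                            "namespace": link_namespace,
                            "name": link_name,
                            "type": "TABLE",  # assumed identifier type
                        }
                    ],
                }
            },
        },
    }


# Declare ns1:input2 as a symlink of ns1:input1, as in the scenario above.
event = build_symlink_event("ns1", "input1", "ns1", "input2")
payload = json.dumps(event, indent=2)
print(payload)
```

The resulting JSON would be sent to the Marquez lineage endpoint after the two run events have already been ingested.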