MarquezProject / marquez

Collect, aggregate, and visualize a data ecosystem's metadata
https://marquezproject.ai
Apache License 2.0

If a Dataset symlink is created afterwards with a DatasetEvent, the link is not created in the lineage #2738

Open dkt-sophie-ly opened 5 months ago

dkt-sophie-ly commented 5 months ago

If I send 2 run events that create 2 separate lineages like the following:

ns1:input1 ----- job1 -----> ns2:output1

and

ns1:input2 ------ job2 -----> ns2:output2

Then I send a DatasetEvent to create a symlink, specifying that input1 and input2 are in fact the same dataset:

{
  "eventTime": "2023-07-18T17:20:00",
  "dataset": {
    "namespace": "ns1",
    "name": "input1",
    "facets": {
      "symlinks": {
        "identifiers": [
          {
            "namespace": "ns1",
            "name": "input2",
            "type": "DB_TABLE"
          }
        ]
      }
    }
  }
}
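
For anyone reproducing this, here is a minimal Python sketch that builds the same payload and prepares a request for it. It only mirrors the fields shown above; a complete OpenLineage event would likely need additional fields (such as a producer and schema URL) that this thread does not show. The helper names `build_symlink_event` and `make_request` are hypothetical, and the base URL `http://localhost:5000` and the `/api/v1/lineage` path are assumptions about a local Marquez setup.

```python
import json
import urllib.request


def build_symlink_event(namespace, name, identifiers, event_time):
    """Return a DatasetEvent-shaped dict carrying a symlinks facet."""
    return {
        "eventTime": event_time,
        "dataset": {
            "namespace": namespace,
            "name": name,
            "facets": {"symlinks": {"identifiers": list(identifiers)}},
        },
    }


def make_request(base_url, event):
    """Prepare (but do not send) a POST of the event as JSON."""
    # Assumption: the server accepts OpenLineage events at /api/v1/lineage.
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/lineage",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


event = build_symlink_event(
    "ns1",
    "input1",
    [{"namespace": "ns1", "name": "input2", "type": "DB_TABLE"}],
    "2023-07-18T17:20:00",
)
request = make_request("http://localhost:5000", event)  # assumed local URL
# Passing `request` to urllib.request.urlopen() would actually send it.
```

Nothing is sent by this snippet, so it can be run without a server to inspect the serialized body first.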

So I expected the 2 lineages to merge into one, like the following:

ns1:input1 ----- job1 -----> ns2:output1
     |
     |---------- job2 -----> ns2:output2

But currently the two lineages are not merged and stay separate.

wslulciuc commented 5 months ago

@dkt-sophie-ly, with PR #2641, we should expect the lineage graph to begin using the symlink in the DatasetEvent. But, after looking into how the lineage graph is built:

  1. Call LineageDao.getLineage() to get the job node data
  2. Then, LineageDao.getDatasetData for the dataset node data

I don't think we've invested heavily in building out symlink support for our lineage graph. @pawel-big-lebowski let me know if that's not the case.

dkt-sophie-ly commented 4 months ago

Hi @wslulciuc ! Thanks for your reply :)

With this PR https://github.com/MarquezProject/marquez/pull/2736 the lineage graph should be able to show the lineage of a symlinked dataset, but only if the symlink is built beforehand, for example:

{
  "input1": {
    "symlinks": {
      "identifiers": [{"namespace": "ns2", "name": "input2"}]
    }
  }
}

Here input1 and input2 can be linked together because they have the same dataset uuid in datasets_view.

If the symlink is created afterwards (both datasets are created separately by 2 different runs, and then a dataset event adds a symlink between the 2 lineages), the lineages won't be linked because the datasets already have different dataset uuids.

I don't know if it is possible, but it would be great if, when a symlink is created afterwards with a DatasetEvent, the dataset uuid changed accordingly (ex: change input2's dataset uuid to be the same as input1's).
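
To make the requested behaviour concrete, here is a toy in-memory sketch in Python. It is not Marquez code and does not reflect its actual schema; `DatasetRegistry` and its methods are hypothetical names. It only shows the idea: when a symlink arrives after both datasets already exist, the alias takes over the target's uuid and earlier lineage edges are rewritten to match.

```python
import uuid


class DatasetRegistry:
    """Toy stand-in for a dataset table keyed by (namespace, name)."""

    def __init__(self):
        self.uuid_of = {}   # (namespace, name) -> dataset uuid
        self.edges = set()  # (input uuid, job name, output uuid)

    def get_or_create(self, namespace, name):
        return self.uuid_of.setdefault((namespace, name), uuid.uuid4())

    def record_run(self, job, inputs, outputs):
        for i in inputs:
            for o in outputs:
                self.edges.add(
                    (self.get_or_create(*i), job, self.get_or_create(*o))
                )

    def add_symlink(self, target, alias):
        """Make alias share target's uuid, rewriting earlier edges."""
        keep = self.get_or_create(*target)
        old = self.uuid_of.get(alias)
        self.uuid_of[alias] = keep
        if old is None or old == keep:
            return
        # Any other name that pointed at the old uuid follows along.
        for key, value in list(self.uuid_of.items()):
            if value == old:
                self.uuid_of[key] = keep
        self.edges = {
            (keep if src == old else src, job, keep if dst == old else dst)
            for src, job, dst in self.edges
        }

    def downstream_jobs(self, dataset):
        u = self.uuid_of[dataset]
        return sorted(job for src, job, _ in self.edges if src == u)


reg = DatasetRegistry()
reg.record_run("job1", [("ns1", "input1")], [("ns2", "output1")])
reg.record_run("job2", [("ns1", "input2")], [("ns2", "output2")])
before = reg.downstream_jobs(("ns1", "input1"))  # only job1 so far
reg.add_symlink(("ns1", "input1"), ("ns1", "input2"))
after = reg.downstream_jobs(("ns1", "input1"))   # job1 and job2
```

In a real database the same idea would presumably be an update of the rows that reference the old uuid, which is where the cost and risk of this feature would lie.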

dkt-sophie-ly commented 4 weeks ago

Hi @wslulciuc, just a kind reminder about this issue :)