Support for a cumulative lineage graph at both levels

davidjgoss commented 1 year ago

Problem

Currently, there is a mismatch between how the dataset/job-level and column-level lineage endpoints behave:

/v1/lineage returns a graph of datasets and jobs, considering only the latest job version (i.e. the last run) of each involved job
/v1/column-lineage returns all column-level relations from all runs

For the first endpoint, this means that if a job's behaviour varies between runs (e.g. sometimes it touches a dataset, and sometimes it doesn't), datasets can drop off the graph even though the column-level relations are still reflected by the second endpoint. So if you combine the two graphs to form a dataset- and column-level visualisation of lineage, the result can be inconsistent, e.g. a column-to-column relationship can appear without a corresponding dataset-job-dataset relation.

More generally, a cumulative lineage graph based on all historical runs would be desirable when the user is trying to ascertain what has actually been happening over time rather than the current state of things.
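
To make the difference concrete, here is a minimal sketch of "latest run only" versus "all runs" graph construction. It is not how Marquez builds lineage; the `Run` type, the function names and the example job and dataset names are all assumptions made up for illustration.

```python
from typing import NamedTuple


class Run(NamedTuple):
    """One run of a job (hypothetical shape, for illustration only)."""
    job: str
    inputs: frozenset[str]
    outputs: frozenset[str]


def run_edges(run: Run) -> set[tuple[str, str]]:
    """Dataset -> job and job -> dataset edges observed in a single run."""
    edges = {(ds, run.job) for ds in run.inputs}
    edges |= {(run.job, ds) for ds in run.outputs}
    return edges


def latest_graph(runs: list[Run]) -> set[tuple[str, str]]:
    """Edges from only the most recent run of each job (runs are oldest first)."""
    latest: dict[str, Run] = {}
    for run in runs:
        latest[run.job] = run  # later runs overwrite earlier ones
    return set().union(*(run_edges(r) for r in latest.values()))


def cumulative_graph(runs: list[Run]) -> set[tuple[str, str]]:
    """Union of the edges observed across every run."""
    return set().union(*(run_edges(r) for r in runs))


# The job reads two inputs on its first run, but only one on its second run.
runs = [
    Run("fancy-job", frozenset({"first-input", "second-input"}), frozenset({"output-table"})),
    Run("fancy-job", frozenset({"first-input"}), frozenset({"output-table"})),
]
dropped = cumulative_graph(runs) - latest_graph(runs)
print(dropped)  # {('second-input', 'fancy-job')}
```

The edge in `dropped` is the kind of relation that disappears from a latest-version graph while its column-level counterpart can still be reported.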

Context

This has been reported as an issue a few other times:

As discussed with @wslulciuc and @collado-mike recently, the /v1/lineage endpoint used to have this cumulative behaviour, but it became too slow in practice, so it was refactored to just use the latest versions (maybe in https://github.com/MarquezProject/marquez/pull/2472?).

Also, there is static lineage coming in https://github.com/MarquezProject/marquez/issues/2624 which further leans into the "just current versions" thing but also is an opportunity to solidify two distinct flavours of lineage - static vs cumulative.

Finally, somewhat related is the idea of "time travel", referenced in:

Proposal

Support returning a cumulative graph based on all runs. This could be with an optional query parameter, something like ?cumulative=true, or (more likely?) a new endpoint.

This would involve a different query, which we would need to make perform acceptably. The performance issues may be mitigated to some degree by the recent addition of retention and cleanup functionality.

We also discussed whether restricting the time period would affect query performance and, if so, whether a default of e.g. 7 days, with an optional parameter to look further back, might allow for good user experiences to be built around this.
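
As a rough sketch of the time-window idea, the cumulative edge set could be limited to runs that ended within a lookback period. Everything here is an assumption for illustration: the record shape, the `edges_within_window` name and the `lookback` parameter are hypothetical, and a real implementation would presumably apply the cutoff in the database query rather than in application code.

```python
from collections.abc import Iterable
from datetime import datetime, timedelta, timezone

# Assumed default, following the "e.g. 7 days" suggestion above.
DEFAULT_LOOKBACK = timedelta(days=7)


def edges_within_window(
    run_records: Iterable[tuple[datetime, str, str]],
    now: datetime,
    lookback: timedelta = DEFAULT_LOOKBACK,
) -> set[tuple[str, str]]:
    """Collect (source, target) edges from runs that ended inside the window.

    Each record is (ended_at, source_node_id, target_node_id).
    """
    cutoff = now - lookback
    return {(source, target) for ended_at, source, target in run_records if ended_at >= cutoff}


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
records = [
    (now - timedelta(days=2), "dataset:my-namespace:first-input", "job:my-namespace:fancy-job"),
    (now - timedelta(days=30), "dataset:my-namespace:second-input", "job:my-namespace:fancy-job"),
]
recent = edges_within_window(records, now)  # default 7-day window
wider = edges_within_window(records, now, lookback=timedelta(days=60))
```

With the default window only the two-day-old edge is returned; widening `lookback` brings the older edge back.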

Finally, a lot of performance could be gained by making this endpoint just return the node ids and edges - the metadata for involved jobs and datasets can then be pulled async in separate requests.

Another consideration is column lineage - could this also be made to offer static vs cumulative flavours? This could be difficult given the column-level lineage relations exist directly between dataset versions and fields.

("Cumulative" is not necessarily the best word for this, but it's the one I keep thinking of.)

davidjgoss commented 10 months ago

IMO, for an API response here, we could go as simple as just a list of source/target pairs of node ids:

{
  "edges": [
    {
      "source": "dataset:my-namespace:first-input",
      "target": "job:my-namespace:fancy-job"
    },
    {
      "source": "dataset:my-namespace:second-input",
      "target": "job:my-namespace:fancy-job"
    },
    {
      "source": "job:my-namespace:fancy-job",
      "target": "dataset:my-namespace:output-table"
    }
  ]
}

The rest of what you need to render a useful graph can be grabbed async from other endpoints with those ids.
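
A client consuming that response might look roughly like the sketch below: collect the unique node ids from the edge list, then fetch each node's metadata concurrently. `fetch_metadata` is a hypothetical stand-in for whatever HTTP calls would be made, and splitting ids on the first two colons assumes the type and namespace parts contain no colon themselves, which may not hold for real namespaces.

```python
import asyncio

SAMPLE_RESPONSE = {
    "edges": [
        {"source": "dataset:my-namespace:first-input", "target": "job:my-namespace:fancy-job"},
        {"source": "dataset:my-namespace:second-input", "target": "job:my-namespace:fancy-job"},
        {"source": "job:my-namespace:fancy-job", "target": "dataset:my-namespace:output-table"},
    ]
}


def node_ids(response: dict) -> list[str]:
    """Unique node ids from the edge list, in first-seen order."""
    seen: dict[str, None] = {}
    for edge in response["edges"]:
        seen.setdefault(edge["source"], None)
        seen.setdefault(edge["target"], None)
    return list(seen)


def parse_node_id(node_id: str) -> tuple[str, str, str]:
    """Split 'type:namespace:name'; any further colons stay in the name."""
    kind, namespace, name = node_id.split(":", 2)
    return kind, namespace, name


async def fetch_metadata(node_id: str) -> dict:
    """Hypothetical placeholder for a per-node metadata request."""
    await asyncio.sleep(0)  # stands in for network I/O
    kind, namespace, name = parse_node_id(node_id)
    return {"id": node_id, "type": kind, "namespace": namespace, "name": name}


async def hydrate(response: dict) -> dict[str, dict]:
    """Fetch metadata for every node in the graph concurrently."""
    ids = node_ids(response)
    results = await asyncio.gather(*(fetch_metadata(i) for i in ids))
    return dict(zip(ids, results))
```

Calling `asyncio.run(hydrate(SAMPLE_RESPONSE))` returns one metadata entry per node, so the graph can be drawn from the edges right away and filled in as the metadata arrives.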

wslulciuc commented 9 months ago

@davidjgoss I've responded partially to your proposal (and what we plan as a solution) in https://github.com/MarquezProject/marquez/issues/2543 (see comments). First, you've done an amazing job at outlining the context. I'll touch upon your comments below.

Finally, a lot of performance could be gained by making this endpoint just return the node ids and edges - the metadata for involved jobs and datasets can then be pulled async in separate requests.

I agree. We should have two modes here: light and heavy (I can't think of better names). Anyways, we should've introduced this sooner. A light lineage query would return only the nodeIDs (as you've outlined above) while heavy would return lineage with metadata for dataset and job nodes pre-fetched.
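
One possible shape for the two modes, purely as a sketch: the same edge list is always returned, and heavy mode additionally embeds metadata for every node. The `Mode` enum, the `shape_response` function and the `nodes` key are assumptions made up here, not an existing Marquez API.

```python
from enum import Enum


class Mode(Enum):
    LIGHT = "light"  # node ids and edges only
    HEAVY = "heavy"  # edges plus pre-fetched node metadata


def shape_response(
    edges: list[tuple[str, str]],
    metadata: dict[str, dict],
    mode: Mode = Mode.LIGHT,
) -> dict:
    """Build a lineage response body for the requested mode."""
    response: dict = {"edges": [{"source": s, "target": t} for s, t in edges]}
    if mode is Mode.HEAVY:
        # Include metadata for every node that appears in an edge.
        ids = {node_id for edge in edges for node_id in edge}
        response["nodes"] = {node_id: metadata[node_id] for node_id in sorted(ids)}
    return response
```

Keeping the edge list identical in both modes would let a client switch between them without changing how it draws the graph.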

Another consideration is column lineage - could this also be made to offer static vs cumulative flavours? This could be difficult given the column-level lineage relations exist directly between dataset versions and fields.

Yes, but it doesn't have to. I think we should evaluate the level of effort to make this an option! Let's discuss further and work together on a proposal.
