Dataset missing from lineage graph

yonivy commented 1 year ago

A colleague of mine opened an issue in the OpenLineage repo and has received no response so far, so perhaps this is the right place to post issues :)

The issue we are facing is that Marquez seems to break lineage if the same logical job produces different datasets on different runs. Our reality (and, I believe, that of others as well) is that our processes are dynamic in their output. I do not think this is an edge case.

The use case is this:

We have a logical ETL job which is scheduled to run a few times during the day.
The job pushes data into tables based on the contents of the input files (which are in S3).

Example

The example below is super simplified but I believe it paints the right picture.

Job name: users_etl
Job input: The last modified file(s) found in the path template s3:///users/{yyyy}/{mm}/{dd}

Run no. 1

The input file contains nested user info (first_name, last_name, email, address: {city, state}), so the job will update the users table (which has the first_name, last_name and email columns) and the users_address table (which has the city and state columns).

Output:

users table
users_address table

Run no. 2

The input file contains flat user info (first_name, last_name, email) so the job will update the users table (which has the first_name, last_name and email columns).

Output:

users table
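To make the two runs concrete, here is a minimal sketch of the events each run might emit. The top-level field names (eventType, eventTime, run, job, outputs) follow the general shape of an OpenLineage run event, but the events are trimmed for illustration (inputs, producer and schemaURL are left out), and the namespace "example-warehouse" and the helper name build_complete_event are assumptions, not taken from the actual job.

```python
from datetime import datetime, timezone
from uuid import uuid4

NAMESPACE = "example-warehouse"  # hypothetical namespace, not from the real job


def build_complete_event(output_tables):
    """Build a trimmed, OpenLineage-style COMPLETE event for users_etl."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid4())},  # a new run ID for every execution
        "job": {"namespace": NAMESPACE, "name": "users_etl"},  # same job each time
        "outputs": [{"namespace": NAMESPACE, "name": t} for t in output_tables],
    }


run_1 = build_complete_event(["users", "users_address"])  # nested input file
run_2 = build_complete_event(["users"])                   # flat input file
```

Both events carry the same job identity and differ only in their run ID and output list, which is exactly the situation described above.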

The Problem

In Marquez I can only see the users table in the lineage of the users_etl job. The users_address dataset gets orphaned.

The state after Run no. 1

Everything is as expected.

The state after Run no. 2

Only the latest output is displayed, and the previous output is now completely detached from the lineage graph!
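The behaviour can be reproduced with a toy model in which the visible graph is built only from the most recent run of a job. This is a sketch of the observed effect under that assumption, not Marquez's actual implementation; the function name and data layout are made up for illustration.

```python
def latest_run_outputs(runs):
    """Return the outputs of the most recent run only (toy 'current graph')."""
    return set(runs[-1]["outputs"]) if runs else set()


runs = [
    {"run": 1, "outputs": ["users", "users_address"]},
    {"run": 2, "outputs": ["users"]},
]

visible = latest_run_outputs(runs)
ever_written = {name for run in runs for name in run["outputs"]}
orphaned = ever_written - visible  # written at some point, but no longer linked

print(sorted(visible))   # ['users']
print(sorted(orphaned))  # ['users_address']
```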

The Expectation

I expected to keep seeing the users_address table in the lineage graph. Without it, all I'm getting is last-run lineage. That is useful in some cases, but it presents a confusing picture that does not reflect the real relationships between jobs and datasets. What am I supposed to conclude about the users_address table, that it simply popped into existence?

boring-cyborg[bot] commented 1 year ago

Thanks for opening your first issue in the Marquez project! Please be sure to follow the issue template!

rkrao89 commented 1 year ago

We have the exact same issue while working with OpenLineage and Spark. It would be great if this gets fixed soon. Without this it's almost unusable.

yonivy commented 1 year ago

@rkrao89 I very much agree with you. Unfortunately there's zero response from the maintainers which is a shame because the project looked very promising to us.

mobuchowski commented 1 year ago

@yonivy please do not be so judgmental of open source project maintainers, who deliver code for you without any expectations... especially since the solution is actively being worked on in the OpenLineage repo, as that is the source of the problem, not the UI/backend part.

yonivy commented 1 year ago

No judgment here, just observing the state of my question. It's also fine if OpenLineage won't solve my case (even though I think it's a common one); I was just hoping for some response. In any case, I'll just say that I was very happy to find out that OpenLineage exists and I appreciate the work that open-source maintainers do, so apologies if it came out wrong.

As for the draft PR you linked, it seems new, so I did not see it when I asked my question a month ago. It also seems Spark-specific (isn't it?), so it probably won't solve my case. I'll subscribe to it, so I appreciate the link :)

mobuchowski commented 1 year ago

Some issues unfortunately fall through the cracks. Fortunately, we were already aware of the problem when you created this one. Thanks for understanding, we hope to solve the problem soon 🙂

githubopenlineageissues commented 1 year ago

Hi @mobuchowski, I may be wrong, but it seems the repo/PR addresses https://github.com/OpenLineage/OpenLineage/issues/1965, which the OP referenced as raised by his coworker in the past. The issue the OP describes on this page, lineage not showing users_address, seems to be inherent to Dynamic Lineage: the latest run shows what the latest run knows, and it has no memory of prior runs. Maybe static lineage can come to the rescue; I would be really curious to know whether a solution exists for the issue reported by the OP on this page. Thank you again for the great work the community is doing.

mobuchowski commented 1 year ago

@githubopenlineageissues we want to recognize those jobs as inherently different - let's say you have a Spark job or microservice, or even CI task that copies data from A to B - but you provide those A and B when running the job. So, in reality, their only common thing is the fact they share code - but they are different "instances" of those jobs. This is logically similar to tasks in Airflow - you can have multiple PostgresOperators in a DAG, but that does not mean they are the same OpenLineage job.
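As a rough illustration of treating each parameterised "instance" as its own job, the sketch below derives the job name from the source and target passed at run time. The naming scheme and the helper instance_job_name are hypothetical; this is not how any OpenLineage integration is confirmed to name jobs.

```python
def instance_job_name(base_name, source, target):
    """Derive a distinct job name for each (source, target) pair."""
    return f"{base_name}.{source}_to_{target}"


# Three executions of the same code; two of them copy A to B.
executions = [("A", "B"), ("C", "D"), ("A", "B")]
job_names = {instance_job_name("copy_data", s, t) for s, t in executions}

print(sorted(job_names))  # ['copy_data.A_to_B', 'copy_data.C_to_D']
```

Executions that share parameters collapse into one job, while executions with different parameters stay separate even though they share code.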

AryamanMishra commented 9 months ago

Hey we are having the same issue of orphaned datasets, pretty similar to https://github.com/OpenLineage/OpenLineage/issues/1965. Any leads?

wslulciuc commented 7 months ago

@yonivy: you raise a very good point (and I apologize to those on the thread for not getting back until now). I agree that this is broader than any specific OL integration (like Spark or Airflow), but there are some challenges in ensuring the lineage graph's completeness. But first, let me outline what Marquez supports for lineage:

Static Lineage: represents the current graph (i.e. the most recent OL events that have been collected on the backend; this is what you are seeing now).
Column-Level Lineage: given a runID, returns the column lineage at the time of job execution (relative to the run).

The Marquez model captures lineage from run to run, and that run-level lineage metadata can be queried, but there isn't an API (yet!) that, given a runID, will return a lineage snapshot at the time of job execution (similar to column lineage). We do have a proposal that would help resolve what you (and others) are seeing, and it is on our roadmap. The API will support Run-level Lineage: given a runID, it will return the edges that are no longer present in the static lineage graph.
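A minimal sketch of what such a run-level lookup could compute, assuming edges are stored per run and the static graph is taken from the latest run. The run IDs, the in-memory RUN_EDGES store and both function names are hypothetical; the real API's shape is not described in this thread.

```python
# run ID -> set of (job, output dataset) edges recorded for that run
RUN_EDGES = {
    "run-1": {("users_etl", "users"), ("users_etl", "users_address")},
    "run-2": {("users_etl", "users")},
}
LATEST_RUN = "run-2"  # assumption: the static graph mirrors the latest run


def run_lineage(run_id):
    """Return the lineage snapshot (edges) as of the given run."""
    return RUN_EDGES[run_id]


def edges_missing_from_static(run_id):
    """Return edges seen in the given run that the static graph no longer has."""
    return run_lineage(run_id) - RUN_EDGES[LATEST_RUN]


print(edges_missing_from_static("run-1"))  # {('users_etl', 'users_address')}
```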

The challenge of lineage graph completeness is that we would have to assume (and this would be a big assumption) that if a dataset was present on run 1 but is no longer present on run 2, then either 1) it wasn't intended or 2) it was, and we should merge the edges from run to run. We are making significant improvements to the UI (see the PR from @phixMe) that will make viewing static, column-level and soon run-level lineage more intuitive, and will also display the highly dimensional model of Marquez in a more exploratory way.
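One way to merge edges without guessing intent is to keep the union of outputs together with the last run that wrote each one, so a consumer can mark older edges as stale instead of dropping them. This is only a sketch of that idea; merge_edges and the data layout are assumptions, not part of Marquez.

```python
def merge_edges(runs):
    """Union outputs across runs, remembering the last run that wrote each one."""
    last_seen = {}
    for run in runs:
        for dataset in run["outputs"]:
            last_seen[dataset] = run["run"]
    return last_seen


runs = [
    {"run": 1, "outputs": ["users", "users_address"]},
    {"run": 2, "outputs": ["users"]},
]

merged = merge_edges(runs)
latest = runs[-1]["run"]
# Datasets not written by the latest run stay in the graph, flagged as stale.
stale = sorted(d for d, r in merged.items() if r < latest)

print(merged)  # {'users': 2, 'users_address': 1}
print(stale)   # ['users_address']
```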

I've added this issue to our roadmap and will link it when we start working on run-level lineage (which will be within the next month or so). I hope this helps to clarify things. It doesn't solve the issue now, but I hope you will find the run-level lineage API useful.

I would love initial thoughts on what I've outlined here (but also in my proposal) from yourself and anyone who has run into this issue (@rkrao89, @yonivy, @AryamanMishra).

MarquezProject / marquez