datahub-project / datahub

The Metadata Platform for your Data and AI Stack
https://datahubproject.io
Apache License 2.0

Onboard Job/Flow Entity #1659

Closed liangjun-jiang closed 4 years ago

liangjun-jiang commented 4 years ago

Is your feature request related to a problem? Please describe. This feature request is a continuation of "add lineage workflow schedule support" and an implementation of the Jobs & Flows entities item on the Roadmap. The intention of this new issue is to bring in the design for discussion. The majority of the implementation is finished.

The problem or feature request statement

  1. The lineage among datasets presents the relationships among datasets. Take the following SQL example, script1:

    INSERT INTO TABLE mytable
    SELECT c1.cnt, c2.cnt FROM
      (SELECT count(*) AS cnt FROM test2) AS c1
    CROSS JOIN
      (SELECT count(*) AS cnt FROM test3) AS c2;

    mytable is derived from the subqueries aliased c1 and c2, and therefore from test2 and test3. Illustrated as a lineage graph, this is presented as follows.

(Screenshot "Screen Shot 2020-05-06 at 9 38 20 AM": lineage graph for the example above)

However, this lineage graph doesn't show that it is script1 that extracts the columns from c1 and c2 and forms mytable. In the real ETL world, it is common that an ETL job, an Airflow task, or a Kafka consumer or producer job produces a new dataset, so it is important to represent the job or flow in DataHub.

As @clojurians-org proposed, this feature can be implemented by a few steps.

Dataset Lineage

Step 1. Present the job information as part of a dataset's UpstreamLineage or DownstreamLineage. Assuming a job has the URN urn:li:job:(urn:li:dataPlatform:hbase,JohnDoe,DEV), we will include this job in the UpstreamLineage as follows:

 {
    "com.linkedin.dataset.UpstreamLineage": {
        "upstreams": [
            {
                "auditStamp": {
                    "time": 0,
                    "actor": "urn:li:corpuser:kzhang10"
                },
                "dataset": "urn:li:dataset:(urn:li:dataPlatform:hbase,barUp,PROD)",
                "job": "urn:li:job:(urn:li:dataPlatform:hbase,JohnDoe,DEV)",
                "type": "TRANSFORMED"
            }
        ]
    }                  
  },

To better understand the difference, here is the before:

{
    "com.linkedin.dataset.UpstreamLineage": {
        "upstreams": [
            {
                "auditStamp": {
                    "time": 0,
                    "actor": "urn:li:corpuser:kzhang10"
                },
                "dataset": "urn:li:dataset:(urn:li:dataPlatform:hbase,barUp,PROD)",
                "type": "TRANSFORMED"
            }
        ]
    }                  
  },
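The delta between the two payloads is a single optional field on each upstream entry. A minimal Python sketch of that upgrade (the attach_job helper is hypothetical, purely for illustration, not part of DataHub):

```python
def attach_job(upstream: dict, job_urn: str) -> dict:
    """Return a copy of an UpstreamLineage entry with the optional
    'job' field added, leaving the original entry untouched."""
    upgraded = dict(upstream)
    upgraded["job"] = job_urn
    return upgraded

# The "before" shape of an upstream entry:
before = {
    "auditStamp": {"time": 0, "actor": "urn:li:corpuser:kzhang10"},
    "dataset": "urn:li:dataset:(urn:li:dataPlatform:hbase,barUp,PROD)",
    "type": "TRANSFORMED",
}

# The "after" shape, now carrying the job that produced the dataset:
after = attach_job(before, "urn:li:job:(urn:li:dataPlatform:hbase,JohnDoe,DEV)")
```

Because the field is optional, existing entries without a job remain valid.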

This also requires some UI changes. The proposed change is highlighted in the following screenshot (job-added).

Job Entity

Step 1 also includes the definition of the Job entity and its aspect. The initial proposal is as follows. Similar to Dataset, Job has four basic fields: urn, name, platform and origin. In addition, Job has an aspect named JobInfo with the following fields:

{
  "type": "record",
  "name": "JobInfo",
  "namespace": "com.linkedin.job",
  "doc": "The inputs and outputs of this job",
  "fields": [
    {
      "name": "inputs",
      "type": {
        "type": "array",
        "items": "com.linkedin.common.DatasetUrn"
      },
      "doc": "the inputs of the job",
      "optional": true
    },
    {
      "name": "outputs",
      "type": {
        "type": "array",
        "items": "com.linkedin.common.DatasetUrn"
      },
      "doc": "the outputs of the job",
      "optional": true
    }
  ]
}
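To make the shape of this aspect concrete, here is a small Python mirror of the schema (the JobInfo dataclass and its URN check are illustrative assumptions, not DataHub code):

```python
from dataclasses import dataclass
from typing import List, Optional

DATASET_URN_PREFIX = "urn:li:dataset:"

@dataclass
class JobInfo:
    """Python mirror of the proposed JobInfo aspect: two optional
    arrays of dataset URNs, the job's inputs and outputs."""
    inputs: Optional[List[str]] = None
    outputs: Optional[List[str]] = None

    def __post_init__(self):
        # Both arrays hold DatasetUrn values, so reject anything else.
        for urn in (self.inputs or []) + (self.outputs or []):
            if not urn.startswith(DATASET_URN_PREFIX):
                raise ValueError(f"not a dataset URN: {urn}")

# Matching the script1 example: test2 in, mytable out.
info = JobInfo(
    inputs=["urn:li:dataset:(urn:li:dataPlatform:hbase,test2,DEV)"],
    outputs=["urn:li:dataset:(urn:li:dataPlatform:hbase,mytable,DEV)"],
)
```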

We will also need to change Upstream.pdsc and Downstream.pdsc to add a job urn as follows:

...redacted 
{
      "name": "dataset",
      "type": "com.linkedin.common.DatasetUrn",
      "doc": "The upstream dataset the lineage points to"
    },
    {
      "name": "job",
      "type": "com.linkedin.common.JobUrn",
      "doc": "The upstream job the lineage associates with",
      "optional": true
    },
...redacted

We also need to change Snapshot.pdsc to include JobSnapshot. With the changes above, we will be able to present a job as part of a dataset's lineage.


As a note, all of the features above have been implemented. If this design review is approved, a pull request will be filed.

Fully Onboard Job Entity

To fully onboard the Job entity, we will need to follow the [How to onboard an entity](How to onboard an entity) guide to support the following features:

Full REST APIs for Job, such as create, get, update, delete, and search. We also need to be able to query subresources of Job. The current implementation status is as follows:

  - [x] create a job
  - [x] get a job
  - [x] get the jobInfo resource in a job
  - [ ] search for a job
  - [ ] update a job
  - [ ] delete a job


hshahoss commented 4 years ago

Hi @loftyet, The overall design looks good and aligns with what we have been thinking as well. I would like to propose a few things:

  1. It will be better to call the entity DataJob to disambiguate from other meanings of the term "job".

  2. It is better to separate out Job input and outputs as separate aspects so that they can be independently updated while the job is running. So two aspects (DataJobInput and DataJobOutput) as below:

{
  "type": "record",
  "name": "DataJobInput",
  "namespace": "com.linkedin.datajob",
  "doc": "The inputs of a data job",
  "fields": [
    {
      "name": "inputs",
      "type": {
        "type": "array",
        "items": "com.linkedin.common.DatasetUrn"
      },
      "doc": "the inputs of the job"
    }
  ]
}
{
  "type": "record",
  "name": "DataJobOutput",
  "namespace": "com.linkedin.datajob",
  "doc": "The outputs of a data job",
  "fields": [
    {
      "name": "outputs",
      "type": {
        "type": "array",
        "items": "com.linkedin.common.DatasetUrn"
      },
      "doc": "the outputs of the job"
    }
  ]
}
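The point of the split is independent writes. A Python sketch with an in-memory stand-in for an aspect store (all names here are hypothetical) showing inputs being recorded mid-run without touching outputs:

```python
# In-memory stand-in for an aspect store: one entry per aspect, so
# DataJobInput can be rewritten while DataJobOutput stays untouched.
store = {
    "DataJobInput": {"inputs": []},
    "DataJobOutput": {"outputs": []},
}

def update_aspect(store: dict, name: str, payload: dict) -> None:
    """Replace a single aspect; other aspects of the entity are unaffected."""
    store[name] = payload

# While the job is running, inputs often become known first:
update_aspect(store, "DataJobInput",
              {"inputs": ["urn:li:dataset:(urn:li:dataPlatform:hbase,test2,DEV)"]})

# Outputs land later, in a separate write that never races the input update:
update_aspect(store, "DataJobOutput",
              {"outputs": ["urn:li:dataset:(urn:li:dataPlatform:hbase,mytable,DEV)"]})
```

With a single combined aspect, each of those writes would have to read and rewrite the whole record.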

Additionally you will likely have an additional aspect of DataJobInfo which would capture basic job information like name, Id, job type, scheduler, job parameters etc.

  3. I am interested in learning more about your representation of JobUrn. Data platform may not make sense for all jobs. A suggestion is to think a bit more about what would uniquely represent a data job. Maybe the job name/Id, job type, and type of job scheduler (Azkaban, Airflow) would better represent a data job. We can discuss more on that.

  4. You mentioned that in the UI you want to show job information on hover/click on the edge. Since we are representing DataJob as a first-class entity, an alternative is to also show DataJobs as nodes in the graph, connecting to Datasets.

liangjun-jiang commented 4 years ago

Hi @hshahoss, thanks for the feedback. Feedback 1 & 2 are accepted and will be adopted. In terms of other aspects of a DataJob, name has been factored in: name, platform and origin (fabric type) are the three basic fields. I am thinking of adding other information such as job type, scheduler, etc., as you suggested, once there is more feedback or there are other use cases.

  1. DataJob will be a first-class entity in DataHub, and it will have its own place in the UI. For example, the current dropdown menu lets a user choose between datasets and people; I think we also need to add DataJob there. There will be more dedicated UI & UX for DataJob; I would imagine it will be similar to datasets, though I have not gone through all the details. Introducing the lineage change in datasets has broken the UI, as I have seen.
  2. Internally, we also had that debate about whether DataJob should be a node. My first thought was to have DataJob as a node, but when I drew it out, I found the graph more confusing to interpret because of the extra entity. Implementation-wise, I have not done Ember.js development before, so I can't say whether it is possible or how much effort it would take. I think the first step is to let the edge carry the meaning of the DataJob; that way the feature is easier to implement and interpret, and we can have more discussion or PRs later.
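The two representations under debate can be sketched as graph encodings (plain dicts, illustrative only, using the script1 example): job-on-edge keeps the dataset-to-dataset topology, while job-as-node introduces an extra vertex per job:

```python
# Option 1: the DataJob is an attribute on the dataset-to-dataset edge.
# The graph still has only dataset vertices.
edge_model = [
    {"from": "test2", "to": "mytable", "job": "script1"},
    {"from": "test3", "to": "mytable", "job": "script1"},
]

# Option 2: the DataJob is a first-class node sitting between datasets.
# One extra vertex, and each original edge splits in two.
node_model = [
    {"from": "test2", "to": "script1"},
    {"from": "test3", "to": "script1"},
    {"from": "script1", "to": "mytable"},
]
```

The trade-off discussed above is visible in the counts: the node model carries more edges and vertices for the same lineage fact.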
mars-lan commented 4 years ago

For 2: are inputs & outputs predeclared as part of the job config, or is this information derived when the job is executed? I feel like there's some overlap between these and what's captured in a dataset's UpstreamLineage.

liangjun-jiang commented 4 years ago

Ideally, it should be derived when the job is executed. I think it's also what @clojurians-org is working on.

liangjun-jiang commented 4 years ago

@hshahoss @mars-lan are you guys open to me starting to send PRs? I plan to split the whole feature into smaller steps to make it easier to review. I have planned the steps:

  1. Define the urn, aspects, entity and actionBuilder so the new entity can be ingested from ETL.
  2. Define the rest.li APIs so the new entity can be created and fetched from endpoints, with documentation in the GMS readme. These two steps won't create model-compatibility issues.
  3. Add the job urn into the upstream lineage of a dataset. This change will create a model-compatibility issue, and might also break the frontend UI.
hshahoss commented 4 years ago

Hi @loftyet Yes happy to review PR for the work and makes sense to split the PR. I would suggest you create a separate PR for the urn itself as the first step. Then next PR can include aspects, entity and aspect builder.

I am not completely clear on the design of the urn. We can discuss the design here or on slack if you want to share it before the PR or we can discuss it in the PR directly.

clojurians-org commented 4 years ago

attached


liangjun-jiang commented 4 years ago

Just want to give an update on the datajob urn design, after discussing with @hshahoss and as proposed by @hshahoss. We will drop platform from the job urn definition and use the datajob orchestrator instead. The reason is that platform refers to things like hbase, mysql, etc.; a datajob transforms data from one platform and, most likely, to another, so platform won't represent a datajob well.

We will also start with a string to represent the job orchestrator. It could be Apache Airflow, Azure Data Factory, etc.

Finally, the datajob urn should look like "datajob": "urn:li:datajob:(airflow,JohnDoe,DEV)".
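For reference, a small Python sketch of building and parsing a urn in this agreed shape (helper names are hypothetical, not DataHub's actual urn classes):

```python
import re

def make_datajob_urn(orchestrator: str, name: str, origin: str) -> str:
    """Build a datajob urn in the agreed shape,
    e.g. urn:li:datajob:(airflow,JohnDoe,DEV)."""
    return f"urn:li:datajob:({orchestrator},{name},{origin})"

def parse_datajob_urn(urn: str):
    """Split a datajob urn back into (orchestrator, name, origin)."""
    m = re.fullmatch(r"urn:li:datajob:\(([^,]+),([^,]+),([^)]+)\)", urn)
    if m is None:
        raise ValueError(f"not a datajob urn: {urn}")
    return m.groups()

urn = make_datajob_urn("airflow", "JohnDoe", "DEV")
```

Note that the first part is now the orchestrator string (airflow, azure-data-factory, ...) rather than a dataPlatform urn.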