Hi @loftyet, the overall design looks good and aligns with what we have been thinking as well. I would like to propose a few things:

1. It would be better to call the entity DataJob, to disambiguate it from other meanings of the term "job".
2. It is better to separate job inputs and outputs into separate aspects so that they can be updated independently while the job is running. So, two aspects (DataJobInput and DataJobOutput) as below:
```json
{
  "type": "record",
  "name": "DataJobInput",
  "namespace": "com.linkedin.datajob",
  "doc": "The inputs of a data job",
  "fields": [
    {
      "name": "inputs",
      "type": {
        "type": "array",
        "items": "com.linkedin.common.DatasetUrn"
      },
      "doc": "the inputs of the job"
    }
  ]
}
```
```json
{
  "type": "record",
  "name": "DataJobOutput",
  "namespace": "com.linkedin.datajob",
  "doc": "The outputs of a data job",
  "fields": [
    {
      "name": "outputs",
      "type": {
        "type": "array",
        "items": "com.linkedin.common.DatasetUrn"
      },
      "doc": "the outputs of the job"
    }
  ]
}
```
Additionally, you will likely want an additional aspect, DataJobInfo, which captures basic job information like name, ID, job type, scheduler, job parameters, etc.
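A minimal sketch of what such an aspect could look like, following the PDSC style above (the field set here is illustrative, not a settled schema):

```json
{
  "type": "record",
  "name": "DataJobInfo",
  "namespace": "com.linkedin.datajob",
  "doc": "Basic information about a data job",
  "fields": [
    {
      "name": "name",
      "type": "string",
      "doc": "Job name"
    },
    {
      "name": "type",
      "type": "string",
      "doc": "Type of the job, e.g. a SQL script, a Spark job, a Kafka consumer",
      "optional": true
    },
    {
      "name": "scheduler",
      "type": "string",
      "doc": "Scheduler that runs the job, e.g. azkaban, airflow",
      "optional": true
    },
    {
      "name": "parameters",
      "type": {
        "type": "map",
        "values": "string"
      },
      "doc": "Job parameters as key-value pairs",
      "optional": true
    }
  ]
}
```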
3. I am interested in learning more about your representation of JobUrn. A data platform may not make sense for all jobs. A suggestion is to think a bit more about what would uniquely identify a data job: perhaps the job name/ID, job type, and the type of job scheduler (Azkaban, Airflow) would represent a data job better. We can discuss this more.
4. You mentioned that in the UI you want to show job information on hover/click on an edge. Since we are representing DataJob as a first-class entity, an alternative is to also show DataJobs as nodes in the graph, connected to Datasets.
Hi @hshahoss, thanks for the feedback.

Feedback 1 & 2 are accepted and will be adopted. In terms of other aspects of a DataJob, `name` has been factored in: `name`, `platform`, and `origin` (fabric type) are the three basic aspects. I am thinking of adding other information such as `job type`, `scheduler`, etc., as you suggested, as more feedback or other use cases come in.

Besides `dataset` and `people`, I think we also need to add `DataJob`. There will be a more dedicated UI & UX for `DataJob`; I would imagine it will be similar to `dataset`. I have not gone through all the details, but introducing the `lineage` change in `dataset` has broken the UI, as I have seen.

On showing `DataJob` as a node: my first thought was also to have `DataJob` as a node, but when I drew it out, I found the graph more confusing to interpret because of the extra entity. Implementation-wise, I have not done ember.js development before, so I can't say whether it is possible or how much effort it would take. I think the first step is to let the edge carry the meaning of `DataJob`; that way the feature is easier to implement and interpret, and we can have more discussion or PRs later.

For 2: are `inputs` & `outputs` predeclared as part of the job config, or is this information derived when the job is executed? I feel like there's some overlap between these and what's captured in a dataset's `UpstreamLineage`.

Ideally, it should be derived when the job is executed. I think that's also what @clojurians-org is working on.
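If they are derived at execution time, the job (or an extractor wrapped around it) would emit the `DataJobInput`/`DataJobOutput` aspects with the concrete datasets touched during the run. A sketch of a `DataJobInput` aspect value under that assumption, reusing the dataset urn format DataHub already has (the hive platform and table names are placeholders):

```json
{
  "inputs": [
    "urn:li:dataset:(urn:li:dataPlatform:hive,db.tablec1,PROD)",
    "urn:li:dataset:(urn:li:dataPlatform:hive,db.tablec2,PROD)"
  ]
}
```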
@hshahoss @mars-lan, are you guys open to me starting to send PRs? I imagine splitting the whole feature into smaller steps to make it easier to review. I have planned the steps:

1. `urn`, `aspects`, and `entity`.
2. actionBuilder, so the new entity can be ingested from ETL, plus the gms readme. Those two steps won't create a model compatibility issue.
3. `job urn` into the upstream lineage of a dataset. This change will create a model compatibility issue, and might also break the frontend UI.

Hi @loftyet, yes, happy to review PRs for the work, and it makes sense to split them. I would suggest you create a separate PR for the urn itself as the first step. Then the next PR can include the aspects, entity, and aspect builder.
I am not completely clear on the design of the urn. We can discuss the design here or on Slack if you want to share it before the PR, or we can discuss it in the PR directly.
attached
Just want to give an update about the `datajob` urn design, after discussing with @hshahoss, and as proposed by @hshahoss. We will drop `platform` from the `datajob` urn definition and use the job orchestrator instead.

The reason is that `platform` has so far referred to systems such as `hbase`, `mysql`, etc. A `datajob` transforms data from one platform and, most likely, to another platform, so a single `platform` won't represent a `datajob` well.

We will also start with a plain `string` to represent the job `orchestrator`; it could be Apache Airflow, Azure Data Factory, etc.

Finally, the `datajob` urn should look like `"datajob": "urn:li:datajob:(airflow,JohnDoe,DEV)"`.
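For illustration, the urn could be modeled as a typeref in the same way as the existing `DatasetUrn`; this is only a sketch, and the namespace and Java coercer class are assumptions:

```json
{
  "type": "typeref",
  "name": "DataJobUrn",
  "namespace": "com.linkedin.common",
  "doc": "Standardized data job identifier, e.g. urn:li:datajob:(airflow,JohnDoe,DEV)",
  "ref": "string",
  "java": {
    "class": "com.linkedin.common.urn.DataJobUrn"
  }
}
```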
**Is your feature request related to a problem? Please describe.**

This feature request is a continuation of "add lineage workflow schedule support", and also the implementation of "Jobs & Flows as entities" within the Roadmap. The intention is to use this new issue to bring in the design and open it up for discussion. The majority of the implementation is finished.

**The problem or feature request statement**
In a `script1` SQL example, `mytable` comes from tables `c1` and `c2`. Illustrated as a lineage graph, it is presented as follows. However, this lineage graph doesn't really show that it is `script1` which extracts columns from `c1` and `c2` and forms `mytable`. In the real ETL world, it is common that an ETL job, an Airflow-scheduled task, or a Kafka consumer or producer job is what forms a new dataset, so it is important to represent the job or flow in DataHub.

As @clojurians-org proposed, this feature can be implemented in a few steps.
### Dataset Lineage

Step 1: Present the job information as part of a dataset's `UpstreamLineage` or `DownstreamLineage`. Assume a `job` has a URN such as `urn:li:job:(urn:li:dataPlatform:hbase,JohnDoe,DEV)`; we will include this `job` in the `UpstreamLineage` as follows. To understand the difference better, here is the before.
This also requires some UI changes. The proposed change is highlighted as follows.
### Job Entity

Step 1 also includes the definition of the `Job` entity and its `aspect`. The initial proposal is as follows. Similar to `Dataset`, `Job` has four basic fields: `urn`, `name`, `platform`, and `origin`. In the meantime, `Job` has an aspect named `JobInfo` with the following fields.

We will also need to change UpStream.pdsc and Downstream.pdsc to add `jobUrn` as follows, and we also need to change Snapshot.pdsc to include `JobSnapshot`. With the changes above, we will be able to present a `job` as part of the lineage of a dataset!
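The schema diffs were attached as images; as a sketch, adding an optional `jobUrn` to Upstream.pdsc could look like this (the surrounding fields mirror the existing `Upstream` record, and `com.linkedin.common.JobUrn` is the proposed, not-yet-existing urn type):

```json
{
  "type": "record",
  "name": "Upstream",
  "namespace": "com.linkedin.dataset",
  "doc": "Upstream lineage information about a dataset",
  "fields": [
    {
      "name": "auditStamp",
      "type": "com.linkedin.common.AuditStamp",
      "doc": "Audit stamp containing who reported the lineage and when"
    },
    {
      "name": "dataset",
      "type": "com.linkedin.common.DatasetUrn",
      "doc": "The upstream dataset the lineage points to"
    },
    {
      "name": "type",
      "type": "DatasetLineageType",
      "doc": "The type of the lineage"
    },
    {
      "name": "jobUrn",
      "type": "com.linkedin.common.JobUrn",
      "doc": "The job that produced this lineage edge (proposed addition)",
      "optional": true
    }
  ]
}
```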
### Fully Onboard the Job Entity

To fully onboard the `Job` entity, we will need to follow [How to onboard an entity](How to onboard an entity) to provide these features: full REST APIs for `Job`, such as `create`, `get`, `update`, `delete`, and `search`. We also need to be able to query the subresources of `Job`. The current implementation status is as follows:

- [x] create a job
- [x] get a job
- [x] get the `jobInfo` resource of a job
- [ ] search a job
- [ ] update a job
- [ ] delete a job