To rerun, need to address that job_details.spark_code has the Job hardcoded in.
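A rough sketch of what that fix might look like, purely hypothetical (the job_details JSON layout, the spark_code key's contents, and the bare-integer id pattern are all assumptions, not Combine's actual format):

```python
import json
import re

def repoint_spark_code(job_details_json, old_job_id, new_job_id):
    """Hypothetical sketch: rewrite a Job id baked into saved Spark code."""
    details = json.loads(job_details_json)
    # assumption: the old Job id appears as a bare integer in the code string
    details['spark_code'] = re.sub(
        r'\b%d\b' % old_job_id,
        str(new_job_id),
        details['spark_code'])
    return json.dumps(details)
```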
POC working. But it's revealing that cjob.rerun() relies on the rerun_jobs route to update some Job attributes. Consider having both clone and rerun offer the option to run as a method, with a flag, that would handle these updates without views.
e.g.:

```python
import datetime

# reset attributes on the Job being rerun
re_job.timestamp = datetime.datetime.now()
re_job.status = 'initializing'
re_job.record_count = 0
re_job.finished = False
re_job.elapsed = 0
re_job.url = None
re_job.deleted = True
re_job.save()
```
Also, a flag for clone() to automatically rerun the Job. To that end, should it drop Records? Probably, as copied Records would be inaccurate after a rerun. And so it might automatically handle these things? A sketch of this follows.
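A minimal sketch of that flag (not Combine's actual implementation; every helper name here is hypothetical):

```python
# Sketch: clone() with a rerun flag that also drops copied Records,
# since they would be stale once the clone reruns.
class CombineJob:

    def copy_settings(self):
        """Hypothetical: copy input Jobs / harvest params to a new Job."""
        return CombineJob()

    def drop_records(self):
        """Hypothetical: remove Records attached to this Job."""

    def rerun(self):
        """Hypothetical: re-execute this Job's Spark code."""

    def clone(self, rerun=False):
        cloned = self.copy_settings()
        if rerun:
            cloned.drop_records()  # copied Records would be inaccurate post-rerun
            cloned.rerun()
        return cloned
```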
POC working. Next to address: downstream Jobs.
Cloning would be particularly helpful if it could clone downstream Jobs as well. To do this, would need to clone those Jobs, but rewrite JobInput links and the input_jobs attribute from job_details to reflect the new Job ID of the cloned origin (and on down the line).
In a situation like the following, where cloning j2:
```
Harvest Job (j1) --> Transform (j2) --> Merge (j3)
```

Should result in:

```
Harvest Job (j1) --> Transform (j2) --> Merge (j3)
                          |
                          V
                     Transform (Clone) (j4) --> Merge (Clone) (j5)
```
Approach:

- get downstream lineage of j2, which would include j3
- when cloning j2, do not worry about rewriting input Job IDs, as j1 should remain the original
- when cloning j3, rewrite any input Job that matches a Job above in the lineage, e.g. the original input Job j2 should now be j4
Progress on downstream cloning, but encountering a scenario where E (Clone) should have an input Job of C (Clone), not C.
Problem is from simplistic -1 counting for the parent clone, which only works when Jobs are a single stream from the origin.
Better approach might be to get all parents (JobInput links) for a Job, and add them to a lookup hash, e.g.:

```python
clone_hash = {
    'C': 'C (Clone)'  # but keys/values will be Job ids, e.g. 641: 644
}
```

Then, when encountering any JobInput link, check the hash to see if the original parent has a clone, and use it if so.
Might also consider moving alteration of JobInput links and input_job_ids to a method, like alter_input_job_link().
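A sketch of what that method might look like, assuming a Django-style JobInput model with an input_job foreign key (the model and field names are taken from the discussion above, the behavior is assumed):

```python
def alter_input_job_link(job_input, clone_hash):
    """If the link's original parent has a clone, point the link at the clone."""
    # clone_hash maps original Job id -> cloned Job id, e.g. {641: 644}
    original_parent_id = job_input.input_job_id
    if original_parent_id in clone_hash:
        job_input.input_job_id = clone_hash[original_parent_id]
        job_input.save()
```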
Cloning now supports downstream and rerunning of cloned Jobs.
Example of cloning A, which creates an entirely new pipeline.
Example of cloning a single Job C, without downstream.
Example of cloning C, with downstream, picking up unique filters applied to Jobs.
Rerunning was problematic at first, as Spark had race conditions with the newly created Jobs in the MySQL db. It connects in a completely different thread/context than Django, or even background tasks, and appears not to be bound by the same read/lock transactions. This is both good and bad. The solution was to poll until the Job is created, timing out after 60 seconds.
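A sketch of that polling workaround (the fetch_job callable stands in for whatever fresh DB query the Spark side performs; names are illustrative, not the actual code):

```python
import time

def wait_for_job(job_id, fetch_job, timeout=60, interval=1):
    """Poll fetch_job(job_id) until it returns a Job, else raise on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        job = fetch_job(job_id)  # e.g. a fresh SELECT against MySQL
        if job is not None:
            return job
        time.sleep(interval)
    raise TimeoutError('Job %s not visible after %s seconds' % (job_id, timeout))
```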
Preparing to merge into dev after some more testing, and downstream additions to deleting and moving Jobs.
Closing, merging to dev.
Reopening: if cloning multiple Jobs, including downstream, there is potential for clones themselves being cloned.
It's hard to imagine a scenario where a newly created clone should, itself, be cloned. This is likely due to looping through selected Jobs and running cjob.job.get_downstream_lineage(). When B clones, it creates a D (CLONE), which has a tether to C. Then, when C clones, it sees this new downstream Job and thinks that it should be cloned as well.
What should likely happen, during cloning, is to make sure the Job to be cloned is not already a clone, which could be determined by checking against the clones dictionary.
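A sketch of that check, assuming a clones dictionary of original Job id to cloned Job id and a downstream-lineage helper (function and parameter names are hypothetical):

```python
def jobs_to_clone(selected_jobs, get_downstream, clones):
    """Collect Jobs to clone, skipping anything this pass already created."""
    already_cloned = set(clones.values())
    to_clone = []
    for job in selected_jobs:
        for candidate in [job] + get_downstream(job):
            if candidate.id in already_cloned:
                continue  # a fresh clone should not itself be cloned
            if candidate not in to_clone:
                to_clone.append(candidate)
    return to_clone
```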
This would have resulted in the desired lineage.
Fixed. Closing (again).
Inspired by the relatively new ability to rerun Jobs, this would add the ability to clone a Job.
This may sound similar to Duplicate/Merge, but would be quite different. The main utility would be to clone settings for a Job, including the input Jobs (if any) or Harvest parameters (if any).
Rerunning a Job that contains important, or even more critically, irreplaceable, Records is risky in that Records are removed before an attempt is made. It's conceivable there could be a rollback for reruns, temporarily detaching the Records from that Job until confirmed to delete, but cloning still has other utility.
It would be a handy way to test workflows and validations before applying them to a Job in situ that may be part of an important pipeline.
Considerations: