MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

Ability to clone Job #307

Closed: ghukill closed this issue 6 years ago

ghukill commented 6 years ago

Inspired by the relatively new ability to rerun Jobs, this would add the ability to clone a Job.

This may sound similar to Duplicate/Merge, but would be quite different. The main utility would be to clone settings for a Job, including the input Jobs (if any) or Harvest parameters (if any).

Rerunning a Job that contains important, or even more critically, irreplaceable, Records is risky in that the Records are removed before the rerun is attempted. It's conceivable there could be a rollback for reruns, temporarily detaching the Records from that Job until deletion is confirmed, but cloning still has other utility.

It would be a handy way to test workflows and validations before applying them to a Job in situ that may be part of an important pipeline.

Considerations:

ghukill commented 6 years ago

To rerun, need to address the fact that job_details.spark_code has the Job's ID hardcoded in it.
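For reference, a minimal sketch of what retargeting that hardcoded ID might look like; the JSON shape of job_details and the job_id=<int> pattern are assumptions here, not taken from the codebase:

```python
import json
import re

def retarget_spark_code(job, new_job_id):
    # Hypothetical helper: swap the Job ID hardcoded in
    # job_details.spark_code so a rerun/clone targets the new Job.
    # Assumes job_details is a JSON string with a 'spark_code' key
    # and that the old ID appears as 'job_id=<int>'.
    details = json.loads(job.job_details)
    details['spark_code'] = re.sub(
        r'job_id=%d\b' % job.id,
        'job_id=%d' % new_job_id,
        details['spark_code'])
    job.job_details = json.dumps(details)
    job.save()
```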

ghukill commented 6 years ago

POC working. But it's revealing that cjob.rerun() relies on the rerun_jobs route to update some Job attributes. Consider giving both clone and rerun an option, via a flag, to handle these updates themselves as methods, without the views.

e.g.

```python
re_job.timestamp = datetime.datetime.now()
re_job.status = 'initializing'
re_job.record_count = 0
re_job.finished = False
re_job.elapsed = 0
re_job.url = None
re_job.deleted = True
re_job.save()
```
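A rough sketch of how that reset could live on the method itself, behind a flag, so neither the clone nor the rerun path needs the view (the method name, flag name, and self.job attribute are assumptions, not the actual API):

```python
import datetime

def rerun(self, reset_attributes=True):
    # Hypothetical: let rerun (and, by extension, clone) reset the
    # Job's bookkeeping attributes itself instead of relying on the
    # rerun_jobs route to do it.
    if reset_attributes:
        self.job.timestamp = datetime.datetime.now()
        self.job.status = 'initializing'
        self.job.record_count = 0
        self.job.finished = False
        self.job.elapsed = 0
        self.job.url = None
        self.job.deleted = True
        self.job.save()
    # ...then submit the Spark work as the existing rerun does
```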
ghukill commented 6 years ago

Also, a flag for clone() to automatically rerun the cloned Job. To that end, should it drop the Records? Probably, as carrying them over would be inaccurate. And, if so, it might automatically handle these things as well?
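One possible shape for that flag, sketched with hypothetical helper names:

```python
def clone(self, rerun=False):
    # Hypothetical: copy this Job's settings, input Jobs / Harvest
    # parameters, then optionally rerun the clone immediately.
    new_cjob = self._copy_job_settings()  # assumed helper, not real API
    if rerun:
        # carrying the origin's Records over would be inaccurate,
        # so drop them before the rerun repopulates the Job
        new_cjob.job.remove_records()     # assumed helper, not real API
        new_cjob.rerun(reset_attributes=True)
    return new_cjob
```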

ghukill commented 6 years ago

POC working. Next to address: downstream Jobs.

Cloning would be particularly helpful if it could clone downstream Jobs as well. To do this, would need to clone those Jobs, but rewrite JobInput links and input_jobs attribute from job_details to reflect the new Job ID of the cloned origin (and on down the line).

In a situation like the following, where cloning j2:

Harvest Job (j1) --> Transform (j2) --> Merge (j3)

Should result in:

```
Harvest Job (j1) --> Transform (j2) --> Merge (j3)
                         |
                         V
                     Transform (Clone) (j4) --> Merge (Clone) (j5)
```

Approach:

ghukill commented 6 years ago

Progress on downstream cloning, but encountering a scenario where E (Clone) should have an input Job of C (Clone), not C:

[screenshot: cloning_lineage]

The problem comes from the simplistic -1 counting used to find the parent clone, which only works when Jobs are a single stream from the origin.

A better approach might be to get all parents (JobInput links) for a Job and add them to a mapping, e.g.:

```python
clone_hash = {
    'C': 'C (Clone)'  # but these will be Job ids, e.g. 641: 644
}
```

Then, when encountering any JobInput link, check the hash to see if the original parent has a clone, and use that instead.

Should look like this:

[screenshot: cloning_lineage_correct]
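A sketch of that mapping approach, which also covers the JobInput / input_job_ids rewrite described earlier. Model and attribute names follow the thread (Job, JobInput, job_details, input_job_ids, get_downstream_lineage()), but the exact ORM relationships, the per-Job clone() helper, and the job_details accessors are assumptions:

```python
from core.models import JobInput  # assumed import path

def clone_with_downstream(origin_cjob):
    # map of original Job id -> cloned Job id, e.g. {641: 644}
    clone_hash = {}

    # clone the origin and every Job downstream of it
    # (assumes get_downstream_lineage() excludes the Job itself)
    to_clone = [origin_cjob.job] + origin_cjob.job.get_downstream_lineage()
    clones = []
    for job in to_clone:
        clone = job.clone()  # assumed per-Job clone helper
        clone_hash[job.id] = clone.id
        clones.append(clone)

    # rewrite each clone's JobInput links: if the original input Job
    # was itself cloned, point at the clone instead of the original
    for clone in clones:
        for job_input in JobInput.objects.filter(job=clone):
            original_input_id = job_input.input_job_id
            if original_input_id in clone_hash:
                job_input.input_job_id = clone_hash[original_input_id]
                job_input.save()

        # keep job_details.input_job_ids in sync with the new links
        details = clone.job_details_dict()  # assumed accessor
        details['input_job_ids'] = [
            clone_hash.get(i, i) for i in details.get('input_job_ids', [])]
        clone.update_job_details(details)   # assumed setter
```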

ghukill commented 6 years ago

Might also consider moving the alteration of JobInput and input_job_ids to a method, like alter_input_job_link().
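Something along these lines, presumably; the signature and the job_details accessors here are guesses:

```python
def alter_input_job_link(self, job_input, new_input_job):
    # Hypothetical: re-point an existing JobInput link at a different
    # input Job, keeping job_details.input_job_ids consistent.
    old_id = job_input.input_job_id
    job_input.input_job = new_input_job
    job_input.save()

    details = self.job_details_dict()  # assumed accessor
    details['input_job_ids'] = [
        new_input_job.id if i == old_id else i
        for i in details.get('input_job_ids', [])]
    self.update_job_details(details)   # assumed setter
```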

ghukill commented 6 years ago

Cloning now supports downstream and rerunning of cloned Jobs.

Example of cloning A, which creates an entirely new pipeline: (screenshot)

Example of cloning a single Job C, without downstream: (screenshot)

Example of cloning C, with downstream, picking up the unique filters applied to those Jobs: (screenshot)

Rerunning was problematic at first, as Spark had race conditions with the newly created Jobs in the MySQL db. It connects from a completely different thread/context than Django, or even the background tasks, and appears not to be bound by the same read/lock transactions. This is both good and bad. The solution was to poll until the Job is created, timing out after 60 seconds.
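A minimal sketch of that polling guard, run before touching the Job from the Spark side; the import path and function name are illustrative, and only the 60-second timeout comes from the description above:

```python
import time

from core.models import Job  # assumed import path

def wait_for_job(job_id, timeout=60, interval=1):
    # Spark reads MySQL outside Django's transaction and may not see
    # the freshly created Job row yet, so poll until it shows up.
    waited = 0
    while waited < timeout:
        job = Job.objects.filter(id=job_id).first()
        if job is not None:
            return job
        time.sleep(interval)
        waited += interval
    raise RuntimeError('Job %s not visible after %s seconds' % (job_id, timeout))
```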

Preparing to merge into dev after some more testing, and downstream additions to deleting and moving Jobs.

ghukill commented 6 years ago

Closing, merging to dev.

ghukill commented 6 years ago

Reopening: if cloning multiple Jobs, including downstream, there is potential for clones themselves being cloned:

(screenshot)

It's hard to imagine a scenario where a newly created clone should, itself, be cloned. This is likely due to looping through the selected Jobs and running cjob.job.get_downstream_lineage() for each. When B is cloned, it creates a D (Clone), which has a tether to C. Then, when C is cloned, it sees this new downstream Job and thinks it should be cloned as well.

What should likely happen during cloning is to make sure the Job to be cloned has not already been cloned, which could be determined by checking against the clones dictionary.
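i.e., something like the following, inside the loop over selected Jobs (names are illustrative):

```python
for job in jobs_to_clone:  # selected Jobs plus their downstream lineage
    # skip Jobs that already have a clone from this pass, and skip
    # Jobs that are themselves newly created clones
    if job.id in clones or job.id in clones.values():
        continue
    clone = clone_job(job)  # assumed helper
    clones[job.id] = clone.id
```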

This would have produced the following lineage, which is the desired outcome:

(screenshot)

ghukill commented 6 years ago

Fixed. Closing (again).