MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

Re-run Job, and all downstream Jobs #280

Closed: ghukill closed this issue 5 years ago

ghukill commented 5 years ago

This would introduce an interesting, and perhaps very central, new function to Combine: the ability to re-run Jobs "in-place".

Currently, the data model encourages running new Jobs to re-harvest from endpoint foo. But this has the disadvantage of requiring users to re-configure input filters, field mapping, transformations, validations, etc. Re-running in-place would encourage a dataflow mentality, where Jobs are more like nodes set up in a pipeline.

The components for this are largely in place:

Some thoughts:

One thing that might be difficult is merge lineage. Imagine the following situation:

[merge_jobs diagram]

When would j5 get fired? It's more complex than "jx was an input job for jy, so fire jx before jy," but it seems like it's something that could be figured out for a single Job and all its downstream "lineage".
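As a rough illustration of that ordering problem, here is a minimal sketch (not Combine's actual code; the `Job` objects, `id` and `input_job_ids` attributes, and helper function are hypothetical) that collects a Job plus everything downstream of it in the lineage graph, then topologically sorts that subgraph so every input Job fires before the Jobs that consume it:

```python
# Hypothetical sketch: determine re-run order for a Job and its downstream lineage.
from collections import defaultdict, deque

def downstream_rerun_order(start_job, all_jobs):
    """
    start_job: the Job being re-run
    all_jobs: iterable of Jobs, each with .id and .input_job_ids (assumed attributes)
    Returns Job ids in a valid re-run order (inputs before consumers).
    """
    # build forward edges: input Job -> Jobs that consume it
    consumers = defaultdict(list)
    for job in all_jobs:
        for input_id in job.input_job_ids:
            consumers[input_id].append(job.id)

    # collect the start Job plus everything downstream of it
    affected = set()
    queue = deque([start_job.id])
    while queue:
        jid = queue.popleft()
        if jid in affected:
            continue
        affected.add(jid)
        queue.extend(consumers[jid])

    # topological sort (Kahn's algorithm) restricted to the affected subgraph
    jobs_by_id = {job.id: job for job in all_jobs}
    indegree = {
        jid: sum(1 for i in jobs_by_id[jid].input_job_ids if i in affected)
        for jid in affected
    }
    ready = deque([jid for jid, d in indegree.items() if d == 0])
    order = []
    while ready:
        jid = ready.popleft()
        order.append(jid)
        for consumer in consumers[jid]:
            if consumer in affected:
                indegree[consumer] -= 1
                if indegree[consumer] == 0:
                    ready.append(consumer)
    return order
```

In the merge scenario above, a Job like j5 would only appear in the returned order after every one of its input Jobs that is also being re-run has fired.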

This would provide a means of setting up a "pipeline" like the following:

Four Jobs in total, with a handful of configurations. Then, when it's determined there should be an update from the OAI endpoint, that Job would expose a "re-run" or "re-run Job stream" action, something to that effect.

ghukill commented 5 years ago

Largely done in the jobrerun branch.

Before merging to dev, looking at what options might be available for re-running a Job while tweaking its parameters. If new validations are run, or the Job is re-indexed, those parameters will "stick" for the re-run (as all operate from job_details). But Jobs also have input filters, RITS settings, etc., that could all be updated on re-run. And in fact they can be, if job_details is manually updated.

But this needs a GUI.
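Until then, a manual tweak of job_details before a re-run might look roughly like the following. This is a hedged sketch only: the Job model's import path, the exact job_details keys, and the assumption that job_details is stored as a JSON text field are all illustrative, not Combine's confirmed API.

```python
# Hypothetical sketch of manually updating job_details ahead of a re-run.
import json
from core.models import Job  # assumed import path

job = Job.objects.get(pk=42)

# job_details assumed to be a JSON-serialized dict of Job parameters
details = json.loads(job.job_details)
details['validation_scenarios'] = [1, 3]  # e.g. swap in different validation scenarios
details['rits'] = 2                       # e.g. point at a different RITS config
job.job_details = json.dumps(details)
job.save()

# the updated parameters would then be picked up the next time the Job is re-run
```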

The job_optional_processing.html template contains just about precisely the settings that would be included, and might be a great option for exposing these settings again. In many ways, it does not make sense to alter these parameters unless the Job is being re-run.

The only outlying scenario might be a very large Job that is part of a "pipeline". A user may not want to re-run the Job at that moment, but rather tweak settings in preparation for a pipeline re-run. In this scenario, it would be advantageous to have the parameters editable, with the understanding that they would not take effect until a re-run.

The disadvantage of this would be altering configurations without the Job reflecting them: if the pipeline is never re-run, the stored parameters are incorrect and misleading.

ghukill commented 5 years ago

Close to merging with dev, but holding off until a pathway is established for static Jobs. The approach may be, when uploads are used (which is common), to save payload_dir somewhere other than /tmp?

Works as-is, but if the server reboots, static harvests looking for payload_dir in /tmp will find it gone.
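One possible way to handle that, sketched below under stated assumptions (the persistent root directory, the `payload_dir` key in job_details, and the `persist_payload` helper are all hypothetical), is to copy the uploaded payload out of /tmp and point job_details at the new location:

```python
# Hypothetical sketch: move a static-harvest payload to a reboot-safe location.
import json
import os
import shutil

PERSISTENT_ROOT = '/opt/combine/static_payloads'  # assumed persistent directory

def persist_payload(job):
    details = json.loads(job.job_details)
    tmp_dir = details.get('payload_dir')
    if tmp_dir and tmp_dir.startswith('/tmp'):
        dest = os.path.join(PERSISTENT_ROOT, str(job.id))
        shutil.copytree(tmp_dir, dest)       # copy payload out of /tmp
        details['payload_dir'] = dest        # repoint the Job at the new location
        job.job_details = json.dumps(details)
        job.save()
```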

ghukill commented 5 years ago

Static Jobs: addressed.

ghukill commented 5 years ago

Preparing to merge to dev.

Note: a form GUI has not been implemented for updating Job parameters -- job_details -- but a JSON editor is available under "Job Parameters" that can be used in a pinch. If there is a need for that ability it can be added later, but this allows for advanced tweaking of Job params (which will be used in a re-run) if need be.

Closing.