Job tracking

nevali commented 7 years ago

Optionally support a libsql connection URI which will be used to track jobs as they are processed by twine-writerd or twine-cli.

A job consists of:

A UUID to identify it
Optionally, a parent UUID
A URI to identify it (which may simply be a urn:uuid: representation of the job UUID, if nothing else is suitable, otherwise it'll be the canonical source or target URI, depending upon the processing pipeline; workflow components may update it accordingly during processing)
Timestamps for added and updated
A status: WAITING, ACTIVE, ABORTED (by the user), COMPLETE, FAILED, ERRORS (partial failure)
A status annotation (free-text) which may be set to indicate the failure reason
If active, the cluster/instance details of the node processing the job (preserved for diagnosis once set)
Progress indicators of the form "processing item x of y" (particularly for bulk ingests from filesystem sources)
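As a rough illustration of the fields listed above, the record might be mirrored in memory along these lines. This is only a sketch: the type and field names (job_record, job_status and so on) are assumptions made for the example and are not part of libtwine or libsql.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical status values, one per status named in the list above */
typedef enum
{
	JOB_WAITING,
	JOB_ACTIVE,
	JOB_ABORTED,
	JOB_COMPLETE,
	JOB_FAILED,
	JOB_ERRORS
} job_status;

/* Hypothetical in-memory mirror of a single job row */
typedef struct
{
	unsigned char uuid[16];    /* job UUID */
	unsigned char parent[16];  /* parent UUID, valid only if has_parent is set */
	int has_parent;
	char uri[256];             /* identifying URI, may be a urn:uuid: form */
	time_t added;
	time_t updated;
	job_status status;
	char annotation[256];      /* free-text, e.g. a failure reason */
	char node[128];            /* cluster/instance details, kept once set */
	int current;               /* "item x", or -1 if unknown */
	int total;                 /* "of y", or -1 if unknown */
} job_record;

/* Map a status value to the name used in the list above */
const char *job_status_name(job_status status)
{
	switch(status)
	{
	case JOB_WAITING:  return "WAITING";
	case JOB_ACTIVE:   return "ACTIVE";
	case JOB_ABORTED:  return "ABORTED";
	case JOB_COMPLETE: return "COMPLETE";
	case JOB_FAILED:   return "FAILED";
	case JOB_ERRORS:   return "ERRORS";
	}
	return "UNKNOWN";
}
```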

UUIDs should, where possible, be taken from the source if it incorporates one into its identification, or generated on the fly if it does not.
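For the fallback case where nothing better than the UUID is available as a URI, a urn:uuid: form could be derived along these lines. A minimal sketch only: job_default_uri() is a hypothetical helper, and the UUID is taken as a plain 16-byte array so that the example does not depend on any particular UUID library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: write "urn:uuid:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
 * into buf. Returns 0 on success, or -1 if buf is too small.
 */
int job_default_uri(const unsigned char uuid[16], char *buf, size_t buflen)
{
	static const char hex[] = "0123456789abcdef";
	size_t i, pos;

	/* 9 bytes of prefix, 36 of UUID text, and a terminating NUL */
	if(buflen < 46)
	{
		return -1;
	}
	memcpy(buf, "urn:uuid:", 9);
	pos = 9;
	for(i = 0; i < 16; i++)
	{
		/* Hyphens separate the 8-4-4-4-12 groups */
		if(i == 4 || i == 6 || i == 8 || i == 10)
		{
			buf[pos++] = '-';
		}
		buf[pos++] = hex[uuid[i] >> 4];
		buf[pos++] = hex[uuid[i] & 0x0f];
	}
	buf[pos] = '\0';
	return 0;
}
```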

A job stack should be maintained internally to libtwine in order to track parent/child relationships, rather than requiring it to be made explicit.
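One way such a stack might look is sketched below. Everything here (job_stack, the function names, the fixed depth limit) is assumed for illustration; the idea is simply that the job on top of the stack becomes the implicit parent of whatever job is created next, and is popped again when it is closed.

```c
#include <stddef.h>
#include <string.h>

#define JOB_STACK_MAX 32 /* arbitrary limit chosen for this sketch */

/* Hypothetical stack of the UUIDs of currently-open jobs */
typedef struct
{
	unsigned char uuids[JOB_STACK_MAX][16];
	size_t depth;
} job_stack;

void job_stack_init(job_stack *stack)
{
	stack->depth = 0;
}

/* UUID of the innermost open job (the implicit parent of the next
 * job to be created), or NULL if no job is open
 */
const unsigned char *job_stack_current(const job_stack *stack)
{
	return stack->depth ? stack->uuids[stack->depth - 1] : NULL;
}

/* Called when a job is created: it becomes the current job */
int job_stack_push(job_stack *stack, const unsigned char uuid[16])
{
	if(stack->depth >= JOB_STACK_MAX)
	{
		return -1;
	}
	memcpy(stack->uuids[stack->depth], uuid, 16);
	stack->depth++;
	return 0;
}

/* Called when a job is closed: its parent becomes current again */
int job_stack_pop(job_stack *stack)
{
	if(!stack->depth)
	{
		return -1;
	}
	stack->depth--;
	return 0;
}
```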

As an example, an ingest of N-Quads from a file, processed with spindle-correlate, might yield the following:

A job is created in state WAITING with a newly-generated UUID and a file:/// URI
The N-Quads are parsed and the number of graphs determined; the job is updated to state ACTIVE, with progress set to 0 of number-of-graphs
For each graph that is correlated by Spindle, progress is updated, and a new child job is created in state WAITING, using the Spindle-generated UUID and URI
Once processing of the N-Quads is complete, the job status is updated to COMPLETE
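The four steps above can be walked through with a small in-memory simulation. This is a sketch under stated assumptions: the sketch_ names are invented for the example, no database is involved, and the correlation work itself is elided, so only the state and progress transitions are modelled.

```c
#include <stddef.h>

#define MAX_CHILDREN 16 /* arbitrary limit chosen for this sketch */

typedef enum { ST_WAITING, ST_ACTIVE, ST_COMPLETE } sketch_status;

typedef struct
{
	sketch_status status;
	int current; /* -1 while unknown */
	int total;   /* -1 while unknown */
} sketch_job;

typedef struct
{
	sketch_job parent;
	sketch_job children[MAX_CHILDREN];
	size_t nchildren;
} sketch_ingest;

/* Walk the parent job through an ingest of ngraphs graphs.
 * Returns 0 on success, or -1 if ngraphs is out of range.
 */
int sketch_ingest_run(sketch_ingest *ingest, int ngraphs)
{
	int i;
	sketch_job *child;

	if(ngraphs < 0 || ngraphs > MAX_CHILDREN)
	{
		return -1;
	}
	/* 1. The job is created in state WAITING */
	ingest->parent.status = ST_WAITING;
	ingest->parent.current = -1;
	ingest->parent.total = -1;
	ingest->nchildren = 0;
	/* 2. Parsing is done and the graphs counted: ACTIVE, 0 of ngraphs */
	ingest->parent.status = ST_ACTIVE;
	ingest->parent.current = 0;
	ingest->parent.total = ngraphs;
	/* 3. Each correlated graph adds a WAITING child and bumps progress */
	for(i = 0; i < ngraphs; i++)
	{
		child = &(ingest->children[ingest->nchildren++]);
		child->status = ST_WAITING;
		child->current = -1;
		child->total = -1;
		ingest->parent.current = i + 1;
	}
	/* 4. Processing of the N-Quads is complete */
	ingest->parent.status = ST_COMPLETE;
	return 0;
}
```

Note that the parent reaches COMPLETE while its children are still WAITING; in the scheme described here they are only picked up later, by spindle-generate.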

As spindle-generate later processes its queue of items, it performs the following:

A job is created in state WAITING using the Spindle-generated UUID and URI; if it already exists, its parentage is preserved (thus, if the job originated from an ingest as described above, the proxy-generation step maintains the parent-child relationship, allowing for ready visualisation)
As the proxy is generated, its status is updated accordingly

With this arrangement, a small number of relatively simple SQL queries can result in progress tracking and volumetrics across a processing cluster.

Open question: how would Twine know when to preserve versus replace the parent of a job?

Perhaps it could be as simple as user action (i.e., twine-cli) taking precedence over an ongoing process: a queue-driven twine-writerd will only set the parent of a job if it's newly created, whereas twine-cli will always override it. Both would create an overarching job for their processing runs, whether that's from a file or a queue.
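Expressed as code, the proposed rule is small. The following is a sketch only; parentage_mode and should_set_parent() are hypothetical names, with "force" standing for the twine-cli behaviour and "preserve" for the queue-driven twine-writerd behaviour.

```c
/* Hypothetical parentage modes for this sketch */
typedef enum
{
	PARENTAGE_PRESERVE, /* queue-driven: keep any existing parent */
	PARENTAGE_FORCE     /* user-driven: always replace the parent */
} parentage_mode;

/* Returns 1 if the parent of the job should be (re)written, 0 if the
 * existing parentage should be left alone.
 */
int should_set_parent(parentage_mode mode, int job_already_exists)
{
	if(mode == PARENTAGE_FORCE)
	{
		return 1;
	}
	/* Preserve mode: only a newly-created job gets a parent assigned */
	return !job_already_exists;
}
```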

Tracked as RESDATA-1279

nevali commented 7 years ago

Sketched interface to be implemented as part of libtwine to support this functionality:

typedef /*opaque*/ struct twine_job_struct TWINEJOB;
typedef enum
{
  TJS_WAITING,
  TJS_ACTIVE,
  TJS_ABORTED,
  TJS_COMPLETED,
  TJS_FAILED,
  TJS_ERRORS
} TWINEJOBSTATUS;

typedef enum
{
  TJP_PRESERVE,
  TJP_FORCE
} TWINEJOBPARENTAGE;

/* This is a relatively low-level libtwine API: side-effects are limited to
 * twine_job_create() creating or updating rows depending upon the parentage
 * mode of the current parent job and whether a row for that UUID exists or not.
 */
TWINEJOB *twine_job_create(const uuid_t uuid, const char *restrict uri, CLUSTER *restrict /*optional*/ cluster);
int twine_job_close(TWINEJOB *job);
const char *twine_job_uristr(TWINEJOB *job);
int twine_job_set_uristr(TWINEJOB *restrict job, const char *restrict uri);
/* NB: possibly require URI and librdf_uri variants of the above */
int twine_job_set_parentage(TWINEJOB *job, TWINEJOBPARENTAGE mode);
int twine_job_update(TWINEJOB *restrict job, TWINEJOBSTATUS status, const char *restrict /*optional*/ annotation);
int twine_job_set_progress(TWINEJOB *job, int /*optional*/ current, int /*optional*/ total);
/* NB: twine_job_set_progress() uses -1 as a sentinel to indicate NULL integer values;
 * these will cause the corresponding stored value to be left unchanged:
 * twine_job_set_progress(job, -1, -1); is therefore a no-op
 */
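To make the sentinel behaviour concrete, here is a minimal sketch of how the progress update might behave. It operates on a stand-alone structure rather than a real TWINEJOB, and sketch_set_progress() is a hypothetical stand-in; rejecting negative values other than -1 is an assumption of the example, not something specified above.

```c
/* Hypothetical stand-in for the progress fields of a job */
typedef struct
{
	int current;
	int total;
} sketch_progress;

/* Update progress, treating -1 as "leave this value unchanged".
 * Returns 0 on success, or -1 if either argument is a negative
 * value other than the sentinel (an assumption of this sketch).
 */
int sketch_set_progress(sketch_progress *progress, int current, int total)
{
	if(current < -1 || total < -1)
	{
		return -1;
	}
	if(current != -1)
	{
		progress->current = current;
	}
	if(total != -1)
	{
		progress->total = total;
	}
	return 0;
}
```

With this shape, a caller iterating over graphs can pass -1 for the total on each step and only update the current item.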

nevali commented 7 years ago

Arguably the core state-tracking mechanism of this should be moved to bbcarchdev/libcluster itself, with Twine simply employing it.

bbcarchdev / twine

Job tracking #31