bbcarchdev / twine

An RDF workflow engine
https://bbcarchdev.github.io/twine/
Apache License 2.0

Job tracking #31

Open nevali opened 7 years ago

nevali commented 7 years ago

Optionally support a libsql connection URI which will be used to track jobs as they are processed by twine-writerd or twine-cli.

A job consists of:

Where possible, UUIDs should be taken from the source, if it incorporates one into its identification; otherwise they should be generated on the fly.

A job stack should be maintained internally to libtwine in order to track parent/child relationships, rather than requiring it to be made explicit.
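A minimal sketch of how such an internal stack might look, assuming a fixed nesting limit and a bare 16-byte identifier in place of a real job object. None of these names (`job_id`, `job_stack`, `job_stack_push()` and so on) exist in libtwine; they are purely illustrative. The point is that the parent of a new job is simply whatever is on top of the stack when the job begins, so callers never have to pass it explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define JOB_STACK_MAX 16 /* hypothetical nesting limit */

/* Hypothetical stand-in for a job: only its 16-byte UUID matters here */
typedef struct
{
	unsigned char uuid[16];
} job_id;

/* Jobs currently in progress, outermost first */
typedef struct
{
	job_id jobs[JOB_STACK_MAX];
	size_t depth;
} job_stack;

/* Compare two job identifiers byte-for-byte */
int
job_id_equal(const job_id *a, const job_id *b)
{
	return memcmp(a->uuid, b->uuid, sizeof(a->uuid)) == 0;
}

/* Return the innermost job in progress (the implicit parent of whatever
 * begins next), or NULL if no job is in progress
 */
const job_id *
job_stack_current(const job_stack *stack)
{
	assert(stack != NULL);
	if(stack->depth == 0)
	{
		return NULL;
	}
	return &(stack->jobs[stack->depth - 1]);
}

/* Begin a job: report its implicit parent through *parent (NULL for a
 * top-level job) and make the new job the current one. The reported
 * pointer stays valid until the parent itself is popped.
 * Returns 0 on success, -1 if the stack is full.
 */
int
job_stack_push(job_stack *stack, const job_id *job, const job_id **parent)
{
	assert(stack != NULL && job != NULL);
	if(stack->depth >= JOB_STACK_MAX)
	{
		return -1;
	}
	if(parent != NULL)
	{
		*parent = job_stack_current(stack);
	}
	stack->jobs[stack->depth] = *job;
	stack->depth++;
	return 0;
}

/* End the current job; returns -1 if no job is in progress */
int
job_stack_pop(job_stack *stack)
{
	assert(stack != NULL);
	if(stack->depth == 0)
	{
		return -1;
	}
	stack->depth--;
	return 0;
}
```

Under this arrangement a processing run would push its overarching job first, then push and pop one job per item, so each item picks up the run as its parent without being told about it.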

As an example, an ingest of N-Quads from a file, processed with spindle-correlate, might yield the following:

As spindle-generate later processes its queue of items, it performs the following:

With this arrangement, a small number of relatively simple SQL queries can provide progress tracking and volumetrics across a processing cluster.

Open question: how would Twine know when to preserve versus replace the parent of a job?

Perhaps it could be as simple as user action (i.e., twine-cli) taking precedence over an ongoing process: a queue-driven twine-writerd would only set the parent of a job if the job is newly created, whereas twine-cli would always override it. Both would create an overarching job for their processing runs, whether that's from a file or a queue.
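The rule proposed above could be sketched as a single decision function. This is only an illustration under stated assumptions: `job_resolve_parent()` is a hypothetical helper, parents are represented as plain strings rather than UUIDs, and the `TJP_PRESERVE`/`TJP_FORCE` names are borrowed from the interface sketched later in this thread, with the guess that twine-writerd would run in preserve mode and twine-cli in force mode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the parentage modes from the sketched libtwine interface */
typedef enum
{
	TJP_PRESERVE,
	TJP_FORCE
} TWINEJOBPARENTAGE;

/* Hypothetical helper: decide which parent should be stored for a job.
 *
 * mode            parentage mode of the process doing the work
 * row_exists      whether a row for the job's UUID is already stored
 * stored_parent   parent recorded in that row (may be NULL)
 * current_parent  overarching job of the current processing run
 *
 * In force mode, or when the job is newly created, the current run
 * becomes the parent; otherwise whatever is stored is left alone.
 */
const char *
job_resolve_parent(TWINEJOBPARENTAGE mode, bool row_exists,
	const char *stored_parent, const char *current_parent)
{
	assert(mode == TJP_PRESERVE || mode == TJP_FORCE);
	if(mode == TJP_FORCE || !row_exists)
	{
		return current_parent;
	}
	return stored_parent;
}
```

One consequence of reading the rule this strictly is that an existing job with no recorded parent stays parentless in preserve mode; whether that is desirable is part of the open question.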

Tracked as RESDATA-1279

nevali commented 7 years ago

Sketched interface to be implemented as part of libtwine to support this functionality:

typedef /*opaque*/ struct twine_job_struct TWINEJOB;
typedef enum
{
  TJS_WAITING,
  TJS_ACTIVE,
  TJS_ABORTED,
  TJS_COMPLETED,
  TJS_FAILED,
  TJS_ERRORS
} TWINEJOBSTATUS;

typedef enum
{
  TJP_PRESERVE,
  TJP_FORCE
} TWINEJOBPARENTAGE;

/* This is a relatively low-level libtwine API: its side-effects are limited to
 * twine_job_create() creating or updating rows depending upon the parentage
 * mode of the current parent job and whether a row for that UUID exists or not.
 */
TWINEJOB *twine_job_create(const uuid_t uuid, const char *restrict uri, CLUSTER *restrict /*optional*/ cluster);
int twine_job_close(TWINEJOB *job);
const char *twine_job_uristr(TWINEJOB *job);
int twine_job_set_uristr(TWINEJOB *restrict job, const char *restrict uri);
/* NB: possibly require URI and librdf_uri variants of the above */
int twine_job_set_parentage(TWINEJOB *job, TWINEJOBPARENTAGE mode);
int twine_job_update(TWINEJOB *restrict job, TWINEJOBSTATUS status, const char *restrict /*optional*/ annotation);
int twine_job_set_progress(TWINEJOB *job, int /*optional*/ current, int /*optional*/ total);
/* NB: twine_job_set_progress() uses -1 as a sentinel to indicate NULL integer values;
 * these will cause the job status to be left unchanged: twine_job_set_progress(job, -1, -1);
 * is therefore a no-op
 */
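
To illustrate the sentinel convention in the note above, here is a rough sketch of how the two progress values might be merged into stored state. It reads the note as meaning that a -1 leaves the corresponding stored value untouched, which is an interpretation rather than something the interface spells out. `job_progress`, `JOB_PROGRESS_UNSET` and `job_progress_apply()` are hypothetical names and not part of libtwine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name for the -1 sentinel meaning "no value supplied" */
#define JOB_PROGRESS_UNSET (-1)

/* Hypothetical in-memory copy of a job's progress columns */
typedef struct
{
	int current;
	int total;
} job_progress;

/* Merge new progress values into *progress, skipping any argument equal
 * to the sentinel. Returns the number of values that were supplied, so a
 * caller could skip the database update entirely when it is zero.
 */
int
job_progress_apply(job_progress *progress, int current, int total)
{
	int supplied = 0;

	assert(progress != NULL);
	if(current != JOB_PROGRESS_UNSET)
	{
		progress->current = current;
		supplied++;
	}
	if(total != JOB_PROGRESS_UNSET)
	{
		progress->total = total;
		supplied++;
	}
	return supplied;
}
```

With this reading, a worker that knows only how many items it has finished can pass -1 for the total and leave an earlier estimate in place.
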
nevali commented 7 years ago

Arguably the core state-tracking mechanism of this should be moved to bbcarchdev/libcluster itself, with Twine simply employing it.