
Twine

A workflow engine for processing RDF in customisable ways.

Apache 2.0 licensed

Summary

Twine is a harness for running RDF processing modules. Non-trivial use of Twine depends upon plug-ins, which can register two primary types of module: input handlers, which take incoming data and convert it into an in-memory RDF graph model; and processors, which transform and/or export that model. All of the processing steps operate on sets of RDF graphs. A linear chain of one or more such processors constitutes a workflow. Inputs are invoked as needed, with data taken either from a Twine-managed message queue or, if using the Twine CLI, read from a file or standard input. Finally, there are also update modules; however, these are currently undocumented.

Input handlers

Which input module is used to interpret an incoming message is determined by the message's associated MIME type. Upon loading, each plug-in may register one or more MIME types that it can handle; when invoked, the matching input module produces a data object which is sent into the workflow. The message may contain anything that the input module can interpret in order to provide RDF: a direct serialisation, a reference to where the RDF data can be obtained, or some other flag meaningful to the module.

Plug-ins are also able to register bulk input modules. If handling a given MIME type can result in multiple RDF graphs, a plug-in should register a bulk input module rather than a regular input module to parse those source data (see the geonames plug-in for an example). Each resultant graph is then sent through the workflow independently.

Processors

Processors are the heart of the Twine system. Use processors to add or remove triples from a graph, make inferences on the data (based on what's already in the graph, or from external cues), export the data to storage, use it as input to other executables, or perform any other activity. There are two built-in processors which are always available, sparql-get and sparql-put, which read from and write to the SPARQL store described in the active configuration file. Note that sparql-get is a processor, and as such, does not read from the message queue. It will discard the results of any previous processing and replace them with the RDF returned by the SPARQL store.
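
As an illustration, sparql-get can be used to seed the data object with whatever the store already holds for a graph before further processing, and sparql-put to write the result back. The snippet below is a sketch only; example-enrich stands in for a hypothetical processor registered by a plug-in:

[twine]
;; Fetch the graph currently held in the SPARQL store (discarding any
;; earlier processing results), apply a hypothetical plug-in processor,
;; then write the updated graph back to the store.
workflow=sparql-get,example-enrich,sparql-put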

Workflows

A workflow is the ordered list of processors that a piece of data will pass through: the data object is handed from one processor callback to the next until one fails or the final one succeeds. A workflow is defined by the workflow configuration setting in the active Twine configuration file, which names and sequences the processors to be used by that Twine instance. To support multiple workflows, you need to run multiple instances of Twine.
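
Because the workflow is ordinary configuration, a one-off run of the command-line utility can also override it with -D, in the same way as the GeoNames example later in this README. A minimal sketch, assuming the rdf plug-in is loaded and registers the standard application/trig MIME type (data.trig is a placeholder filename):

$ twine -D twine:workflow=sparql-put -t application/trig data.trig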

Message queue

Messages are inserted into the input queue either during process invocation, by Twine plug-ins, or from outside via inter-process communication. The Advanced Message Queuing Protocol (AMQP) is used. Each invocation of Twine uses its own message queue.
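
How the queue is addressed depends on your configuration. The sketch below is illustrative only: both the option name and the URI are assumptions, so consult the example configuration shipped with Twine for the exact syntax used by your version:

[twine]
;; ASSUMPTION: the AMQP connection is set with a single URI-style option
;; in the [twine] section; the key name and value here are illustrative.
mq=amqp://localhost/amq.direct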

Plug-ins

As mentioned, plug-ins can perform various different functions within Twine. In principle, they can perform any computation upon RDF data that one might conceive of; in practice, they typically register input handlers, which turn incoming messages into RDF graphs, and processors, which transform, augment, or export those graphs.

Versatility is maximised when each module performs only one such task: to meet complex needs, a plug-in should register several small modules, which can then be arranged into a workflow.

Twine comes with four plug-ins.

rdf

The rdf plug-in parses messages containing RDF serialised as TriG or N-Quads. The resulting graphs are then available for transformation by other modules, or simply for storing (for example via the built-in sparql-put processor). It also registers a processor module called dump-nquads, which writes N-Quads-serialised RDF to stdout.
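
For instance, with the rdf plug-in loaded, the Twine CLI can convert a TriG file into N-Quads in much the same way as the GeoNames example later in this README. This is a sketch: it assumes the plug-in registers the standard application/trig MIME type, and data.trig is a placeholder filename:

$ twine -D twine:workflow=dump-nquads -t application/trig data.trig > data.nq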

s3

The s3 plug-in registers a single input module which handles messages with a type of application/x-s3-url, which are expected to contain a single URI in the form s3://BUCKETNAME/RESOURCEKEY. The URI is resolved to an HTTP URL using the S3 endpoint and credentials specified in the Twine configuration file; that URL is then fetched, and the response body is passed as-is back to Twine for processing, as though the S3 resource's contents had been present in the message queue.
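
A message handled by this module is therefore just a single line such as s3://archive-bucket/resources/example.trig (the bucket and key are purely illustrative). The configuration sketch below is an assumption as well: the section and option names may differ, so consult the example configuration shipped with Twine for the exact keys:

[s3]
;; ASSUMPTION: the section and key names here are illustrative only
endpoint=s3.example.com
access=EXAMPLEACCESSKEY
secret=EXAMPLESECRETKEY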

xslt

The xslt plug-in provides a configurable input module which applies XSLT stylesheets to transform arbitrary XML documents into RDF/XML, which Twine handles natively. The plug-in registers itself as being able to handle the MIME types specified in the [xslt] section of the Twine configuration file, and applies the corresponding stylesheet to each incoming document.

An example stylesheet and source XML document are provided as part of the plug-in.
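
The exact syntax of the [xslt] section should be taken from that bundled example; the sketch below simply assumes a mapping from a handled MIME type to a stylesheet path, and every name in it is illustrative:

[xslt]
;; ASSUMPTION: each entry maps a MIME type the plug-in should handle to
;; the stylesheet applied to documents of that type; the key format and
;; paths here are illustrative only.
application/x-example+xml=/usr/local/share/twine/example.xsl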

geonames

The geonames plug-in accepts data in the GeoNames RDF dump format, which consists of a repeating sequence of line pairs: the URL of a GeoNames resource on one line, followed by the corresponding RDF/XML document serialised on the next.

For each pair, the plug-in transforms the GeoNames URL into a graph URI by appending about.rdf, then reads the data on the following line into a named graph identified by that URI. The resulting models are pushed one at a time into Twine for onward processing or storage.

With the geonames and rdf plug-ins both loaded, a GeoNames dump can be converted to N-Quads with:

$ twine -D twine:workflow=dump-nquads -t text/x-geonames-dump all-geonames-rdf.txt > geonames.nq

History

Twine was originally written to receive data in multiple formats, and with a small amount of format-specific code (or an XSLT stylesheet), push an RDF representation of that data into a graph store via SPARQL. It can still be used for this purpose via the sparql-put workflow processor.

Requirements

Twine requires the following common libraries, which are either available out-of-the-box or can be installed using native package management on most Unix-like operating systems:

Note that in order to process source data in TriG or N-Quads formats correctly, you must use an up-to-date version of librdf. If you have an older version installed on your system, Twine will compile correctly, but will potentially not be able to locate any named graphs in the parsed RDF quads when attempting to import them.

Twine also requires the following libraries, which are available from the BBC Archive Development GitHub account:

Finally, building Twine requires working GNU autotools (make, autoconf, automake, libtool, and so on) as well as a C compiler and your operating system’s usual developer packages.

Building from source

If building from a Git clone, you must first run:

$ git submodule update --init --recursive
$ autoreconf -i

Then, you can configure and build Twine in the usual way:

$ ./configure --prefix=/opt/twine
$ make
$ sudo make install

Use ./configure --help to see a list of available configure options. As an Automake-based project, Twine supports building in a separate directory from the sources and installing to a staging area (via make install DESTDIR=/some/path).
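
For example, an out-of-tree build followed by a staged install (the paths are illustrative) might look like this:

$ mkdir build && cd build
$ ../configure --prefix=/opt/twine
$ make
$ make install DESTDIR=/tmp/twine-stage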

Source tree structure

Changes

Significant changes as of Twine 7.x are described below.

Configuration file structure

The configuration file structure has been simplified, although older settings will be used (emitting a warning) if they are present.

The majority of Twine’s configuration is now in the [twine] section, whose options apply to the twine command-line utility, the twine-writerd daemon, and the twine-inject tool alike. Individual options can be overridden by adding them to the specific [twine-cli], [writerd] or [inject] sections. Plug-ins will continue to use their own sections.
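
As a sketch, a workflow shared by all of the tools might be set in [twine] and overridden just for the writer daemon; example-enrich below is an illustrative plug-in processor, not one shipped with Twine:

[twine]
workflow=sparql-put

[writerd]
;; twine-writerd applies an additional (illustrative) processor before storage
workflow=example-enrich,sparql-put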

See the configuration file example for further details.

Configurable workflows

Earlier versions of Twine applied a fixed workflow to each RDF graph being processed: any registered pre-processors were invoked, the graph was stored via a SPARQL PUT, and then any registered post-processors were invoked.

As of Twine 7.x, pre- and post-processors have been deprecated, with module authors encouraged to implement generic graph processors instead. The fixed in-built workflow is now customisable in the configuration file, although the defaults are such that Twine will continue to apply the workflow described above until configured not to.

A workflow configuration consists simply of a comma-separated list of graph processors (usually registered by plug-ins) to apply to ingested RDF data. By default, the list is equivalent to the fixed workflow described above: the deprecated pre-processing step, the built-in sparql-put processor, and the deprecated post-processing step.

The workflow can be altered by specifying a workflow=... value in the [twine] configuration section:—

[twine]
;; A simple workflow which performs no processing and simply PUTs RDF
;; graphs to a SPARQL server
workflow=sparql-put

;; A more complex workflow which passes graphs through a series of
;; loaded processors that manipulate the data, before storing the graph
;; via SPARQL PUT and finally invoking an additional indexing processor.
workflow=myplugin-rearrange,anotherplugin-process,sparql-put,elasticsearch-indexer

Note that the graph processors actually available depend upon the modules you have loaded, and the above names are examples only.

Clustering

Twine now has the ability to operate as part of a libcluster-based cluster. While this does not meaningfully affect the operation of Twine itself, the cluster and node status information is made available to plug-ins implementing message queues and graph processors. For example, a message queue implementation might use the node details to filter inbound messages from the queue in order to balance load across the cluster.

Cluster configuration is specified in the [twine] configuration section. If absent, Twine will configure itself to be part of a single-node (i.e., standalone) cluster.

The default values are as follows:—

[twine]
cluster-name=twine
cluster-verbose=no
node-index=0
cluster-size=1
environment=production
; registry=<registry URI>
; node-id=<some unique identifier>

To use a registry service, remove the node-index and cluster-size options and add a registry URI instead:

[twine]
cluster-name=twine
cluster-verbose=no
environment=live
registry=http://registry:2323/

API changes

The libtwine API has been reorganised with the aim of making it work more consistently and supporting new feature enhancements. Twine will warn when a plug-in is loaded which uses the older APIs, and binary compatibility will be maintained for the foreseeable future. Source compatibility is currently being preserved (emitting compiler warnings where possible), but a future release will require the explicit definition of a macro in order to continue to make use of the deprecated APIs. Eventually the deprecated API prototypes will be removed from the libtwine.h header altogether.

Contributing

To contribute to Twine, fork this repository and commit your changes to the develop branch. For larger changes, you should create a feature branch with a meaningful name, for example one derived from the issue number.

Once you are satisfied with your contribution, open a pull request and describe the changes you’ve made.

License

Twine is licensed under the terms of the Apache License, Version 2.0.

Copyright © 2014-2017 BBC.