
Acropolis

The Research & Education Space software stack.


This “umbrella” project has been assembled to make it easier to maintain and run tests on all of the individual components which make up the Acropolis stack.

This software was developed as part of the Research & Education Space project and is actively maintained by a development team within BBC Design and Engineering. We hope you’ll find this project useful!


Requirements

In order to build Acropolis in its entirety, you will need (as reflected in the Debian package list below):—

* a C compiler toolchain (Clang or GCC);
* GNU Autotools (automake, autoconf, libtool) and pkg-config;
* development headers for Jansson, libcurl, libxml2, Redland (librdf), libltdl, libuuid, FastCGI, Qpid Proton, and the MySQL and PostgreSQL client libraries.

Optionally, you may also wish to install:—

* CUnit (used by the test suites);
* xsltproc and the DocBook XSL-NS stylesheets (used to build the documentation).

On a Debian-based system, the following should install all of the necessary dependencies:

$ sudo apt-get install -qq libjansson-dev libmysqlclient-dev libpq-dev libqpid-proton-dev libcurl4-gnutls-dev libxml2-dev librdf0-dev libltdl-dev uuid-dev libfcgi-dev automake autoconf libtool pkg-config libcunit1-ncurses-dev build-essential clang xsltproc docbook-xsl-ns

Acropolis has not yet been ported to non-Unix-like environments, and on macOS it will install as ordinary shared libraries rather than as a framework.

Much of it ought to build inside Cygwin on Windows, but this is untested.

Contributions for building properly with Visual Studio or Xcode, and so on, are welcome (provided they do not significantly complicate the standard build logic).

Using Acropolis

Once you have built and installed the Acropolis stack, you probably want to do something with it.

Acropolis consists of a number of different individual components, including libraries, command-line tools, web-based servers, and back-end daemons; these are described in the Components section below.

Note that this repository exists for development and testing purposes only: in a production environment, each component is packaged and deployed individually.

Components

Anansi

Anansi is a web crawler. It uses a relational database to track URLs that will be fetched, their status, and cache IDs. Anansi can operate in resizeable clusters of up to 256 nodes via libcluster.

Anansi has the notion of a processor: a named implementation of the “business logic” of evaluating resources that have been retrieved and using them to add new entries to the queue.

In the Research & Education Space, Anansi is configured to use the lod (Linked Open Data) processor, which implements this logic for resources published as Linked Open Data.

Twine

Twine is a modular RDF-oriented processing engine. It can be configured to do a number of different things, but its main purpose is to fetch some data, convert it if necessary, perform some processing, and then put it somewhere.

Twine can operate as a daemon, which will continuously fetch data to process from a queue of some kind (see libmq), or it can be invoked from the command-line to ingest data directly from a file.

Twine is extended through two different kinds of loadable module, which reside in ${libdir}/twine (by default /opt/res/lib/twine).

Twine ships with a number of modules for interacting with SPARQL servers, ingesting XML data via XSLT transforms, and parsing and serialising RDF. More information can be found in the Twine README.

Twine is always configured with a workflow: a list of processors which should be invoked in turn for each item of data being processed. Like all configuration options, the workflow can be specified on the command-line.
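As a rough sketch, a workflow might be declared in the configuration file along these lines. The section and key names below are assumptions rather than the documented schema, and the processor names are placeholders; consult the Twine README and the annotated files in the config directory for the real options:

; hypothetical twine.conf excerpt -- all names are illustrative only
[twine]
; processors are invoked in order for each item of data
workflow=rdf,spindle-correlate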

In the Research & Education Space, the Spindle project provides additional Twine modules which implement the key logic of the platform.

Quilt

Quilt is a Linked Data server designed to efficiently serve RDF data in a variety of serialisations, including templated HTML. Like Twine, Quilt is modular (see ${libdir}/quilt); in particular, modules provide engine implementations: the code responsible for populating an RDF model based upon the request parameters, with Quilt itself handling the serving of that model. The Spindle project includes a Quilt module which implements the Research & Education Space public API.

Spindle

Spindle is the core of the Research & Education Space. It includes three processor modules for Twine, among them spindle-correlate and spindle-generate, which are referred to below.

It also includes a module for Quilt, which uses the data from spindle-correlate and spindle-generate in order to provide the Research & Education Space API.

Running the stack

Annotated configuration files are provided in the config directory which should help get you started. By default, the components expect to find these files in /opt/res/etc, but this can be altered by specifying the --prefix or --sysconfdir options when invoking the top-level configure script.
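For example, to install under the default /opt/res prefix but read configuration from /etc/res instead (the paths here are purely illustrative):

$ ./configure --prefix=/opt/res --sysconfdir=/etc/res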

Requirements

You will need:

* a PostgreSQL database;
* an object store for the crawl cache: an S3-compatible service, RADOS, or a filesystem path;
* a SPARQL server;
* a message broker, if you intend to run the Twine daemon against a queue.

In production, the Research & Education Space uses PostgreSQL, RADOS, and 4store. It has been successfully used in development environments with FakeS3 and alternative SPARQL servers.

Running Anansi

Important! Do not run the Anansi daemon (crawld) without first carefully checking the configuration to ensure that it doesn’t simply start crawling the web unchecked. If you're using the lod processor, you can enforce restrictions on which URIs will be crawled through its configuration.

Anansi will use the PostgreSQL database you provide it to store the queue and cache state, and either an "S3" bucket (see above) or a filesystem path as a cache store, which will contain both metadata and the actual content of retrieved resources.
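For example, with a local PostgreSQL server and its standard client tools, you might create a database for the crawler like this (the database name is an arbitrary example):

$ createdb anansi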

Once configured, you can invoke crawld -t <URI> to perform a single fetch of a URI that you specify. Depending upon the processor and the resource itself, this may cause other URIs to be added to the crawler queue in the database, but the -t option will cause the crawld process to exit once the URI specified on the command-line has been fetched.
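For instance (assuming the default /opt/res prefix and that crawld, like twine-writerd, installs into sbin; the URI is a placeholder):

$ /opt/res/sbin/crawld -t http://example.com/dataset.rdf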

Running Twine

Twine itself can be configured in many different ways, but in the Research & Education Space, there are two kinds of Twine instance: “correlate” instances, which run the spindle-correlate processor, and “generate” instances, which run spindle-generate.

For development and testing, the easiest way to emulate the production configuration is to have two Twine configurations, one for each of the two instance types. You can then run the Twine daemon in the background performing the "generate" tasks, while invoking the Twine command-line utility using the correlate configuration to process N-Quads files on disk.

Based upon the sample configuration files, you should be able to do something like this:

$ sudo /opt/res/sbin/twine-writerd -c /opt/res/etc/twine-generate.conf
$ /opt/res/bin/twine -c /opt/res/etc/twine-correlate.conf some-data.nq

Note that using sudo isn't required if the twine-writerd PID file can be written by an unprivileged user.

Inside Acropolis

Information about the design of the stack, its principles of operation, and how to use the Research & Education Space can be found in our book for developers and collection-holders, Inside Acropolis.

The live production API endpoint for the Research & Education Space can be found at http://acropolis.org.uk/

Bugs and feature requests

If you’ve found a bug, or have thought of a feature that you would like to see added, you can file a new issue. A member of the development team will triage it and add it to our internal prioritised backlog for development—but in the meantime we welcome contributions and encourage forking.

Building from source

You will need git, automake, autoconf and libtool. Also see the Requirements section.

$ git clone https://github.com/bbcarchdev/acropolis.git
$ cd acropolis
$ git submodule update --init --recursive
$ autoreconf -i
$ ./configure --prefix=/some/path --enable-debug
$ make
$ make check
$ sudo make install

If you don’t specify an installation prefix to ./configure, it will default to /opt/res.

Important: The Acropolis repository incorporates its various components as Git submodules, which means that each subdirectory will point to a specific commit. This allows us to ensure that a fresh clone of the repository points to a stable set of commits.

However, if you are using this tree to modify or maintain Acropolis, you will probably want to track the develop branches in each submodule instead of the stable commit pointed to by default.

You can do this yourself by invoking git fetch origin develop:develop && git checkout develop in each submodule, or, if you have already configured the tree, you can run make checkout, which will attempt to do the same thing automatically.
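To switch every submodule in one go, git submodule foreach can run the same commands across the whole tree:

$ git submodule foreach 'git fetch origin develop:develop && git checkout develop'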

When to re-run autoreconf

If any of the following occur, either through your own changes or because a git pull or git checkout caused them, you should re-run autoreconf -i in the affected parts of the tree (directories or parent directories which contain a configure.ac), or from the top level:—

* a configure.ac has been added or modified;
* a Makefile.am has been added or modified;
* the contents of an m4 macro directory have changed.

Re-building part of the tree

The Automake-based build logic is designed to allow you to rebuild almost any part of the tree whenever you need to: if you are just working on Quilt, for example, you may find yourself making and testing changes almost exclusively within the quilt subdirectory for the duration of that work.

If you know your build logic changes are restricted to one particular submodule, you can change into the submodule directory and run the following:

quilt $ autoreconf -i && ./config.status --recheck && make clean
quilt $ make

Automated builds

We have configured Travis to automatically build and invoke the tests on the stack for new commits on each branch. See .travis.yml for the details.

You may wish to do similar for your own forks, if you intend to maintain them.

Contributing

If you’d like to contribute to Acropolis, fork this repository and commit your changes to the develop branch.

For larger changes, you should create a feature branch with a meaningful name, for example one derived from the issue number.

Once you are satisfied with your contribution, open a pull request describing the changes you’ve made, and a member of the development team will take a look.

Information for BBC Staff

This is an open source project which is actively maintained and developed by a team within Design and Engineering.

Finally, thanks for taking a look at this project! We hope it’ll be useful; do get in touch with us if we can help with anything (“RES-BBC” in the GAL, and we have staff in BC and PQ).

License

Copyright © 2017 BBC

The majority of the Acropolis stack is licensed under the terms of the Apache License, Version 2.0.

However, see the documentation within individual submodules for exceptions to this and information about third-party components which have been incorporated into the tree.