bbcarchdev / anansi

A Linked Open Data Web crawler
https://bbcarchdev.github.io/anansi/
Apache License 2.0

Anansi

A web crawler and crawling framework

Apache 2.0 licensed. Implemented in C.

Anansi is a web crawler which includes specific support for Linked Data and can be operated out-of-the-box or used as a framework for developing your own crawling applications.

This software was developed as part of the Research & Education Space project and is actively maintained by a development team within BBC Design and Engineering. We hope you’ll find this project useful!

Requirements

Optionally, you may also wish to install:—

On Debian-based systems, the following command installs the required packages that are generally available in APT repositories:

  sudo apt-get install -qq libjansson-dev libmysqlclient-dev libcurl4-gnutls-dev libxml2-dev librdf0-dev libltdl-dev uuid-dev automake autoconf libtool pkg-config clang build-essential xsltproc docbook-xsl-ns

Anansi has not yet been ported to non-Unix-like environments, and will install as shared libraries and command-line tools on macOS rather than frameworks and LaunchDaemons.

It ought to build inside Cygwin on Windows, but this is untested.

Contributions for building properly with Visual Studio or Xcode, and so on, are welcome (provided they do not significantly complicate the standard build logic).

Using Anansi

Configuring the crawler

The first time you install Anansi, an example crawl.conf will be installed to $(sysconfdir) (by default, /usr/local/etc).
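
As a rough illustration of how $(sysconfdir) resolves, the sketch below derives the expected location of crawl.conf from an installation prefix. It assumes sysconfdir was left at the standard Autoconf default of $prefix/etc. The name anansi_conf_path is a hypothetical helper for this example, not something Anansi ships.

```shell
#!/bin/sh
# Hypothetical helper (not part of Anansi): print where crawl.conf is
# expected for a given --prefix.
# ASSUMPTION: sysconfdir was left at the Autoconf default of $prefix/etc.
anansi_conf_path() {
    prefix=${1:-/usr/local}
    printf '%s/etc/crawl.conf\n' "$prefix"
}
```

Calling anansi_conf_path with no argument prints /usr/local/etc/crawl.conf, and anansi_conf_path /some/path prints /some/path/etc/crawl.conf. If you passed --sysconfdir to ./configure, the file will be in that directory instead.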

Invoking the crawler

The crawl daemon is installed by default as $(sbindir)/crawld, which will typically be /usr/local/sbin/crawld.

After you’ve initially configured the crawler, you should perform any database schema updates which may be required:

$ /usr/local/sbin/crawld -S

Schema updates also happen automatically whenever the daemon is launched, but the -S option lets you see the results of a first run without examining log files, and causes the daemon to terminate once the schema is up to date.

To run the crawler in the foreground, with debugging enabled:

$ /usr/local/sbin/crawld -d

Or to run it in the foreground, without debug-level verbosity:

$ /usr/local/sbin/crawld -f
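
If you would like the schema step to be checked before the crawler starts, the -S and -f invocations can be chained in a small wrapper. This is only a sketch: run_crawler_checked is a hypothetical function name, and it assumes crawld exits with a non-zero status when the schema update fails, which this document does not confirm.

```shell
#!/bin/sh
# Sketch: bring the schema up to date with -S, then run the crawler in the
# foreground with -f.
# ASSUMPTION: crawld returns a non-zero exit status if the schema update
# fails; verify this against your installed version.
run_crawler_checked() {
    crawld=${1:-/usr/local/sbin/crawld}
    if ! "$crawld" -S; then
        echo "schema update failed; not starting crawler" >&2
        return 1
    fi
    "$crawld" -f
}
```

Run it as run_crawler_checked, or pass the path to crawld as the first argument if you installed under a different prefix.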

Alternatively, to run in the background:

$ /usr/local/sbin/crawld

If you want to perform a single test fetch of a URI using your current configuration, you can do this with:

$ /usr/local/sbin/crawld -t http://example.com/somelocation

Once you’ve configured the crawler, you can add a URI to its queue using the crawler-add utility, installed as $(bindir)/crawler-add (typically /usr/local/bin/crawler-add). Note that crawld does not have to be running in order to add URIs to the queue.
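
To seed the queue with several URIs at once, a loop around crawler-add is usually enough. The sketch below is hedged: queue_uris is a hypothetical function name, and it assumes crawler-add accepts the URI to queue as its only argument, which this document does not show. Check the utility's own usage output first.

```shell
#!/bin/sh
# Sketch: queue every URI listed in a text file, one URI per line.
# ASSUMPTION: crawler-add takes the URI to queue as its only argument.
queue_uris() {
    list=$1
    add_cmd=${2:-/usr/local/bin/crawler-add}
    # The "|| [ -n ... ]" clause keeps a final line that lacks a newline.
    while IFS= read -r uri || [ -n "$uri" ]; do
        # Skip blank lines and lines starting with '#'.
        case $uri in
            ''|'#'*) continue ;;
        esac
        # Redirect stdin so the command cannot consume the rest of the list.
        "$add_cmd" "$uri" </dev/null || echo "failed to queue: $uri" >&2
    done < "$list"
}
```

For example, queue_uris seeds.txt would pass each non-comment line of a hypothetical seeds.txt to /usr/local/bin/crawler-add in turn.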

Components

Bugs and feature requests

If you’ve found a bug, or have thought of a feature that you would like to see added, you can file a new issue. A member of the development team will triage it and add it to our internal prioritised backlog for development. In the meantime, we welcome contributions and encourage forking.

Building from source

You will need git, automake, autoconf and libtool. Also see the Requirements section.

$ git clone https://github.com/bbcarchdev/anansi.git
$ cd anansi
$ git submodule update --init --recursive
$ autoreconf -i
$ ./configure --prefix=/some/path
$ make
$ make check
$ sudo make install

Automated builds

We have configured Travis to automatically build and invoke the tests on Anansi for new commits on each branch. See .travis.yml for the details.

You may wish to do something similar for your own forks, if you intend to maintain them.

The debian directory contains the logic required to build a Debian package for Anansi, except for the changelog. This logic is used by the system that auto-deploys packages for the production Research & Education Space. If you need a modified version to suit your own deployment, it’s probably easiest to maintain a fork of this repository containing your changes.

Contributing

If you’d like to contribute to Anansi, fork this repository and commit your changes to the develop branch.

For larger changes, you should create a feature branch with a meaningful name, for example one derived from the issue number.

Once you are satisfied with your contribution, open a pull request describing the changes you’ve made, and a member of the development team will take a look.

Information for BBC Staff

This is an open source project which is actively maintained and developed by a team within Design and Engineering. Please bear in mind the following:—

Finally, thanks for taking a look at this project! We hope it’ll be useful. Do get in touch with us if we can help with anything (“RES-BBC” in the GAL, and we have staff in BC and PQ).

License

Anansi is licensed under the terms of the Apache License, Version 2.0.