Service for the DataONE Linked Open Data graph.
This repository contains a deployable service that continuously updates the DataONE Linked Open Data graph. It was originally developed as a data provider for the GeoLink project but is now a core component of the DataONE services. The service is intended to be deployed to a virtual machine and run with Docker Compose, which manages the set of Docker containers the service runs in.
The main infrastructure of the service is composed of four Docker Compose services:

web
: An Apache httpd front-end serving static files and also reverse-proxying to an Apache Tomcat server running a GraphDB Lite instance, which is bundled with the OpenRDF Sesame Workbench

scheduler
: An APScheduler process that schedules jobs (e.g., updating the graph with new datasets) on the worker at specified intervals

worker
: An RQ worker process that runs scheduled jobs

redis
: A Redis instance that acts as a persistent store for the worker and saves application state

In addition to the core infrastructure services above, a set of monitoring/logging services is spun up by default. As of writing, these are mostly used for development and testing, but they may be useful in production:

elasticsearch
: An Elasticsearch instance to store, index, and support analysis of logs

logstash
: A Logstash instance to facilitate the log pipeline

kibana
: A Kibana instance to search and visualize logs

logspout
: A Logspout instance to collect logs from the Docker containers

cadvisor
: A cAdvisor instance to monitor resource usage in each Docker container

rqdashboard
: An RQ Dashboard instance to monitor jobs

As the service runs, the graph store is continuously updated as datasets are added or updated on DataONE. Another scheduled job exports the statements in the graph store and produces a Turtle dump of all statements at http://dataone.org/d1lod.ttl.
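The scheduler/worker/redis trio above can be sketched roughly as follows, assuming APScheduler and RQ as described. The job names, dotted paths, and intervals are illustrative assumptions, not the service's actual configuration:

```python
# Illustrative sketch: an APScheduler process that periodically enqueues jobs
# onto an RQ queue backed by Redis, which the worker service then executes.
# Job names and intervals below are hypothetical.

JOB_SCHEDULE = {
    "jobs.update_graph": {"minutes": 1},  # hypothetical job: pull new datasets
    "jobs.export_graph": {"hours": 24},   # hypothetical job: dump Turtle file
}

def make_scheduler(redis_host="redis"):
    """Build a scheduler that enqueues each job onto an RQ queue in Redis."""
    # Imported here so the schedule above can be inspected without the
    # full stack installed.
    from apscheduler.schedulers.blocking import BlockingScheduler
    from redis import Redis
    from rq import Queue

    queue = Queue(connection=Redis(host=redis_host))  # "redis" = compose service
    scheduler = BlockingScheduler()
    for job_path, interval in JOB_SCHEDULE.items():
        # RQ's enqueue accepts a dotted-path string naming the function to run.
        scheduler.add_job(queue.enqueue, "interval", args=[job_path], **interval)
    return scheduler

# e.g.: make_scheduler().start()  # blocks, enqueueing jobs at each interval
```

Keeping the scheduler and worker as separate services means a long-running job only delays other jobs, not the schedule itself.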
The repository is laid out as follows:

.
├── d1lod # Python package which supports other services
├── docs # Detailed documentation beyond this file
├── logspout # Custom Dockerfile for logspout
├── logstash # Custom Dockerfile for logstash
├── redis # Custom Dockerfile for Redis
├── rqdashboard # Custom Dockerfile for RQ Dashboard
├── scheduler # Custom Dockerfile for APScheduler process
├── web # Apache httpd + Tomcat w/ GraphDB
├── worker # Custom Dockerfile for RQWorker process
└── www # Local volume holding static files
Note: In order to run the service without modification, you will need to create a 'webapps' directory in the root of this repository containing 'openrdf-workbench.war' and 'openrdf-sesame.war':
.
├── webapps
│   ├── openrdf-sesame.war
│   └── openrdf-workbench.war
These aren't included in the repository because we're using GraphDB Lite, which doesn't have a public download URL. The WAR files can just be the base Sesame WAR files, which support a variety of backend graph stores, but the code near https://github.com/ec-geolink/d1lod/blob/master/d1lod/d1lod/sesame/store.py#L90 will need to be modified correspondingly.
For an overview of what concepts the graph contains, see the mappings documentation.
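Once the service is running, those concepts can also be explored directly over Sesame's HTTP API. A minimal sketch, using only the standard library and assuming the default endpoint below and a hypothetical repository name "d1lod" (adjust both for your deployment):

```python
# Sketch: list the distinct rdf:type values in the graph via Sesame's
# repository REST endpoint. SESAME_BASE and REPOSITORY are assumptions.
import json
import urllib.parse
import urllib.request

SESAME_BASE = "http://localhost:8080/openrdf-sesame"  # assumed endpoint
REPOSITORY = "d1lod"  # hypothetical repository name

def sparql_query_url(base, repository, query):
    """Build a SPARQL query URL for Sesame's repository REST endpoint."""
    return "%s/repositories/%s?%s" % (
        base, repository, urllib.parse.urlencode({"query": query}))

def list_classes():
    """Return the distinct classes (rdf:type values) used in the graph."""
    url = sparql_query_url(
        SESAME_BASE, REPOSITORY,
        "SELECT DISTINCT ?class WHERE { ?s a ?class }")
    req = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(req) as resp:
        bindings = json.load(resp)["results"]["bindings"]
    return [b["class"]["value"] for b in bindings]

# e.g.: print("\n".join(list_classes()))
```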
Assuming you are set up to use Docker (see the User Guide if you aren't):
git clone https://github.com/DataONEorg/slinky
cd slinky
# Create a webapps folder with openrdf-sesame.war and openrdf-workbench.war (See above note)
docker-compose up # May take a while
After running the docker-compose
command above, the services should be started and available (where applicable) on their respective ports:
$DOCKER_HOST:8080/openrdf-workbench/
$DOCKER_HOST:5601
$DOCKER_HOST:8888
Where $DOCKER_HOST
is localhost
if you're running Docker natively, or some IP address if you're running Docker Machine; consult the Docker Machine documentation to find that address. When deployed on a Linux machine, Docker binds to localhost under the default configuration.
Tests are written using PyTest. Install PyTest with
pip install pytest
cd d1lod
py.test
As of writing, only tests for the supporting Python package (in the './d1lod' directory) have been written. Note: The test suite assumes you have an instance of OpenRDF Sesame running at http://localhost:8080, which means the Workbench is located at http://localhost:8080/openrdf-workbench and the Sesame interface is available at http://localhost:8080/openrdf-sesame.
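A sketch of a test in the same style, which skips itself when no Sesame instance is reachable. The endpoint matches the note above; the protocol-version check is illustrative:

```python
# Sketch: a PyTest test that is skipped unless a Sesame instance answers at
# the assumed local endpoint. Sesame's /protocol endpoint returns the REST
# protocol version as plain text.
import urllib.request

import pytest

SESAME = "http://localhost:8080/openrdf-sesame"  # assumed endpoint

def sesame_available(base=SESAME, timeout=2):
    """True if a Sesame instance answers at base."""
    try:
        urllib.request.urlopen(base + "/protocol", timeout=timeout)
        return True
    except OSError:
        return False

@pytest.mark.skipif(not sesame_available(), reason="no Sesame at " + SESAME)
def test_protocol_version():
    with urllib.request.urlopen(SESAME + "/protocol") as resp:
        assert int(resp.read()) >= 4
```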