Vocidex: Vocabulary Index

The goal of this project is to provide a reliable and high-quality search functionality over RDF Schemas and OWL Ontologies:

Search for classes, properties and vocabularies
Index based on ElasticSearch and Lucene
Bootstrap the index from a Linked Open Vocabularies (LOV) dump
Load individual RDFS and OWL files into the index
Search using the RESTful ElasticSearch API

Setting it all up

This is how you get an index up and running, and filled with data.

1. Installing ElasticSearch

The recommended way on OS X is using Homebrew. After Homebrew is set up and configured, simply run:

brew install elasticsearch

To do: Add instructions for other operating systems...

2. Starting the server

The easiest way for development use is this, using the provided configuration file:

elasticsearch -f -D es.config=elasticsearch.yml

The -f flag starts ElasticSearch in the foreground so you can stop it with Ctrl+C.

The -D option instructs ElasticSearch to use the elasticsearch.yml configration file. This configuration places data and logs into a subdirectory elasticsearch within this repository. For production use, you may want to use a different setup.

3. Building the CLI

You need Maven. Install it if necessary (brew install maven on OS X).

mvn package

This compiles and assembles the command-line app. The result is two things:

A gzipped version of the command-line app is generated in target/vocidex-cli.tar.gz and can be deployed wherever you like
An uncompressed version of the app is in target/vocidex-cli/vocidex and can be used directly

From inside the generated app's directory, the command-line tools can be run by invoking bin/appname.

4. Populating the index

# go to CLI build dir
cd target/vocidex-cli/vocidex
# Download LOV N-Quads dump as lov_aggregator.nq, takes a while
curl -o lov_aggregator.nq http://lov.okfn.org/dataset/lov/agg/lov_aggregator.rdf 
# Load it, takes a while
bin/index-lov elasticsearch localhost lov lov_aggregator.nq

5. Test if it worked

curl 'http://localhost:9200/lov/class,property,vocabulary/_search?q=test&pretty=1'

If this returns a longish JSON response, all is good.

Command-line tool documentation

`create-index`: Index initializer

This tool connects to an ElasticSearch cluster and initializes a new index for use with Vocidex. To see its syntax:

bin/create-index

Example invocation:

# Adds an index called 'lov' on the 'elasticsearch' cluster
bin/create-index elasticsearch localhost lov

`add-vocabulary`: The ElasticSearch Vocabulary Indexer

This tool reads an RDFS or OWL file, and indexes any terms defined therein in an ElasticSearch index. To see its syntax:

bin/add-vocabulary

Example invocation:

# Indexes SKOS into the 'skos' index on the 'elasticsearch' cluster
bin/add-vocabulary elasticsearch localhost skos http://www.w3.org/2004/02/skos/core

`index-lov`: The Linked Open Vocabularies Indexer

This tool populates an ElasticSearch index with the contents of the Linked Open Vocabularies dump. The dump can be obtained here. The file needs to be downloaded, and its extension changed to .nq because otherwise Jena gets confused. It really is an N-Quads file, not an RDF/XML file. To see the tool's syntax:

bin/index-lov

Example invocation:

# Download LOV dump with right name
curl -o lov_aggregator.nq http://lov.okfn.org/dataset/lov/agg/lov_aggregator.rdf
# Indexes the dump into an index called 'lov' on the 'elasticsearch' cluster
bin/index-lov elasticsearch localhost lov lov_aggregator.nq

Executing searches

Once the ElasticSearch index is populated, the standard REST-based ElasticSearch APIs can be used to run searches.

Simple keyword ("match") queries

The following example searches for classes, properties and vocabularies in the lov index, using the keyword test:

curl 'http://localhost:9200/lov/class,property,vocabulary/_search?q=test&pretty=1'

Equivalent to:

curl -XPOST 'http://localhost:9200/lov/class,property,vocabulary/_search?pretty=1' -d '{"query":{"match":{"_all":"test"}}}'

Autocompletion

This provides an autocomplete feature on pre-tokenized (using edge_ngram [1;100]) and indexed fields *.autocomplete.

curl -XPOST 'http://localhost:9200/lov/class,property/_search?pretty=1' -d '{
  "fields" : ["uri", "prefixed", "localName"],
  "query" : {
     "multi_match" : {
         "query": "foaf:",
         "fields": ["prefixed.autocomplete","uri.autocomplete"],
          "type" : "match_phrase"
     }
  }
}'

Developing

Initializing Eclipse files:

mvn eclipse:eclipse -DdownloadSources -DdownloadJavadocs

Running the tests:

mvn test

Use the issue tracker to discuss stuff, and feel free to submit pull requests.

Structure of indexed JSON documents

Vocidex works by creating a JSON document for each entity to be indexed (classes, properties, datatypes, vocabularies), and putting them into an ElasticSearch index. Here we document the structure of these JSON documents.

Note: “term array” is a JSON array of objects, each with uri and label keys.

All Terms (classes, properties, datatypes)

type: class, property, datatype
uri: absolute URI
uri.autocomplete: edge_ngram tokenized for autocomplete over uri
prefix: Namespace prefix, either provided by LOV or manually at index time; may be absent
localName: Part after the last hash/slash
localName.autocomplete: edge_ngram tokenized for autocomplete over localName
prefixed: Prefixed name (e.g., foaf:Person), or absent if no prefix
prefixed.autocomplete: edge_ngram tokenized for autocomplete over prefixed
label: rdfs:label or similar property, or a string synthesized from the local name
comment: rdfs:comment or similar property; may be absent
vocabulary: LOV metadata about the vocabulary; may be absent uri prefix label homepage

Class

Term keys as listed above, plus:

superclasses: term array
disjointClasses: term array
equivalentClasses: term array

Property

Term keys as listed above, plus:

domains: term array
ranges: term array; each member also has either an isDatatype or isClass field with value true
superproperties: term array
inverseProperties: term array
equivalentProperties: term array
isAnnotationProperty: boolean
isObjectProperty: boolean
isDatatypeProperty: boolean
isFunctionalProperty: boolean
isInverseFunctionalProperty: boolean
isTransitiveProperty: boolean
isSymmetricProperty: boolean

Datatype

Term keys as listed above

Vocabulary

type: vocabulary
uri: absolute URI as per LOV
uri.autocomplete: edge_ngram tokenized for autocomplete over uri
prefix: conventional prefix as per LOV
prefix.autocomplete: edge_ngram tokenized for autocomplete over prefix
label: as for terms
shortLabel: curated short-form label as per LOV; may be absent
comment: as for terms
homepage: URL from LOV metadata; may be absent

cygri / vocidex

readme

Vocidex: Vocabulary Index

Setting it all up

1. Installing ElasticSearch

2. Starting the server

3. Building the CLI

4. Populating the index

5. Test if it worked

Command-line tool documentation

`create-index`: Index initializer

`add-vocabulary`: The ElasticSearch Vocabulary Indexer

`index-lov`: The Linked Open Vocabularies Indexer

Executing searches

Simple keyword ("match") queries

Autocompletion

Developing

Structure of indexed JSON documents

All Terms (classes, properties, datatypes)

Class

Property

Datatype

Vocabulary

cygri / vocidex

readme

Vocidex: Vocabulary Index

Setting it all up

1. Installing ElasticSearch

2. Starting the server

3. Building the CLI

4. Populating the index

5. Test if it worked

Command-line tool documentation

create-index: Index initializer

add-vocabulary: The ElasticSearch Vocabulary Indexer

index-lov: The Linked Open Vocabularies Indexer

Executing searches

Simple keyword ("match") queries

Autocompletion

Developing

Structure of indexed JSON documents

All Terms (classes, properties, datatypes)

Class

Property

Datatype

Vocabulary

`create-index`: Index initializer

`add-vocabulary`: The ElasticSearch Vocabulary Indexer

`index-lov`: The Linked Open Vocabularies Indexer