DaniFdezAlvarez / shexer

Apache License 2.0

sheXer

This library performs automatic extraction of schemas, in Shape Expressions (ShEx) or the Shapes Constraint Language (SHACL), for a target RDF graph. Please feel free to open an issue in this repository if you find a bug in sheXer or have a feature request.

Citation

If you want to cite this software, use this work: Automatic extraction of shapes using sheXer.

If you want to read the paper but cannot access the full content using the previous link, there is a preprint available on ResearchGate.

However, please be aware that this software's capabilities have evolved and improved since the publication of that paper.

Installation

sheXer can be installed using pip:

$ pip install shexer

If you want to install sheXer from source, all its external dependencies are listed in the file requirements.txt. You can install them all using pip as well:

$ pip install -r requirements.txt

sheXer includes a package to deploy a web service exposing sheXer through a REST API. If you are not interested in deploying this web service, you don't need to install any dependency related to Flask.

Features

Experimental results

In the folder experiments, you can see some results of applying this tool over different graphs with different configurations.

Example code

The following code takes the graph in raw_graph and extracts shapes for instances of the classes http://example.org/Person and http://example.org/Gender. The input format is N-Triples and the results are serialized in ShExC to the file shaper_example.shex.

from shexer.shaper import Shaper
from shexer.consts import NT, SHEXC, SHACL_TURTLE

target_classes = [
    "http://example.org/Person",
    "http://example.org/Gender"
]

namespaces_dict = {"http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf",
                   "http://example.org/": "ex",
                   "http://weso.es/shapes/": "",
                   "http://www.w3.org/2001/XMLSchema#": "xsd"
                   }

raw_graph = """
<http://example.org/sarah> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Person> .
<http://example.org/sarah> <http://example.org/age> "30"^^<http://www.w3.org/2001/XMLSchema#int> .
<http://example.org/sarah> <http://example.org/name> "Sarah" .
<http://example.org/sarah> <http://example.org/gender> <http://example.org/Female> .
<http://example.org/sarah> <http://example.org/occupation> <http://example.org/Doctor> .
<http://example.org/sarah> <http://example.org/brother> <http://example.org/Jim> .

<http://example.org/jim> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Person> .
<http://example.org/jim> <http://example.org/age> "28"^^<http://www.w3.org/2001/XMLSchema#int> .
<http://example.org/jim> <http://example.org/name> "Jimbo".
<http://example.org/jim> <http://example.org/surname> "Mendes".
<http://example.org/jim> <http://example.org/gender> <http://example.org/Male> .

<http://example.org/Male> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Gender> .
<http://example.org/Male> <http://www.w3.org/2000/01/rdf-schema#label> "Male" .
<http://example.org/Female> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Gender> .
<http://example.org/Female> <http://www.w3.org/2000/01/rdf-schema#label> "Female" .
<http://example.org/Other> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Gender> .
<http://example.org/Other> <http://www.w3.org/2000/01/rdf-schema#label> "Other gender" .
"""

input_nt_file = "target_graph.nt"

shaper = Shaper(target_classes=target_classes,
                raw_graph=raw_graph,
                input_format=NT,
                namespaces_dict=namespaces_dict,  # Default: no prefixes
                instantiation_property="http://www.w3.org/1999/02/22-rdf-syntax-ns#type")  # Default rdf:type

output_file = "shaper_example.shex"

shaper.shex_graph(output_file=output_file,
                  acceptance_threshold=0.1)

print("Done!")
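To make concrete which nodes the example above treats as instances of the target classes, the following sketch scans N-Triples text for subjects typed with one of those classes. It uses only the standard library. The helper name instances_by_class and the regex-based parsing are illustrative assumptions; this is not how sheXer parses its input.

```python
import re
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Matches simple "<s> <p> <o> ." lines whose object is an IRI (literal objects are skipped).
IRI_TRIPLE = re.compile(r"^\s*<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$")


def instances_by_class(nt_text, target_classes, instantiation_property=RDF_TYPE):
    """Group subjects by the target class they are declared to instantiate.

    Hypothetical helper for illustration only.
    """
    found = defaultdict(set)
    for line in nt_text.splitlines():
        match = IRI_TRIPLE.match(line)
        if match is None:
            continue
        subject, predicate, obj = match.groups()
        if predicate == instantiation_property and obj in target_classes:
            found[obj].add(subject)
    return {cls: sorted(nodes) for cls, nodes in found.items()}


sample = """
<http://example.org/sarah> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Person> .
<http://example.org/jim> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Person> .
<http://example.org/jim> <http://example.org/name> "Jimbo".
<http://example.org/Male> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Gender> .
"""

result = instances_by_class(
    sample,
    ["http://example.org/Person", "http://example.org/Gender"],
)
for cls, nodes in result.items():
    print(cls, nodes)
```

Changing instantiation_property in this sketch plays the same role as the param of the same name in the example: it decides which predicate links an instance to its class.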

By default, sheXer generates ShExC. If you want to produce SHACL, pass the output_format param to the shex_graph method as follows:

# Use the same imports and param definition of the previous example code

output_file = "shaper_example.ttl"

shaper.shex_graph(output_file=output_file,
                  acceptance_threshold=0.1,
                  output_format=SHACL_TURTLE)

print("Done!")
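Both calls above pass acceptance_threshold=0.1, which this section does not define. As an assumption about its meaning, the sketch below treats the threshold as the minimum fraction of a shape's instances that must use a property for the corresponding constraint to be kept. The function filter_by_threshold and its input layout are hypothetical and do not reflect sheXer's internals.

```python
def filter_by_threshold(property_usage, n_instances, acceptance_threshold):
    """Keep properties used by at least the given fraction of instances.

    property_usage maps a property IRI to the number of instances using it.
    Hypothetical helper: the threshold semantics are an assumption.
    """
    if n_instances <= 0:
        return {}
    kept = {}
    for prop, count in property_usage.items():
        ratio = count / n_instances
        if ratio >= acceptance_threshold:
            kept[prop] = ratio
    return kept


# Property usage among the two ex:Person instances of the example graph.
person_usage = {
    "http://example.org/name": 2,
    "http://example.org/age": 2,
    "http://example.org/gender": 2,
    "http://example.org/surname": 1,
    "http://example.org/occupation": 1,
    "http://example.org/brother": 1,
}

lenient = filter_by_threshold(person_usage, n_instances=2, acceptance_threshold=0.1)
strict = filter_by_threshold(person_usage, n_instances=2, acceptance_threshold=0.6)
print(sorted(lenient))
print(sorted(strict))
```

Under this reading, a low threshold such as 0.1 keeps properties that only one of the two persons uses (for instance ex:surname), while a higher one keeps only the properties shared by both.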

You can also find some examples of how to process Wikidata with sheXer in this Jupyter notebook.

The Class Shaper

Most of the features provided by this software are reachable through the class Shaper. As shown in the previous example code, you create an instance of Shaper with some params and then call a method to perform the schema extraction.

__init__

The __init__ method of Shaper includes many params, most of them optional. Don't panic due to the high number of params. You just need to focus on three main questions:

You'll find a param in the __init__ of Shaper to provide the information in the way you want. Pass it as a keyword argument when creating your instance of Shaper (as in the example code of this document) and forget about the rest: Shaper has a default value for all of them.

The following list describes each param of the __init__ of Shaper:

Params to define target shapes:

You must indicate at least one way to identify target instances and the shapes that should be generated. Some of these params are compatible with each other and some are not. For example, sheXer does not allow you to indicate target classes and activate all-classes mode at the same time, as this is contradictory. However, you can provide a shape map to define custom node groupings and use all_classes mode too, so you obtain shapes for those groupings and for each class.
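The compatibility rule described above can be pictured as a small validation step. This is a minimal sketch, not sheXer's code: the function check_target_params and its argument names are hypothetical, chosen only to mirror the three options mentioned in the text (target classes, all-classes mode and a shape map).

```python
def check_target_params(target_classes=None, all_classes_mode=False, shape_map=None):
    """Validate a combination of target-shape options (illustrative rules only)."""
    if target_classes and all_classes_mode:
        # Contradictory: an explicit class list versus "every class in the graph".
        raise ValueError("target_classes and all_classes_mode cannot be combined")
    if not (target_classes or all_classes_mode or shape_map):
        # At least one way to identify target instances is required.
        raise ValueError("provide target classes, all-classes mode, or a shape map")
    return True


# A shape map together with all-classes mode is an accepted combination.
ok = check_target_params(all_classes_mode=True, shape_map="<placeholder shape map>")

# Target classes together with all-classes mode is rejected.
try:
    check_target_params(
        target_classes=["http://example.org/Person"],
        all_classes_mode=True,
    )
    conflict_message = None
except ValueError as error:
    conflict_message = str(error)

print(ok, conflict_message)
```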

Params to provide the input

You must provide at least one input: a file, a string, an endpoint, a remote graph... You may also want to tune some other aspects, such as the format of the input or the namespace-prefix pairs to be used.

Params to tune the shexing process

All these parameters have a default value, so you do not need to use any of them. However, you can use them to modify the schema extraction in many different ways.

Params to tune some features of the output

Again, all these params have a default value and you don't need to worry about them unless you want to tune the output.

Method shex_graph

The shex_graph method of Shaper triggers the whole inference process and returns a result. It receives several parameters, some of them optional: