dice-group / LIMES

Link Discovery Framework for Metric Spaces.
https://limes.demos.dice-research.org/
GNU Affero General Public License v3.0

Expose a (JENA) SPARQL Extension #116

Open Aklakan opened 6 years ago

Aklakan commented 6 years ago

While thinking about certain data integration tasks, I had the idea to do everything using (Jena) SPARQL extensions, as is done in this project. Note that Jena's extension system lets Maven dependencies register their own SPARQL extensions simply by being included, with no further code necessary. This works by specifying a start-up class in the file src/main/resources/META-INF/services/org.apache.jena.system.JenaSubsystemLifecycle.

In principle, there could be a LIMES integration where a pseudo SERVICE URL is used to invoke LIMES. The body of the LIMES service would contain the configuration, i.e. the two concepts to interlink, the properties on which to base the metric expression, the metric expression itself, and the threshold. It could look something like:

SELECT ?x ?y {
    SERVICE <http://limes> {
        SERVICE <http://dbpedia.org/sparql> { ?x a :Airport ; rdfs:label ?xl }
        SERVICE <http://linkedgeodata.org/sparql> { ?y a :Aerodrome ; rdfs:label ?yl }
        FILTER(limes:trigrams(?xl, ?yl) > 0.9)
    }
}

What do you think about this?

Aklakan commented 6 years ago

I made progress on this issue, and I have the first link spec running via SPARQL.

As interlinking is conceptually just creating a Cartesian product between the entities (or records) of two sources and filtering it, it can be represented in SPARQL as a JOIN and a FILTER on the condition. So in principle all interlinking in LIMES could be done with SPARQL alone plus some function extensions to compute the metrics. However, LIMES promises to speed this process up through clever indexing. Ideally, the example below could be run with and without the SERVICE <plugin://limes> { ... } wrapper: the former should deliver better performance using LIMES, while the latter naively constructs the Cartesian product and is thus in accordance with the formal approach to interlinking. To make that happen, all metric and conversion functions would have to be registered in Jena's function library.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX geom: <http://geovocab.org/geometry#>
PREFIX geos: <http://www.opengis.net/ont/geosparql#>
PREFIX lgdo: <http://linkedgeodata.org/ontology/>
PREFIX plugin: <plugin://>

SELECT * {
  SERVICE <plugin://limes> { ## Ideally the sparql query would still run without this SERVICE keyword
    SERVICE <http://linkedgeodata.org/sparql> {
      ?x a lgdo:RelayBox ; geom:geometry/geos:asWKT ?xl .
    }

    SERVICE <http://linkedgeodata.org/sparql> {
      ?y a lgdo:RelayBox ; geom:geometry/geos:asWKT ?yl .
    } 

    FILTER(plugin:geo_hausdorff(?xl, ?yl) < 0.0001)
  }
}
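To make the "without the wrapper" case concrete, here is a minimal Python sketch of the naive semantics: a Cartesian product of two sources followed by a filter on a metric. It is only an illustration under stated assumptions. The toy data, the `naive_link` helper and the Jaccard-over-trigrams similarity are all hypothetical and not necessarily how LIMES defines its trigram measure.

```python
from itertools import product


def trigrams(text):
    """Return the set of character trigrams of a lowercased, padded string."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard overlap of the two trigram sets, in [0, 1] (assumed metric)."""
    ta, tb = trigrams(a), trigrams(b)
    # Padding guarantees both sets are non-empty, so the union is never empty.
    return len(ta & tb) / len(ta | tb)


def naive_link(source, target, threshold):
    """Cartesian product of both sources followed by a filter on the metric.

    source and target map an entity id to its label; this mirrors a SPARQL
    JOIN without shared variables plus a FILTER on the similarity.
    """
    return [
        (x, y)
        for (x, xl), (y, yl) in product(source.items(), target.items())
        if trigram_similarity(xl, yl) > threshold
    ]


# Hypothetical toy data standing in for the two inner SERVICE blocks.
airports = {"ex:a1": "Leipzig/Halle Airport", "ex:a2": "Dresden Airport"}
aerodromes = {"ex:b1": "Leipzig/Halle Airport", "ex:b2": "Berlin Tegel"}

links = naive_link(airports, aerodromes, 0.9)
print(links)
```

Every pair is compared here, so the cost grows with the product of the two source sizes. An indexed implementation such as LIMES would be expected to return the same links while skipping most of those comparisons.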

So I added two components:

The namespace and expression handling of LIMES appears quite awkward, i.e. needlessly complex, to me. For example, function chains could be readily converted to Jena expressions; e.g. property AS lowercase->someOtherFunc RENAME x could be represented in plain SPARQL as `BIND(plugin:someOtherFunc(plugin:lowercase(?o)) AS ?x)`, which allows direct reuse of Jena's ARQ machinery. That said, I have not yet understood the generic procedure to convert all LIMES transformations to SPARQL syntax.
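The chain-to-BIND rewriting for the one example above can be sketched in a few lines of Python. This is a rough illustration, not the LIMES grammar: the `chain_to_bind` helper, the `?o` source variable and the `plugin:` prefix are assumptions, and the accepted syntax is inferred only from the single `property AS f1->f2 RENAME x` example. Functions that take extra arguments are not handled.

```python
def chain_to_bind(spec, source_var="?o", namespace="plugin:"):
    """Translate 'property AS f1->f2 RENAME x' into a SPARQL BIND clause.

    Returns a (property, bind_clause) pair. Hypothetical helper; the input
    syntax is an assumption based on one example.
    """
    prop, sep_as, rest = spec.partition(" AS ")
    chain, sep_rename, alias = rest.partition(" RENAME ")
    if not sep_as or not sep_rename:
        raise ValueError(f"expected 'property AS f1->f2 RENAME x', got {spec!r}")

    expr = source_var
    # Functions apply left to right, so each one wraps the previous expression.
    for name in chain.split("->"):
        expr = f"{namespace}{name.strip()}({expr})"

    return prop.strip(), f"BIND({expr} AS ?{alias.strip()})"


prop, bind = chain_to_bind("rdfs:label AS lowercase->someOtherFunc RENAME x")
print(prop)
print(bind)
```

The returned property would still need its own triple pattern (for example `?s rdfs:label ?o`) placed before the BIND so that `?o` is bound.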

kvndrsslr commented 4 years ago

I am interested in this, thanks @Aklakan for the suggestion (and sorry for the years-late response haha). As I am about to rewrite large portions of LIMES for v2, this will be on my wish list. I am also thinking about integration with dcat-suite for dataset management and sparql-integrate to extend the input options!

Aklakan commented 2 years ago

Just wondering whether there have been any updates in that direction? In my group we are currently working on linking tasks, and maybe it would be worthwhile to pursue this topic again, especially considering that by now I have written several Jena SPARQL extensions, e.g. for the rdf-processing-toolkit, in order to represent rather sophisticated data integration tasks as sequences of SPARQL statements. Linking is still on my wishlist. I need to check how much of my old code that integrated LIMES as a SPARQL SERVICE clause is still compatible with the current design of LIMES, but maybe after all these years I could finally provide a PR.