RMLio / rmlmapper-java

The RMLMapper executes RML rules to generate high quality Linked Data from multiple originally (semi-)structured data sources
http://rml.io
MIT License

Add support for XPath v2 and v3 with Saxon-HE library #158

Closed julianrojas87 closed 2 years ago

julianrojas87 commented 2 years ago

What is this?

This PR integrates the Saxon-HE v11 library into the mapper as an XML and XPath parser, to bring support for more advanced XPath capabilities.

Motivation

Unlike the currently used XML parser, which only supports XPath v1.0 and a handful of functions, Saxon supports up to XPath v3.1.

XPath v3 is significantly more expressive than XPath v1, and that extra expressivity is often needed when generating RDF from XML sources.
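To make the gap concrete, here is a minimal sketch that only uses the JDK's built-in `javax.xml.xpath` API (XPath 1.0). It assumes a stock JDK with no other XPath provider on the classpath; the class name `XPath1LimitDemo` and the helper `compiles` are made up for illustration and are not part of the mapper. `concat` is an XPath 1.0 function, while `upper-case` was only introduced in XPath 2.0.

```java
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

public class XPath1LimitDemo {

    // Hypothetical helper: reports whether the JDK's built-in XPath 1.0
    // engine accepts the given expression at compile time.
    static boolean compiles(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            xpath.compile(expression);
            return true;
        } catch (XPathExpressionException e) {
            // Unknown functions (anything beyond the XPath 1.0 core library) end up here.
            return false;
        }
    }

    public static void main(String[] args) {
        // concat() is part of XPath 1.0.
        System.out.println("concat:     " + compiles("concat(name, ' stop')"));
        // upper-case() is an XPath 2.0 function, unknown to the JDK engine.
        System.out.println("upper-case: " + compiles("upper-case(name)"));
    }
}
```

With a processor that implements XPath 2.0 or later, such as Saxon, the second expression is valid.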

Tests

This PR passes all existing XML test-cases and adds 2 new ones:

It also handles XML namespaces gracefully, including this default namespace issue. However, the implementation could be improved.

Performance

Saxon is supposed to be more performant than the default Java XML implementation.

So I tested this build with a relatively large XML source (~300 MB), namely an OpenStreetMap extract for the city of Ghent, and the results are very promising, both in terms of performance and memory footprint:

I used the following RML mapping:

@prefix rr:   <http://www.w3.org/ns/r2rml#> .
@prefix osm:  <https://w3id.org/openstreetmap/terms#> .
@prefix geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rml:  <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:   <http://semweb.mmlab.be/ns/ql#> .
@prefix : <http://mapping.example.com/> .

## 
# Benchmark RML mapping file to test the performance of XPath processors.
# It refers to the OpenStreetMap XML export for the city of Ghent 
# from BBBike (https://download.bbbike.org/osm/bbbike/Gent/Gent.osm.gz).
##

:TriplesMapBenchmark a rr:TriplesMap;
  rml:logicalSource [
    rml:source "benchmark/Gent.osm";
    rml:referenceFormulation ql:XPath;
    rml:iterator "osm/node[tag/@k = 'highway' and tag/@v = 'bus_stop']" # select all bus stop nodes
  ];

  rr:subjectMap [
    rr:template "http://osm.example.org/node/{@lat}/{@lon}"; 
    rr:termType rr:IRI;
    rr:class osm:Node
  ];

  rr:predicateObjectMap [
      rr:predicate geo:lat;
      rr:objectMap [
          rr:termType rr:Literal;
          rr:datatype xsd:double;
          rr:template "{@lat}" 
      ]
  ], [
      rr:predicate geo:long;
      rr:objectMap [
          rr:termType rr:Literal;
          rr:datatype xsd:double;
          rr:template "{@lon}" 
      ]
  ],[
      rr:predicate rdfs:label;
      rr:objectMap [ rml:reference "tag[@k = 'name']/@v" ]
  ], [
      rr:predicate osm:operator;
      rr:objectMap [ rml:reference "tag[@k = 'operator']/@v" ]
  ].
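For readers who want to see what the iterator and references in this mapping select, here is a rough sketch that evaluates the same XPath expressions with the JDK's built-in XPath 1.0 engine. It is not how the mapper itself is implemented. The two-node `SAMPLE` document, the class name `BusStopIteratorDemo` and the method `describeBusStops` are invented for illustration; only the XPath strings and the subject template come from the mapping above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class BusStopIteratorDemo {

    // Hypothetical sample in the OSM XML shape; not taken from the Ghent extract.
    static final String SAMPLE =
        "<osm>"
      + "<node id='1' lat='51.04' lon='3.72'>"
      + "<tag k='highway' v='bus_stop'/>"
      + "<tag k='name' v='Example Stop'/>"
      + "<tag k='operator' v='Example Operator'/>"
      + "</node>"
      + "<node id='2' lat='51.05' lon='3.73'>"
      + "<tag k='amenity' v='cafe'/>"
      + "</node>"
      + "</osm>";

    // Applies the mapping's iterator, then its templates and references, per matched node.
    static List<String> describeBusStops(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // rml:iterator from the mapping, evaluated relative to the document node.
        NodeList nodes = (NodeList) xpath.evaluate(
            "osm/node[tag/@k = 'highway' and tag/@v = 'bus_stop']",
            doc, XPathConstants.NODESET);

        List<String> rows = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            // {@lat} and {@lon} from the subject template.
            String lat = xpath.evaluate("@lat", node);
            String lon = xpath.evaluate("@lon", node);
            // rml:reference expressions for rdfs:label and osm:operator.
            String name = xpath.evaluate("tag[@k = 'name']/@v", node);
            String operator = xpath.evaluate("tag[@k = 'operator']/@v", node);
            rows.add("http://osm.example.org/node/" + lat + "/" + lon
                + " | " + name + " | " + operator);
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        describeBusStops(SAMPLE).forEach(System.out::println);
    }
}
```

Note that the iterator's predicate tests `@k` and `@v` independently, so it also matches a node whose `highway` key and `bus_stop` value sit on different tags. If that matters for a given dataset, `tag[@k = 'highway' and @v = 'bus_stop']` is the stricter form.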

New possibilities

This integration makes it possible to add support for XQuery, a SQL-like language for XML processing (already supported by Saxon), and for more advanced features of XPath 3, such as JSON querying with XPath.

DylanVanAssche commented 2 years ago

Merged in development.