I tested this build with a relatively large XML source (~300 MB), namely an OpenStreetMap extract for the city of Ghent, and the results are very promising, both in terms of performance and memory footprint:
This build fully maps the file in ~14s with the default JVM memory limit.
The current release of RML Mapper (v5.0.0) was not able to map the file after ~2h, even with all the memory available on my laptop (-Xmx16384m).
I used the following RML mapping:
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix osm: <https://w3id.org/openstreetmap/terms#> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql: <http://semweb.mmlab.be/ns/ql#> .
@prefix : <http://mapping.example.com/> .
##
# Benchmark RML mapping file to test the performance of XPath processors.
# It refers to the OpenStreetMap XML export for the city of Ghent
# from BBBike (https://download.bbbike.org/osm/bbbike/Gent/Gent.osm.gz).
##
:TriplesMapBenchmark a rr:TriplesMap;
    rml:logicalSource [
        rml:source "benchmark/Gent.osm";
        rml:referenceFormulation ql:XPath;
        rml:iterator "osm/node[tag/@k = 'highway' and tag/@v = 'bus_stop']" # select all bus stop nodes
    ];
    rr:subjectMap [
        rr:template "http://osm.example.org/node/{@lat}/{@lon}";
        rr:termType rr:IRI;
        rr:class osm:Node
    ];
    rr:predicateObjectMap [
        rr:predicate geo:lat;
        rr:objectMap [
            rr:termType rr:Literal;
            rr:datatype xsd:double;
            rr:template "{@lat}"
        ]
    ], [
        rr:predicate geo:long;
        rr:objectMap [
            rr:termType rr:Literal;
            rr:datatype xsd:double;
            rr:template "{@lon}"
        ]
    ], [
        rr:predicate rdfs:label;
        rr:objectMap [ rml:reference "tag[@k = 'name']/@v" ]
    ], [
        rr:predicate osm:operator;
        rr:objectMap [ rml:reference "tag[@k = 'operator']/@v" ]
    ].
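To make the selection logic of the mapping above concrete, here is a minimal sketch that evaluates the same iterator and reference expressions against a tiny hand-written OSM-like document. It uses the JDK's built-in XPath 1.0 engine (`javax.xml.xpath`) instead of Saxon, purely to stay dependency-free. The class name, the sample data and the `describeBusStops` helper are hypothetical and not part of this PR.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class BusStopXPathSketch {

    // Hypothetical sample data, shaped like an OSM XML export.
    static final String SAMPLE =
        "<osm>"
      + "<node id='1' lat='51.05' lon='3.72'>"
      + "<tag k='highway' v='bus_stop'/>"
      + "<tag k='name' v='Korenmarkt'/>"
      + "<tag k='operator' v='De Lijn'/>"
      + "</node>"
      + "<node id='2' lat='51.06' lon='3.73'>"
      + "<tag k='highway' v='crossing'/>"
      + "</node>"
      + "</osm>";

    // Returns one line per selected node: subject IRI, label, operator.
    static List<String> describeBusStops(String xml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Same expression as rml:iterator, evaluated from the document node.
        NodeList stops = (NodeList) xpath.evaluate(
            "osm/node[tag/@k = 'highway' and tag/@v = 'bus_stop']",
            new InputSource(new StringReader(xml)),
            XPathConstants.NODESET);

        List<String> lines = new ArrayList<>();
        for (int i = 0; i < stops.getLength(); i++) {
            Node stop = stops.item(i);
            // Same expressions as the rr:template placeholders and rml:reference values,
            // evaluated relative to each node returned by the iterator.
            String lat = xpath.evaluate("@lat", stop);
            String lon = xpath.evaluate("@lon", stop);
            String name = xpath.evaluate("tag[@k = 'name']/@v", stop);
            String operator = xpath.evaluate("tag[@k = 'operator']/@v", stop);
            lines.add("http://osm.example.org/node/" + lat + "/" + lon
                + " | " + name + " | " + operator);
        }
        return lines;
    }

    public static void main(String[] args) throws XPathExpressionException {
        // Only node 1 is tagged as a bus stop, so a single line is printed.
        describeBusStops(SAMPLE).forEach(System.out::println);
    }
}
```

Only the first node in the sample carries the `bus_stop` tag value, so the second node is skipped by the iterator. All expressions in this mapping are valid XPath 1.0, which is why both the current release and this build can run the same benchmark.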
New possibilities
This integration makes it possible to add support for XQuery, an SQL-like language for XML processing (already supported by Saxon), and for more advanced features of XPath 3, such as querying JSON with XPath.
What is this?
This PR integrates the Saxon-HE v11 library into the mapper, as an XML and XPath parser, to bring support for more advanced XPath capabilities.
Motivation
Unlike the currently used XML parser, which only supports XPath v1.0 and a handful of functions, Saxon supports up to XPath v3.1.
XPath v3 is significantly more expressive than XPath v1, and that extra expressivity is often needed when generating RDF from XML sources.
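As a small illustration of that gap, the sketch below asks the JDK's built-in XPath 1.0 engine to upper-case an attribute value in two ways: with the XPath 1.0 `translate()` workaround, and with `upper-case()`, a function that only exists from XPath 2.0 onwards. The JDK engine stands in here for "an XPath 1.0 processor" in general and is not necessarily the parser the mapper currently uses. The class name, sample data and `tryEvaluate` helper are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathVersionGapSketch {

    // Hypothetical sample data.
    static final String SAMPLE = "<node><tag k='name' v='Korenmarkt'/></node>";

    // XPath 1.0: upper-casing needs an explicit character-by-character translation.
    static final String XPATH_1 =
        "translate(/node/tag[@k='name']/@v, "
      + "'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')";

    // XPath 2.0 and later: a dedicated function exists.
    static final String XPATH_2_PLUS = "upper-case(/node/tag[@k='name']/@v)";

    // Returns the string result, or null if the engine rejects the expression.
    static String tryEvaluate(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(SAMPLE)));
        } catch (XPathExpressionException e) {
            return null; // e.g. unknown function in an XPath 1.0 engine
        }
    }

    public static void main(String[] args) {
        System.out.println("XPath 1.0 expression:  " + tryEvaluate(XPATH_1));
        System.out.println("XPath 2.0+ expression: " + tryEvaluate(XPATH_2_PLUS));
    }
}
```

With an XPath 1.0 engine the second expression is expected to be rejected, so the mapping author has to fall back to workarounds like the `translate()` call. A processor that supports XPath 3.1, such as Saxon, should accept the shorter form directly.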
Tests
This PR passes all existing XML test-cases and adds 2 new ones:
It also handles XML namespaces gracefully, including this default namespace issue. However, the implementation could be improved.
Performance
Saxon is supposed to be more performant than the default Java XML implementation.