RMLio / rmlmapper-java

The RMLMapper executes RML rules to generate high quality Linked Data from multiple originally (semi-)structured data sources
http://rml.io
MIT License
XML predicate mapping repeating child elements getting concatenated if reference includes concatenation #235

Open schivmeister opened 3 months ago

schivmeister commented 3 months ago

Environment

rmlmapper v6.5.1 (also reproducible as far back as v6.1.3), Linux/WSL2, Java 17 and 11

Namespaces

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql: <http://semweb.mmlab.be/ns/ql#> .
@prefix ex: <http://data.example.org/resource/> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix fnml:   <http://semweb.mmlab.be/ns/fnml#> .
@prefix fno: <https://w3id.org/function/ontology#> .
@prefix idlab-fn: <http://example.com/idlab/function/> .

Problem

Given the following kind of input XML with two Organization elements, where the first has two child Name elements:

<Directory>
    <Organization>
        <ID>123</ID>
        <Name>ABC Fast Company</Name>
        <Name>ABC FastCo</Name>
    </Organization>
    <Organization>
        <ID>456</ID>
        <Name>XYZ Inc.</Name>
    </Organization> 
</Directory>

and the following kind of RML mapping involving a custom concatenated value in the source reference:

ex:Organizations a rr:TriplesMap;
    rml:logicalSource [
        rml:source "test.xml";
        rml:iterator "/Directory/Organization";
        rml:referenceFormulation ql:XPath
    ];
    rr:subjectMap [
        rr:template "http://data.example.org/resource/Organization_{ID}";
        rr:class org:Organization
    ];
    rr:predicateObjectMap [
        rr:predicate org:name;
        rr:objectMap
            [
                rml:reference "'CustomPrefix ' || Name || ' CustomSuffix'"
            ];
    ]
.

Actual

The first resource's name unexpectedly comes out as a single literal, with the repeating values concatenated between the prefix and suffix, instead of multiple comma-separated RDF/Turtle values:

ex:Organization_123 a org:Organization;
  org:name "CustomPrefix ABC Fast CompanyABC FastCo CustomSuffix" . # these are values from two Name elements

# the second resource remains unaffected (correctly formed)
ex:Organization_456 a org:Organization;
  org:name "CustomPrefix XYZ Inc. CustomSuffix" .

Expected

Each repeating XML child element should be mapped to its own value, with the prefix and suffix from the reference applied to each, resulting in multiple comma-separated values:

ex:Organization_123 a org:Organization;
  org:name "CustomPrefix ABC Fast Company CustomSuffix", "CustomPrefix ABC FastCo CustomSuffix" .
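
To make the difference between the actual and expected behaviour concrete, here is a minimal Python sketch using only the standard library. It is not RMLMapper code and does not evaluate the RML reference itself; the helper names `actual_names` and `expected_names` are made up for illustration, and the joining logic simply imitates the output observed above.

```python
import xml.etree.ElementTree as ET

XML = """<Directory>
    <Organization>
        <ID>123</ID>
        <Name>ABC Fast Company</Name>
        <Name>ABC FastCo</Name>
    </Organization>
    <Organization>
        <ID>456</ID>
        <Name>XYZ Inc.</Name>
    </Organization>
</Directory>"""

PREFIX, SUFFIX = "CustomPrefix ", " CustomSuffix"


def actual_names(org):
    # Imitates the observed output: all Name values are joined first,
    # then the prefix and suffix are applied once.
    joined = "".join(name.text for name in org.findall("Name"))
    return [PREFIX + joined + SUFFIX]


def expected_names(org):
    # Expected output: prefix and suffix are applied to every Name value.
    return [PREFIX + name.text + SUFFIX for name in org.findall("Name")]


root = ET.fromstring(XML)
orgs = root.findall("Organization")
actual = {org.findtext("ID"): actual_names(org) for org in orgs}
expected = {org.findtext("ID"): expected_names(org) for org in orgs}

print(actual["123"])    # one literal containing both names
print(expected["123"])  # two literals, one per Name element
```

For Organization 456, which has a single Name, both helpers return the same one-element list, matching the fact that the second resource is unaffected.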

Workaround

Template bypassing XPath expressions

This is perhaps the closest thing to an actual solution (if you don't need additional XPath complexity):

        rr:objectMap
            [
                rr:template "CustomPrefix {Name} CustomSuffix" ;
                rr:datatype xsd:string ; # an explicit type is required otherwise termType IRI is inferred and error raised
            ];

producing the correct result:

ex:Organization_123 a org:Organization;
  org:name "CustomPrefix ABC Fast Company CustomSuffix", "CustomPrefix ABC FastCo CustomSuffix" .

ex:Organization_456 a org:Organization;
  org:name "CustomPrefix XYZ Inc. CustomSuffix" .

Plain reference with out-of-band strategies

One could skip the concatenating reference altogether and replicate the desired outcome by other, external means, e.g. using (custom) functions, or even just looking up a mapping table using a parentTriplesMap.

Removing the concatenation obviously makes it work:

        rr:objectMap
            [
                rml:reference "Name"
            ];

resulting in:

ex:Organization_123 a org:Organization;
  org:name "ABC Fast Company", "ABC FastCo" .

Reoriented iterator

Using an iterator on the child element which repeats but creating the subject using the ancestor element appears to work:

ex:Organizations a rr:TriplesMap;
    rml:logicalSource [
        rml:source "test.xml";
        rml:iterator "/Directory/Organization/Name";
        rml:referenceFormulation ql:XPath
    ];
    rr:subjectMap [
        rr:template "http://data.example.org/resource/Organization_{../ID}";
        rr:class org:Organization
    ];
    rr:predicateObjectMap [
        rr:predicate org:name;
        rr:objectMap
            [
                rml:reference "'CustomPrefix ' || . || ' CustomSuffix'"
            ];
    ]
.

However, this is unintuitive and convoluted. The correct behaviour would be for repeating child elements to be repeated as values of a predicateObjectMap as well, as they normally are with a plain reference (or template).
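
For readers less familiar with the `../ID` trick, the reoriented iteration can be sketched in plain Python as follows. This only mimics the idea (iterate over the repeating child, look up the ancestor for the subject) and is not how RMLMapper evaluates the mapping; the `parent_of` lookup is an assumption needed because `xml.etree.ElementTree` has no parent pointers.

```python
import xml.etree.ElementTree as ET

XML = """<Directory>
    <Organization>
        <ID>123</ID>
        <Name>ABC Fast Company</Name>
        <Name>ABC FastCo</Name>
    </Organization>
    <Organization>
        <ID>456</ID>
        <Name>XYZ Inc.</Name>
    </Organization>
</Directory>"""

root = ET.fromstring(XML)

# Build a child -> parent lookup to emulate the "../ID" step.
parent_of = {child: parent for parent in root.iter() for child in parent}

triples = []
for name in root.findall("Organization/Name"):  # plays the role of rml:iterator
    org_id = parent_of[name].findtext("ID")     # plays the role of {../ID}
    subject = f"http://data.example.org/resource/Organization_{org_id}"
    triples.append((subject, "org:name", f"CustomPrefix {name.text} CustomSuffix"))

for triple in triples:
    print(triple)
```

Because each iteration sees exactly one Name, the concatenation is never applied to more than one value at a time, which is why this orientation sidesteps the problem.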

MWE

rml-mwe-concat-multivalue.zip (excludes template example)

Context

This may or may not be related to #227 #228.

bjdmeest commented 2 months ago

Thanks for the very detailed bug report! I'm afraid this is an old RML spec issue: how to work with multi-valued references is underspecified (sometimes giving very weird results, as you've detailed here, e.g. in combination with rr:template or a function). We're working on improving the new version of the spec and on a more global solution using the Logical Views extension, with a PoC implementation available (and a paper being presented next month); however, that's all still in alpha stage.

So, there are actually 3 paths that can be taken in parallel, I think:

We'll check when we can dedicate some time to this bug report, but as you can imagine, as an academic institution we're always trying to find a balance with our research roadmaps/paid projects. If this is really blocking you, feel free to reach out at info@rml.io to see how we can prioritize this!

schivmeister commented 2 months ago

Thank you for the swift response @bjdmeest! It already helps a lot to know that I'm not (likely) making a mistake somewhere. I understand that offering a resolution is not always possible, which is totally fine. We will reach out if it indeed turns out to be a blocker.

There are potentially other solutions depending on the use case, e.g. in our case it was originally related to a lookup based on modified source values, but we decided to encode certain values in the lookup table as a workaround instead, so that we need not modify the reference.

Otherwise, I took a look again at the Logical Views extension, which I did check out briefly once before for tabular lookups. However, I don't see XML as a supported source format in the reference/PoC implementation, and I also think it attacks a different problem.

Nevertheless, I took the liberty of trying to figure out where in the code this is likely happening. It appears to be an issue with the dataio library's XMLRecord.get() implementation as called in ReferenceExtractor::extract(). When trying to reproduce the issue, record.get() yields:

[CustomPrefix ABC Fast CompanyABC FastCo CustomSuffix]

instead of

[CustomPrefix ABC Fast Company CustomSuffix, CustomPrefix ABC FastCo CustomSuffix]

or in the case of a plain reference:

[ABC Fast Company, ABC FastCo]

It could very well be that the concatenation causes unexpected behaviour in the evaluation of the XPaths (using Saxon?), as a direct concat on repeating elements would otherwise raise an error of the form:

error: A sequence of more than one item is not allowed as the second argument of fn:concat() ...

Tested using:

java -cp saxon-he-12.4.jar net.sf.saxon.Query -s:test.xml -qs:"concat('CustomPrefix', /Directory/Organization[1]/Name, ' CustomSuffix')"

But we are not getting an error in the mapping itself, just unexpected concatenation, which indicates that the function works but is being evaluated on the entire set of XPath query results.
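
To illustrate the distinction I mean, here is a small Python sketch. Both functions are hypothetical stand-ins written for this comment (they are not Saxon or RMLMapper APIs): `strict_concat` imitates the cardinality check that produces the Saxon error above, and `lenient_concat` imitates the result seen in the mapping output, where the whole sequence ends up in the string.

```python
def strict_concat(*args):
    """Hypothetical stand-in for a concat that rejects multi-item sequences."""
    parts = []
    for position, arg in enumerate(args, start=1):
        items = arg if isinstance(arg, list) else [arg]
        if len(items) > 1:
            raise TypeError(
                f"a sequence of more than one item is not allowed "
                f"as argument {position} of concat()"
            )
        parts.append(items[0] if items else "")
    return "".join(parts)


def lenient_concat(*args):
    """Hypothetical stand-in for the observed result: sequences are flattened."""
    return "".join("".join(a) if isinstance(a, list) else a for a in args)


names = ["ABC Fast Company", "ABC FastCo"]  # the two Name values of Organization 123

print(lenient_concat("CustomPrefix ", names, " CustomSuffix"))

try:
    strict_concat("CustomPrefix ", names, " CustomSuffix")
except TypeError as err:
    print("error:", err)
```

Whether the mapper actually takes a path like `lenient_concat` internally is an assumption on my part; the sketch only shows why the two outcomes differ.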

schivmeister commented 2 months ago

I realized after all that we also have template, which works:

        rr:objectMap
            [
                rr:template "CustomPrefix {Name} CustomSuffix" ;
                rr:datatype xsd:string ; # an explicit type is required otherwise termType IRI is inferred and error raised
            ];

So, this is a very valid alternative for simple cases not involving other XPath expressions (added as a workaround in the original post).