schivmeister opened this issue 3 months ago (status: Open)
Thanks for the very detailed bug report! I'm afraid this is an old RML spec issue: the spec underspecifies how to work with multi-valued references, which sometimes produces very weird results such as the ones you've detailed here (e.g. in combination with rr:template or a function). We're working on improving the new version of the spec, and on a more global solution using the Logical Views extension, with a PoC implementation available (and a paper being presented next month); however, that's all still in alpha stage.
So, there are actually 3 paths that can be taken in parallel, I think:
We'll check when we can dedicate some time to this bug report, but as you can imagine, as an academic institution we're always trying to find a balance with our research roadmaps/paid projects. If this is really blocking you, feel free to reach out at info@rml.io to see how we can prioritize this!
Thank you for the swift response @bjdmeest! It already helps a lot to know that I'm not (likely) making a mistake somewhere. I understand that offering a resolution is not always possible, which is totally fine. We will reach out if it indeed turns out to be a blocker.
There are potentially other solutions depending on the use case, e.g. in our case it was originally related to a lookup based on modified source values, but we decided to encode certain values in the lookup table as a workaround instead, so that we need not modify the reference.
Otherwise, I took a look again at the Logical Views extension, which I did check out briefly once before for tabular lookups. However, I don't see XML as a supported source format in the reference/PoC implementation, and I also think it attacks a different problem.
Nevertheless, I took the liberty to try and figure out where in the code this is likely happening. It appears to be an issue with the dataio library's XMLRecord.get() implementation as called in ReferenceExtractor::extract(). Trying to reproduce the issue, record.get() yields:
[CustomPrefix ABC Fast CompanyABC FastCo CustomSuffix]
instead of
[CustomPrefix ABC Fast Company CustomSuffix, CustomPrefix ABC FastCo CustomSuffix]
or in the case of a plain reference:
[ABC Fast Company, ABC FastCo]
It could very well be that the concatenation causes unexpected behaviour in the evaluation of the XPaths (using Saxon?), as a direct concat on repeating elements would otherwise raise an error of the form:
error: A sequence of more than one item is not allowed as the second argument of fn:concat() ...
Tested using:
java -cp saxon-he-12.4.jar net.sf.saxon.Query -s:test.xml -qs:"concat('CustomPrefix', /Directory/Organization[1]/Name, ' CustomSuffix')"
But we are not getting an error in the mapping itself, just unexpected concatenation, which indicates that the function works but is being evaluated on the entire set of XPath query results.
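To make the suspected difference concrete, here is a minimal Python sketch using only the standard library. It does not reproduce rmlmapper, dataio, or Saxon internals; it only contrasts applying the concatenation once to the whole result sequence with applying it once per matched node. The XML is a hypothetical reconstruction from the values quoted above, and the second organization's name ("XYZ Ltd") is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical input, reconstructed from the values quoted in this thread.
# "XYZ Ltd" is an invented placeholder for the second Organization.
XML = """
<Directory>
  <Organization>
    <Name>ABC Fast Company</Name>
    <Name>ABC FastCo</Name>
  </Organization>
  <Organization>
    <Name>XYZ Ltd</Name>
  </Organization>
</Directory>
"""

root = ET.fromstring(XML)
first_org = root.findall("Organization")[0]
names = [n.text for n in first_org.findall("Name")]

# Whole-sequence evaluation: the string values of all matches are fused
# before the prefix and suffix are applied (the behaviour observed above).
fused = ["CustomPrefix " + "".join(names) + " CustomSuffix"]

# Per-item evaluation: the concatenation is applied to each match separately
# (the behaviour expected above).
per_item = ["CustomPrefix " + name + " CustomSuffix" for name in names]

print(fused)
print(per_item)
```

The first list has a single fused string, the second has one string per Name element, which matches the two outputs contrasted above.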
I realized after all that we also have template, which works:
rr:objectMap
[
rr:template "CustomPrefix {Name} CustomSuffix" ;
rr:datatype xsd:string ; # an explicit type is required otherwise termType IRI is inferred and error raised
];
So, this is a very valid alternative for simple cases not involving other XPath expressions (added as a workaround in the original post).
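As a rough illustration of why the template behaves as hoped, the following Python sketch expands a template once per value of each referenced field. The function name expand_template and the dictionary-based record are hypothetical, and combining several multi-valued references via a Cartesian product is an assumption of this sketch, not a statement about rmlmapper or the RML spec.

```python
import re
from itertools import product

PLACEHOLDER = re.compile(r"\{([^{}]+)\}")

def expand_template(template: str, values: dict) -> list:
    """Return one expanded string per combination of reference values.

    `values` maps each reference name to the list of values found for it.
    """
    # Keep each reference once, in order of first appearance.
    refs = list(dict.fromkeys(PLACEHOLDER.findall(template)))
    results = []
    for combo in product(*(values[ref] for ref in refs)):
        binding = dict(zip(refs, combo))
        results.append(PLACEHOLDER.sub(lambda m: binding[m.group(1)], template))
    return results

expanded = expand_template(
    "CustomPrefix {Name} CustomSuffix",
    {"Name": ["ABC Fast Company", "ABC FastCo"]},
)
print(expanded)
```

With the two names from this report, the sketch yields two separate strings rather than one fused value.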
Environment
rmlmapper v6.5.1 (reproducible also as far back as v6.1.3) Linux/WSL2 Java 17, 11
Namespaces
Problem
Given the following kind of input XML with two Organization elements, where the first has two child Name elements:
and the following kind of RML mapping involving a custom concatenated value in the source reference:
Actual
Results in an unexpected output of the first resource's name, with the repeating values concatenated between the prefix and suffix, instead of multiple comma-separated RDF/Turtle values:
Expected
Should result in multiple comma-separated values mapped from the XML child elements, adhering to the condition of the reference:
Workaround
Template bypassing XPath expressions
This is perhaps the closest thing to an actual solution (if you don't need additional XPath complexity):
producing the correct result:
Plain reference with out-of-band strategies
One could skip using the reference altogether and employ a different, external technique to replicate the desired outcome, e.g. using (custom) functions, or even just looking up a mapping table using a parentTriplesMap.
Removing the concatenation obviously makes it work:
resulting in:
Reoriented iterator
Using an iterator on the child element which repeats but creating the subject using the ancestor element appears to work:
However, this is unintuitive and convoluted. The correct solution would be if repeating child elements were also repeated as values for a predicateObjectMap, as they normally are with a plain reference (or template).
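The reoriented-iterator idea can be sketched outside RML as follows. This is only a Python analogy, not the mapping itself: it iterates over the repeating child element and steps back up to the ancestor to build the subject. The id attribute, the example.org subject IRIs, and the "XYZ Ltd" value are hypothetical, since the original input is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the `id` attributes and "XYZ Ltd" are invented.
XML = """
<Directory>
  <Organization id="1">
    <Name>ABC Fast Company</Name>
    <Name>ABC FastCo</Name>
  </Organization>
  <Organization id="2">
    <Name>XYZ Ltd</Name>
  </Organization>
</Directory>
"""

root = ET.fromstring(XML)

# ElementTree elements have no parent pointer, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

pairs = []
for name in root.iter("Name"):      # iterate over the repeating child
    org = parent_of[name]           # step back up to the ancestor
    subject = "http://example.org/org/" + org.get("id")
    pairs.append((subject, "CustomPrefix " + name.text + " CustomSuffix"))

for subject, value in pairs:
    print(subject, value)
```

Because each iteration sees exactly one Name, the concatenation never receives more than one value, which is why this orientation sidesteps the fused output at the cost of a less natural mapping.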
MWE
rml-mwe-concat-multivalue.zip (excludes template example)
Context
This may or may not be related to #227 and #228.