RMLio / rmlmapper-java

The RMLMapper executes RML rules to generate high quality Linked Data from multiple originally (semi-)structured data sources
http://rml.io
MIT License

XML namespace needs to be declared in XML file? #144

Closed mcm104 closed 2 years ago

mcm104 commented 2 years ago

Hello!

I'm having trouble writing maps for our prefixed XML, specifically when it comes to attributes with the prefix xml:.

Here is a sample of our XML data:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
   xmlns:rdaw="http://rdaregistry.info/Elements/w/"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>
   <rdf:Description rdf:about="http://example.org/example">
      <rdaw:P10223 xml:lang="ru">Zapiski iz mertvogo doma</rdaw:P10223>
   </rdf:Description>
</rdf:RDF>

Here is the RML map I have written:

@prefix bf: <http://id.loc.gov/ontologies/bibframe/>.
@prefix ex: <http://example.org/rules/>.
@prefix rr: <http://www.w3.org/ns/r2rml#>.
@prefix rml: <http://semweb.mmlab.be/ns/rml#>.
@prefix ql: <http://semweb.mmlab.be/ns/ql#>.

ex:ExampleMap a rr:TriplesMap;
   rml:logicalSource [
      rml:source "RML_demo_data.xml";
      rml:referenceFormulation ql:XPath;
      rml:iterator "/rdf:RDF/rdf:Description"
   ];

   rr:subjectMap [
      rml:reference "@rdf:about";
      rr:class bf:Work
   ];

   rr:predicateObjectMap [
      rr:predicate bf:title;
      rr:objectMap [
         rml:reference "rdaw:P10223";
         rr:termType rr:Literal;
         rml:languageMap [
            rml:reference "rdaw:P10223/@xml:lang"
         ]
      ]
   ].

When I run this through the mapper, this is the output:

javax.xml.transform.TransformerException: Prefix must resolve to a namespace: xml
    at java.xml/com.sun.org.apache.xpath.internal.compiler.XPathParser.error(XPathParser.java:621)
    at java.xml/com.sun.org.apache.xpath.internal.compiler.Lexer.mapNSTokens(Lexer.java:637)
    at java.xml/com.sun.org.apache.xpath.internal.compiler.Lexer.tokenize(Lexer.java:360)
    at java.xml/com.sun.org.apache.xpath.internal.compiler.Lexer.tokenize(Lexer.java:99)
    at java.xml/com.sun.org.apache.xpath.internal.compiler.XPathParser.initXPath(XPathParser.java:115)
    at java.xml/com.sun.org.apache.xpath.internal.XPath.<init>(XPath.java:178)
    at java.xml/com.sun.org.apache.xpath.internal.XPath.<init>(XPath.java:268)
    at java.xml/com.sun.org.apache.xpath.internal.jaxp.XPathImpl.compile(XPathImpl.java:162)
    at be.ugent.rml.records.XMLRecord.get(XMLRecord.java:49)
    at be.ugent.rml.extractor.ReferenceExtractor.extract(ReferenceExtractor.java:31)
    at be.ugent.rml.extractor.ReferenceExtractor.execute(ReferenceExtractor.java:41)
    at be.ugent.rml.termgenerator.LiteralGenerator.lambda$generate$0(LiteralGenerator.java:67)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
    at be.ugent.rml.termgenerator.LiteralGenerator.generate(LiteralGenerator.java:63)
    at be.ugent.rml.Executor.generatePredicateObjectGraphs(Executor.java:277)
    at be.ugent.rml.Executor.executeWithFunctionV5(Executor.java:233)
    at be.ugent.rml.Executor.executeV5(Executor.java:152)
    at be.ugent.rml.cli.Main.main(Main.java:369)
    at be.ugent.rml.cli.Main.main(Main.java:44)
--------------- linked to ------------------
javax.xml.xpath.XPathExpressionException: javax.xml.transform.TransformerException: Prefix must resolve to a namespace: xml
    at java.xml/com.sun.org.apache.xpath.internal.jaxp.XPathImpl.compile(XPathImpl.java:170)
    at be.ugent.rml.records.XMLRecord.get(XMLRecord.java:49)
    at be.ugent.rml.extractor.ReferenceExtractor.extract(ReferenceExtractor.java:31)
    at be.ugent.rml.extractor.ReferenceExtractor.execute(ReferenceExtractor.java:41)
    at be.ugent.rml.termgenerator.LiteralGenerator.lambda$generate$0(LiteralGenerator.java:67)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
    at be.ugent.rml.termgenerator.LiteralGenerator.generate(LiteralGenerator.java:63)
    at be.ugent.rml.Executor.generatePredicateObjectGraphs(Executor.java:277)
    at be.ugent.rml.Executor.executeWithFunctionV5(Executor.java:233)
    at be.ugent.rml.Executor.executeV5(Executor.java:152)
    at be.ugent.rml.cli.Main.main(Main.java:369)
    at be.ugent.rml.cli.Main.main(Main.java:44)
Caused by: javax.xml.transform.TransformerException: Prefix must resolve to a namespace: xml
    at java.xml/com.sun.org.apache.xpath.internal.compiler.XPathParser.error(XPathParser.java:621)
    at java.xml/com.sun.org.apache.xpath.internal.compiler.Lexer.mapNSTokens(Lexer.java:637)
    at java.xml/com.sun.org.apache.xpath.internal.compiler.Lexer.tokenize(Lexer.java:360)
    at java.xml/com.sun.org.apache.xpath.internal.compiler.Lexer.tokenize(Lexer.java:99)
    at java.xml/com.sun.org.apache.xpath.internal.compiler.XPathParser.initXPath(XPathParser.java:115)
    at java.xml/com.sun.org.apache.xpath.internal.XPath.<init>(XPath.java:178)
    at java.xml/com.sun.org.apache.xpath.internal.XPath.<init>(XPath.java:268)
    at java.xml/com.sun.org.apache.xpath.internal.jaxp.XPathImpl.compile(XPathImpl.java:162)
    ... 11 more

@prefix bf: <http://id.loc.gov/ontologies/bibframe/> .

<http://example.org/example> a bf:Work .

I thought this was strange, because according to the W3C the xml: namespace does not need to be declared, yet the XPath parser used here gives me an error. Sure enough, if I do declare the namespace:

Modified XML data:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
   xmlns:rdaw="http://rdaregistry.info/Elements/w/"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:xml="http://www.w3.org/XML/1998/namespace"
>
   <rdf:Description rdf:about="http://example.org/example">
      <rdaw:P10223 xml:lang="ru">Zapiski iz mertvogo doma</rdaw:P10223>
   </rdf:Description>
</rdf:RDF>

And I keep the RML the same, I get my desired output:

@prefix bf: <http://id.loc.gov/ontologies/bibframe/> .

<http://example.org/example> a bf:Work;
   bf:title "Zapiski iz mertvogo doma"@ru .

I guess I'm wondering if this is an RML Mapper issue, an issue with this particular XPath parser, or if we're just going to have to go through all of our data and declare that XML namespace even though we shouldn't have to... Any thoughts or suggestions?

Thanks in advance!

DylanVanAssche commented 2 years ago

Hi @mcm104 !

XML namespaces cannot be defined yet using RML mapping rules. Therefore, the XPath engine does not know about any namespaces.

mcm104 commented 2 years ago

I'm not sure what further information you wanted, but to clarify: we have been using RML with our prefixed XML for a while now. It is only with the current version of the RML Mapper that we have run into this issue.

With earlier versions of the mapper, we did not need to include prefixes in our XPaths at all. The above RML I included in my original issue would have been written as follows:

@prefix bf: <http://id.loc.gov/ontologies/bibframe/>.
@prefix ex: <http://example.org/rules/>.
@prefix rr: <http://www.w3.org/ns/r2rml#>.
@prefix rml: <http://semweb.mmlab.be/ns/rml#>.
@prefix ql: <http://semweb.mmlab.be/ns/ql#>.

ex:ExampleMap a rr:TriplesMap;
   rml:logicalSource [
      rml:source "RML_demo_data.xml";
      rml:referenceFormulation ql:XPath;
      rml:iterator "/RDF/Description"
   ];

   rr:subjectMap [
      rml:reference "@about";
      rr:class bf:Work
   ];

   rr:predicateObjectMap [
      rr:predicate bf:title;
      rr:objectMap [
         rml:reference "P10223";
         rr:termType rr:Literal;
         rml:languageMap [
            rml:reference "P10223/@lang"
         ]
      ]
   ].

We were writing RML successfully this way for a long time. When someone on our team newly downloaded the mapper a couple weeks ago (version 4.13.0), these maps were no longer working for them. The solution we arrived at was to write the prefixes into the XPaths after all, which worked for the vast majority of XML elements and attributes we need to reference, all prefixed. However, this is where we ran into the above issue with the XML namespace -- not XML namespaces in general -- THE specific namespace for the xml: prefix.

Other namespaces are declared in our XML data as needed (those are all working just fine), but the xml: namespace isn't declared in our data, because it is a reserved namespace in XML and declaring it is not required in well-formed XML. Despite this, the RML Mapper throws the error I copy/pasted in above, because I have not explicitly declared the xml: namespace in our data. I'm surprised that an XPath parser that has the ability to parse namespaces in XML wouldn't be coded to anticipate this.
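
The point about the reserved prefix can be checked outside the mapper. The following is a minimal sketch, assuming the JDK's built-in namespace-aware DOM parser; the class and method names are hypothetical, and the sample string is a condensed copy of the data from this thread. It parses the XML without any xmlns:xml declaration and reads back the namespace the parser assigned to the xml:lang attribute.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of the RMLMapper code base.
class ReservedXmlPrefixCheck {
    // Sample data from this thread; note that xmlns:xml is NOT declared.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
      + "<rdf:RDF xmlns:rdaw=\"http://rdaregistry.info/Elements/w/\""
      + " xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
      + "<rdf:Description rdf:about=\"http://example.org/example\">"
      + "<rdaw:P10223 xml:lang=\"ru\">Zapiski iz mertvogo doma</rdaw:P10223>"
      + "</rdf:Description></rdf:RDF>";

    // Returns the namespace URI bound to the xml:lang attribute of the
    // first rdaw:P10223 element, or null if no such attribute is found.
    static String namespaceOfLangAttribute(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespace lookups
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element title = (Element) doc.getElementsByTagNameNS(
                "http://rdaregistry.info/Elements/w/", "P10223").item(0);
            Attr lang = title.getAttributeNodeNS(XMLConstants.XML_NS_URI, "lang");
            return lang == null ? null : lang.getNamespaceURI();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(namespaceOfLangAttribute(SAMPLE));
    }
}
```

The parser accepts the document and binds xml: on its own, so the missing declaration only becomes a problem later, when an XPath expression containing the prefix is compiled.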

For now, we will go forward implementing the following workaround:

@prefix bf: <http://id.loc.gov/ontologies/bibframe/>.
@prefix ex: <http://example.org/rules/>.
@prefix rr: <http://www.w3.org/ns/r2rml#>.
@prefix rml: <http://semweb.mmlab.be/ns/rml#>.
@prefix ql: <http://semweb.mmlab.be/ns/ql#>.

ex:ExampleMap a rr:TriplesMap;
   rml:logicalSource [
      rml:source "RML_demo_data.xml";
      rml:referenceFormulation ql:XPath;
      rml:iterator "/rdf:RDF/rdf:Description"
   ];

   rr:subjectMap [
      rml:reference "@rdf:about";
      rr:class bf:Work
   ];

   rr:predicateObjectMap [
      rr:predicate bf:title;
      rr:objectMap [
         rml:reference "rdaw:P10223";
         rr:termType rr:Literal;
         rml:languageMap [
            rml:reference "rdaw:P10223/@*[namespace-uri()='http://www.w3.org/XML/1998/namespace' and local-name()='lang']"
         ]
      ]
   ].

If this really is just an issue with the XML parser, we will continue writing our XPaths this way whenever we need to reference an element/attribute with an xml: prefix. However, I do see this as an issue with the functionality of the RML Mapper. There are only a few namespaces that are reserved in XML, and I think the mapper should be able to handle these without them being declared.
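
The workaround relies on the fact that namespace-uri() and local-name() compare against the namespace the parser assigned to a node, so the expression itself contains no prefix that has to be resolved. A minimal sketch of that idea with plain JAXP, outside the mapper, follows. The class name is hypothetical, and the element step also uses local-name() here so that no NamespaceContext is needed at all, which differs slightly from the map above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the RMLMapper code base.
class PrefixFreeXPathDemo {
    // Sample data from this thread, without an xmlns:xml declaration.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
      + "<rdf:RDF xmlns:rdaw=\"http://rdaregistry.info/Elements/w/\""
      + " xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
      + "<rdf:Description rdf:about=\"http://example.org/example\">"
      + "<rdaw:P10223 xml:lang=\"ru\">Zapiski iz mertvogo doma</rdaw:P10223>"
      + "</rdf:Description></rdf:RDF>";

    // Prefix-free variant of the language reference used in the workaround.
    static final String LANG_XPATH =
        "//*[local-name()='P10223']"
      + "/@*[namespace-uri()='http://www.w3.org/XML/1998/namespace'"
      + " and local-name()='lang']";

    // Evaluates an XPath expression against the XML and returns the string
    // value of the first matching node (empty string if nothing matches).
    static String evaluate(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // No NamespaceContext is set: the expression contains no prefixes.
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate(SAMPLE, LANG_XPATH));
    }
}
```

The trade-off is readability: the expression is much longer than @xml:lang, but it does not depend on which prefixes the engine has registered.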

DylanVanAssche commented 2 years ago

Hi @mcm104 !

I think I know what happened, given that it worked in previous versions. Recently (4.13.0+), support was added to read the XML namespaces declared in a file and register them upfront with the XPath engine. However, this assumes that the file declares every prefix it uses. Since the xml: prefix is reserved and normally left undeclared, we probably need to register it manually in the RMLMapper so that the XPath engine knows about it.

This is the entry from the changelog:

XML parsing: allow parsing of fully namespaced xml by injecting xml source's namespaces in the XPath compiler (see issue 134)
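
To illustrate the idea of registering the reserved prefix by hand, here is a rough sketch using the standard javax.xml.namespace.NamespaceContext interface. It is not the actual RMLMapper code: the class name, the constructor and the hard-coded prefix map are assumptions made for the example. The context is filled with the prefixes declared in the source and always adds the reserved xml and xmlns bindings on top.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class; the real fix in the mapper may look different.
class ReservedAwareNamespaceContext implements NamespaceContext {
    private final Map<String, String> prefixes = new HashMap<>();

    ReservedAwareNamespaceContext(Map<String, String> declaredInSource) {
        prefixes.putAll(declaredInSource);
        // Reserved bindings that XML documents never have to declare.
        prefixes.put(XMLConstants.XML_NS_PREFIX, XMLConstants.XML_NS_URI);
        prefixes.put(XMLConstants.XMLNS_ATTRIBUTE, XMLConstants.XMLNS_ATTRIBUTE_NS_URI);
    }

    @Override
    public String getNamespaceURI(String prefix) {
        if (prefix == null) {
            throw new IllegalArgumentException("prefix must not be null");
        }
        return prefixes.getOrDefault(prefix, XMLConstants.NULL_NS_URI);
    }

    @Override
    public String getPrefix(String namespaceURI) {
        for (Map.Entry<String, String> entry : prefixes.entrySet()) {
            if (entry.getValue().equals(namespaceURI)) {
                return entry.getKey();
            }
        }
        return null;
    }

    // Simplification: reports at most one prefix per namespace URI.
    @Override
    public Iterator<String> getPrefixes(String namespaceURI) {
        String prefix = getPrefix(namespaceURI);
        return prefix == null
            ? Collections.<String>emptyIterator()
            : Collections.singleton(prefix).iterator();
    }

    // Evaluates the prefixed reference from this thread against the XML.
    static String titleLanguage(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // Assumption: these are the prefixes collected from the source file.
            Map<String, String> declared = new HashMap<>();
            declared.put("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#");
            declared.put("rdaw", "http://rdaregistry.info/Elements/w/");
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new ReservedAwareNamespaceContext(declared));
            return xpath.evaluate(
                "/rdf:RDF/rdf:Description/rdaw:P10223/@xml:lang", doc);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }
}
```

With a context like this, a reference such as rdaw:P10223/@xml:lang should compile even when the source file never declares xmlns:xml.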

As soon as there's a fix, we will get back to you :)

DylanVanAssche commented 2 years ago

@mcm104 Hotfix is available on the development branch: https://github.com/RMLio/rmlmapper-java/tree/development

I tested it with your files and it works correctly again. The regression was introduced with the feature I mentioned above.

mcm104 commented 2 years ago

Thank you so much! Finally got a chance to test this for myself, and it's working perfectly!

DylanVanAssche commented 2 years ago

Fixed in 4.15.0, closing.