RMLio / rmlmapper-java

The RMLMapper executes RML rules to generate high quality Linked Data from multiple originally (semi-)structured data sources
http://rml.io
MIT License

Parse XML with namespace #132

Closed florent-andre closed 2 years ago

florent-andre commented 2 years ago

Hello, and first of all, thanks for this promising tool set. I hope this is the right channel and repository for my question.

I am trying to map an XML file whose nodes are namespaced (an XSD-type file). When I remove the namespaces from my source file, the test triples are generated. But when I restore the namespaces in the XML file and add the xsd: prefix to the XPath expressions, I get an empty set of triples.

As I found no example of parsing "XML with namespaces", I am wondering how to do this.

Here is the example I am trying to tackle; it could be added to Matey. Thanks for your help, regards

DylanVanAssche commented 2 years ago

Hi! Thanks for reaching out and for using our tools!

Unfortunately, this is a long-standing issue that we haven't been able to resolve properly. The predecessor of rmlmapper-java had two issues about this, but neither reached a proper resolution. If you have any feedback on how to resolve this properly (in the mapping rules, with a CLI parameter, etc.), feel free to comment below! We would love to have some feedback on this.

florent-andre commented 2 years ago

Hmm... maybe extract the source's xmlns declarations and reuse them in the XPath call? This requires well-formed XML, but that is the minimum... I don't know how configurable the XPath interpreter is, but passing it the source's xmlns declarations should be doable.

I think this gives a better "mapping man" experience than the declarative approach that the CARML implementation seems to take:

carml:declaresNamespace [
        carml:namespacePrefix "edxl-cap" ;
        carml:namespaceName "http://release.niem.gov/niem/adapters/edxl-cap/3.0/" ;

Linked to kg-construct/rml-fno-spec#9
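
The idea of extracting the source's xmlns declarations and reusing them in the XPath call can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code of rmlmapper-java: the class name `SourceNamespaces` and its helper methods are hypothetical, and only the declarations on the root element are collected.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;

// Hypothetical helper class, for illustration only.
class SourceNamespaces {

    /** Collects the prefix -> URI pairs declared as xmlns:prefix on the root element. */
    static Map<String, String> declaredOnRoot(Document doc) {
        Map<String, String> prefixes = new HashMap<>();
        NamedNodeMap attrs = doc.getDocumentElement().getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Attr attr = (Attr) attrs.item(i);
            // With a namespace-aware parser, xmlns:foo has prefix "xmlns" and local name "foo".
            if ("xmlns".equals(attr.getPrefix())) {
                prefixes.put(attr.getLocalName(), attr.getValue());
            }
        }
        return prefixes;
    }

    /** Wraps the collected prefixes in the NamespaceContext that XPath expects. */
    static NamespaceContext contextFor(Map<String, String> prefixes) {
        return new NamespaceContext() {
            @Override
            public String getNamespaceURI(String prefix) {
                return prefixes.getOrDefault(prefix, XMLConstants.NULL_NS_URI);
            }

            @Override
            public String getPrefix(String uri) {
                return null; // reverse lookup is not needed for this sketch
            }

            @Override
            public Iterator<String> getPrefixes(String uri) {
                return Collections.emptyIterator();
            }
        };
    }

    /** Parses the XML namespace-aware and evaluates the expression with the source's prefixes. */
    static String evaluate(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            xPath.setNamespaceContext(contextFor(declaredOnRoot(doc)));
            return xPath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">"
                + "<xsd:element name=\"person\"/></xsd:schema>";
        System.out.println(evaluate(xml, "/xsd:schema/xsd:element/@name")); // prints "person"
    }
}
```

Declarations that appear only on nested elements, and the default namespace (which has no prefix), are deliberately not handled here.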

DylanVanAssche commented 2 years ago

Hmm... maybe extract the source's xmlns declarations and reuse them in the XPath call?

That might be a way to work around this problem; we always welcome PRs to help out!

In the meantime, I brought this to the attention of the W3C Community Group that is working on RML and other mapping languages, with the aim of a standard like R2RML for transforming heterogeneous data into RDF; see kg-construct/rml-target-source-spec#4

florent-andre commented 2 years ago

Hi, I can try to have a look, but it has been a long time since I last worked with Java, so any guidance on the classes involved would be appreciated.

DylanVanAssche commented 2 years ago

Hi @florent-andre, sure! Happy to assist you :) To extend the XPath extractor, you probably want to look at the getDocumentFromStream method of XMLRecordFactory. There you can configure the DocumentBuilderFactory.

You can also read the InputStream argument there already to look for XML namespaces.
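
As a rough illustration of what configuring the DocumentBuilderFactory changes, the sketch below parses the same input with and without namespace awareness. The class and method names are hypothetical stand-ins and not the actual code of XMLRecordFactory; only the standard JAXP calls are real.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical stand-in, for illustration only.
class NamespaceAwareParsing {

    /** Parses a stream into a DOM document, optionally namespace-aware. */
    static Document documentFromStream(InputStream stream, boolean namespaceAware) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The factory is NOT namespace-aware by default, so this must be set explicitly.
            factory.setNamespaceAware(namespaceAware);
            return factory.newDocumentBuilder().parse(stream);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Returns the namespace URI of the root element, or null if the parser did not resolve one. */
    static String rootNamespace(String xml, boolean namespaceAware) {
        InputStream stream = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        return documentFromStream(stream, namespaceAware).getDocumentElement().getNamespaceURI();
    }

    public static void main(String[] args) {
        String xml = "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\"/>";
        System.out.println(rootNamespace(xml, false)); // prints "null"
        System.out.println(rootNamespace(xml, true));  // prints "http://www.w3.org/2001/XMLSchema"
    }
}
```

Without namespace awareness the DOM carries no namespace URIs, so a prefixed XPath expression has nothing to match against.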

florent-andre commented 2 years ago

@DylanVanAssche please find a PR that solves namespaced XPath.

Please note that this fix works for fully namespaced trees. If the XML mixes namespaced and non-namespaced elements, this should be explored further. See this document for details: "even the default namespace is a namespace, and thus matching names have to be prefixed in XPath".
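
The quoted remark about the default namespace can be illustrated with a small sketch. The class name, the example document, and the prefix `c` are all assumptions made for this example; the point is only that an unprefixed XPath step does not match elements that sit in a default namespace.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// Hypothetical example class, for illustration only.
class DefaultNamespaceXPath {

    /** Evaluates an XPath expression with a single prefix bound to the given namespace URI. */
    static String evaluate(String xml, String expression, String prefix, String uri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            xPath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String p) {
                    return prefix.equals(p) ? uri : XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String u) {
                    return uri.equals(u) ? prefix : null;
                }

                @Override
                public Iterator<String> getPrefixes(String u) {
                    return uri.equals(u)
                            ? Collections.singleton(prefix).iterator()
                            : Collections.<String>emptyIterator();
                }
            });
            return xPath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String ns = "http://example.org/catalog"; // made-up namespace for the example
        String xml = "<catalog xmlns=\"" + ns + "\"><item>A</item></catalog>";
        // Unprefixed names select elements in no namespace, so nothing matches.
        System.out.println("[" + evaluate(xml, "/catalog/item", "c", ns) + "]");     // prints "[]"
        // Binding any prefix to the default namespace URI makes the elements reachable.
        System.out.println("[" + evaluate(xml, "/c:catalog/c:item", "c", ns) + "]"); // prints "[A]"
    }
}
```

The prefix used in the expression does not have to appear in the source document; only the namespace URI it is bound to matters.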

Another remark: what do you think about creating an xPathSingleton that provides the XPath object, instead of creating multiple instances of it in XMLRecord and XMLRecordFactory:

XPath xPath = XPathFactory.newInstance().newXPath();
xPath.setNamespaceContext(new NamespaceResolver(document));

DylanVanAssche commented 2 years ago

@florent-andre Thanks for the PR! I will have a look next week :)

If the XML mixes namespaced and non-namespaced elements, this should be explored further. See this document for details: "even the default namespace is a namespace, and thus matching names have to be prefixed in XPath".

I'm not that familiar with XML namespaces, but I think this PR is a good start in general; we can just add a TODO comment noting that this case has not been explored.

What do you think about creating an xPathSingleton that provides the XPath object, instead of creating multiple instances of it in XMLRecord and XMLRecordFactory:

That would actually be better I think... Feel free to try it :) As long as the test cases still pass after this change it is fine.

pheyvaer commented 2 years ago

Regarding avoiding the creation of multiple XPath objects: I would strongly advise against using the Singleton pattern, especially because it complicates testing.
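
One possible alternative to a Singleton, sketched here purely as an assumption and not as what the PR or rmlmapper-java does, is to hand each consumer a supplier of XPath objects through its constructor. The names `XPathInjection` and `Extractor` are hypothetical. Tests can then pass in their own supplier without touching global state.

```java
import java.io.StringReader;
import java.util.function.Supplier;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical example class, for illustration only.
class XPathInjection {

    /** Consumer that receives its XPath source instead of looking up a global instance. */
    static final class Extractor {
        private final Supplier<XPath> xPaths;

        Extractor(Supplier<XPath> xPaths) {
            this.xPaths = xPaths;
        }

        /** Evaluates the expression against the given XML text and returns the string result. */
        String extract(String expression, String xml) {
            try {
                // A fresh XPath per call also avoids sharing an object that is not thread-safe.
                return xPaths.get().evaluate(expression, new InputSource(new StringReader(xml)));
            } catch (XPathExpressionException e) {
                throw new IllegalArgumentException("Invalid XPath: " + expression, e);
            }
        }
    }

    public static void main(String[] args) {
        Extractor extractor = new Extractor(() -> XPathFactory.newInstance().newXPath());
        System.out.println(extractor.extract("/a/b", "<a><b>hi</b></a>")); // prints "hi"
    }
}
```

The supplier is the single place where namespace configuration would be applied, so the construction logic is still written only once.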

florent-andre commented 2 years ago

I get your point about the Singleton. The current PR doesn't implement a Singleton, and the "non-dependent tests" pass.

florent-andre commented 2 years ago

As this PR was merged, I am closing this issue. Thanks, guys, for building and maintaining this lib!