apache / jena

Apache Jena
https://jena.apache.org/
Apache License 2.0

GH-2557: RDF term normalization #2564

Closed afs closed 2 days ago

afs commented 3 days ago

GitHub issue resolved #2557

Pull request Description:

Normalization mostly concerns literals whose datatype allows several lexical forms for the same value. Numbers are the major case: "1", "001", and "+1" are all ways to write the xsd:integer value 1.
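To make the integer case concrete, here is a minimal sketch in plain Java. It is not the code in this pull request: the class and method names are hypothetical, and it only assumes that the canonical form of an xsd:integer has no "+" sign and no leading zeros.

```java
import java.math.BigInteger;

// Illustrative sketch only; not Jena's normalization code.
// The class and method names are hypothetical.
class IntegerNormalizationSketch {

    /** Returns a canonical xsd:integer lexical form: no '+' sign, no leading zeros. */
    static String canonicalInteger(String lexicalForm) {
        // BigInteger accepts an optional leading '+' or '-' and leading zeros,
        // and prints the value in its shortest decimal form.
        return new BigInteger(lexicalForm).toString();
    }

    public static void main(String[] args) {
        // Three lexical forms that denote the same xsd:integer value.
        for (String lex : new String[] {"1", "001", "+1"}) {
            System.out.println("\"" + lex + "\" -> \"" + canonicalInteger(lex) + "\"");
        }
    }
}
```

All three inputs map to the same lexical form, which is the property normalization is meant to provide.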

There is now a framework for normalization, with three implementations: the ARQ choices, XSD 1.0, and XSD 1.1. The two XSD versions differ in a few places: the canonical form of xsd:decimal for integer values, and XSD 1.1 adds +INF.
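The xsd:decimal difference can be sketched as follows. This is an illustration in plain Java, not the framework added here; the class name, method name, and the boolean switch are hypothetical. It assumes that XSD 1.0 requires a decimal point with at least one digit after it in the canonical form (so the integer value one is "1.0"), while XSD 1.1 writes integer-valued decimals without a decimal point ("1").

```java
import java.math.BigDecimal;

// Illustrative sketch only; not Jena's normalization framework.
// The class name, method name, and 'xsd11' flag are hypothetical.
class DecimalCanonicalSketch {

    /**
     * Returns a canonical xsd:decimal lexical form.
     * Lexical validation is omitted: BigDecimal also accepts exponent
     * notation, which xsd:decimal does not allow.
     */
    static String canonicalDecimal(String lexicalForm, boolean xsd11) {
        // Drop the sign '+', leading zeros, and trailing fractional zeros.
        String plain = new BigDecimal(lexicalForm).stripTrailingZeros().toPlainString();
        boolean isIntegerValue = plain.indexOf('.') < 0;
        // Assumed rule: XSD 1.0 keeps ".0" on integer values, XSD 1.1 does not.
        return (isIntegerValue && !xsd11) ? plain + ".0" : plain;
    }

    public static void main(String[] args) {
        System.out.println(canonicalDecimal("1.0", false)); // XSD 1.0 style
        System.out.println(canonicalDecimal("1.0", true));  // XSD 1.1 style
    }
}
```

Non-integer values such as "1.50" come out the same under both settings; only integer-valued decimals differ.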

The ARQ choices are, as near as possible, the same as in previous versions and align with the effect of adding terms to TDB2 and reading them back again. This was the case up to Jena 5.0.0 but wasn't tested. It is now a contract, so data can be normalized in a way that TDB2 will preserve the exact RDF terms for float and double precision.

Many files are changed because of a rename of NodeFactory.createLiteral(String, datatype) to NodeFactory.createLiteralDT(String, datatype), to align with other 5.0.0 renamings. A deprecated placeholder is left behind.
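The "deprecated placeholder" pattern can be sketched like this. The class below is a hypothetical stand-in, not Jena's NodeFactory: it uses strings in place of nodes and datatypes, and only shows an old method name kept as a deprecated forwarder to the new one.

```java
// Illustrative sketch only; a hypothetical stand-in for a factory class.
// Strings are used in place of real node and datatype objects.
class LiteralFactorySketch {

    /** New name: builds a typed literal, shown here in N-Triples-like syntax. */
    static String createLiteralDT(String lexicalForm, String datatypeIRI) {
        return "\"" + lexicalForm + "\"^^<" + datatypeIRI + ">";
    }

    /**
     * Old name, kept so existing callers still compile.
     * @deprecated Use {@link #createLiteralDT(String, String)} instead.
     */
    @Deprecated
    static String createLiteral(String lexicalForm, String datatypeIRI) {
        // Forward to the renamed method so behavior stays identical.
        return createLiteralDT(lexicalForm, datatypeIRI);
    }
}
```

Callers of the old name get a deprecation warning at compile time but the same result at run time.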


By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.


See the Apache Jena "Contributing" guide.

afs commented 2 days ago

This looks really good

All my comments are just typos/grammar issues in the comments; the actual implementation code all looks fine, as far as I understand the relevant XSD specifications.

Yes - this is spec heavy!

afs commented 2 days ago

Corrections merged - thanks.