RDF literal representation

Anniepoo commented 8 years ago

Jan put this in the README, and our experience writing commercial code based on the RDF libs shows it's definitely a sore spot. There are too many ways a literal can be represented.

I think this discussion has already begun; I'm just adding it to the roadmap list.

wouterbeek commented 8 years ago

The matter of (1) representing RDF literals ties in with the matter of (2) interpreting them (so that you can use their values to do calculations in Prolog) and the matter of (3) canonically serializing them (since using canonical serializations gives you tremendous computational benefits, such as the Unique Name Assumption, UNA).

1. Representation

There are currently 3 compound term representations for RDF literals:

(1) literal(type(D,Lex))
(2) literal(lang(Lang,Lex))
(3) literal(Lex)

1.1 Simple literals

Variant 3 used to be called a "simple literal" in RDF 1.0. In RDF 1.1 it is redefined as an (implicitly typed) XSD string. Variant 3 can be trivially rewritten in terms of 1 according to [1]. See this Semweb issue for more details.

[1] literal(Lex) ==> literal(type(xsd:string,Lex))
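
As a minimal sketch of rewrite rule [1], the three shapes can be normalised to two with a small helper. The predicate name normalize_literal/2 is hypothetical and not part of Semweb, and xsd:string is kept as the plain Prolog term xsd:string rather than an expanded IRI.

```prolog
% normalize_literal(+Literal0, -Literal)
% Hypothetical helper: typed and language-tagged literals pass through
% unchanged; a bare literal(Lex) is rewritten to an explicit xsd:string.
normalize_literal(literal(type(D, Lex)), literal(type(D, Lex))) :- !.
normalize_literal(literal(lang(Lang, Lex)), literal(lang(Lang, Lex))) :- !.
normalize_literal(literal(Lex), literal(type(xsd:string, Lex))).
```

With this in place, the query normalize_literal(literal(hello), L) binds L to literal(type(xsd:string, hello)), so downstream code only has to anticipate two shapes.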

1.2 Language-tagged strings

Variant 2 cannot be reduced to / expressed in terms of variant 1 because, according to the standard, the value of a language-tagged string is not determined by its lexical form Lex alone (language-tagged strings are idiosyncratic in this respect).

According to the standard the values of language-tagged strings are pairs of their lexical form and language tag in that order. It would therefore be more in line with RDF 1.1 to write literal(lang(Lex,Lang)). This ties in with topic 1.4 since in Trig/Turtle you also write "Lex"@Lang rather than Lang@"Lex".

1.3 Non-atomic lexical forms

According to the RDF 1.1 specification, the lexical form must always be a serialization of a datatyped value. In semweb/rdf_db it is possible to assert non-atom lexical forms in the object term. Currently this is even possible if the value does not belong to the value space of RDF datatype D, e.g., literal(type(xsd:integer,1.1)). Another problem is that the programmer, in order to be able to process all triples, must anticipate atomic and non-atomic values. See the corresponding Semweb issue for more details.

My solution in plRdf is to allow a user to assert common Prolog values directly but have them always be asserted in an appropriate datatype. This is not perfect: if the user intends to assert 1 as a non-negative integer then this cannot be deduced automatically. One therefore chooses one of the high-level datatypes (xsd:integer in this case) that are still correct but possibly too generic from the user's point of view. This is currently implemented in predicate rdf_assert_literal_pl(+S, +P, +Value), which heuristically finds an appropriate datatype for Prolog value Value. It currently supports atom, date/3, date/9, time/3, float, HTML DOM, integer, rational, string, XML DOM and pair of atoms (where the former denotes a language tag and the latter denotes a text string). Adding a hook to such a predicate will allow users to map domain datatypes automatically.
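
The heuristic can be sketched roughly as follows. This is not plRdf's actual code: value_datatype/2 and the hook user_value_datatype/2 are hypothetical names, only a few of the listed Prolog types are covered, and the choices of xsd:double for floats and xsd:date for date/3 are assumptions made for illustration.

```prolog
% user_value_datatype(+Value, -Datatype)
% Hypothetical hook: users add clauses to map domain values to datatypes.
:- multifile user_value_datatype/2.
:- dynamic user_value_datatype/2.

% value_datatype(+Value, -Datatype)
% Pick a generic but correct datatype for a Prolog value.
% The hook is consulted first so that users can override the defaults.
value_datatype(Value, D) :-
    user_value_datatype(Value, D), !.
value_datatype(Value, xsd:integer) :-
    integer(Value), !.
value_datatype(Value, xsd:double) :-        % assumption: floats map to xsd:double
    float(Value), !.
value_datatype(Value, xsd:string) :-
    (   atom(Value) ; string(Value) ), !.
value_datatype(date(_, _, _), xsd:date) :-  % assumption: date/3 maps to xsd:date
    !.
value_datatype(Lang-Lex, rdf:langString) :- % pair of atoms: language tag, text
    atom(Lang),
    atom(Lex).
```

Note that value_datatype(1, D) yields xsd:integer, never xsd:nonNegativeInteger: the value alone does not reveal the more specific intent.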

An extension to the above could look for [2] in order to assert the more specific [4] upon the user calling [3]. This may however cause problems in case [2] is dynamically asserted/retracted.

[2] rdf(P, rdfs:range, xsd:nonNegativeInteger)
[3] rdf_assert_literal_pl(S, P, 1)
[4] rdf(S, P, literal(type(xsd:nonNegativeInteger, 1)))
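
A rough sketch of this range-driven extension, under simplifying assumptions: triple/3 is a stand-in for rdf/3 so that the example runs without Semweb, assert_literal_pl/3 is a hypothetical simplification that handles integers only, and ex:age is an invented property.

```prolog
% triple(?S, ?P, ?O): stand-in for rdf/3 (assumption, not the Semweb store).
:- dynamic triple/3.

% Schema triple corresponding to [2]; ex:age is a made-up property.
triple(ex:age, rdfs:range, xsd:nonNegativeInteger).

% assert_literal_pl(+S, +P, +Value)
% Use the declared range of P as datatype if there is one;
% otherwise fall back to the generic xsd:integer.
assert_literal_pl(S, P, Value) :-
    integer(Value),
    (   triple(P, rdfs:range, D)
    ->  true
    ;   D = xsd:integer
    ),
    assertz(triple(S, P, literal(type(D, Value)))).
```

The datatype is fixed at assertion time, which illustrates the caveat about dynamic changes: if the range triple is retracted or added later, literals asserted earlier keep whatever datatype was chosen then. A fuller version would also check that Value lies in the value space of D.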

1.4 Syntactic sugar

There was also an interesting idea by Jan to allow users to write variant (1) as "Lex"^^<D> and variant (2) as "Lex"@Lang. This is the Trig/Turtle syntax that many people in the Semantic Web field like. I personally prefer the Prolog compound term representation but I don't think there is anything wrong with having rewrite rules from Trig/Turtle to Prolog here. The question is of course whether users will expect more support for Trig/Turtle syntactic sugar or whether this will be a one-off thing for RDF literals only, e.g., [5].

[5] <http://my_graph> { _:1 <http://example.org/this/is/> "Trig notation for a triple."@en }
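
Such rewrite rules could look roughly like the sketch below. The predicate sugar_literal/2 is hypothetical, and the operator priorities (110 for @, 650 for ^^) are an assumption chosen so that a term like Lex^^xsd:integer parses with xsd:integer as the right-hand argument. The datatype is written as a Prolog term, since the angle-bracket IRI notation <D> is not valid Prolog syntax.

```prolog
% Operators for the Turtle-like notation (priorities are an assumption).
:- op(110, xfx, @).
:- op(650, xfx, ^^).

% sugar_literal(+Sugar, -Literal)
% Hypothetical rewrite from Turtle-style notation to the compound terms
% of variants (1) and (2).
sugar_literal(Lex^^D, literal(type(D, Lex))).
sugar_literal(Lex@Lang, literal(lang(Lang, Lex))).
```

For example, sugar_literal(hello@en, L) binds L to literal(lang(en, hello)). Note how the argument order flips: the sugar follows Turtle (lexical form first), while variant (2) puts the language tag first.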

2. Interpretation

Orthogonal to the discussion about RDF literal representation is the discussion about RDF literal interpretation. Currently rdf_literal_value/2 interprets decimals in the wrong way. See this Semweb issue for more details.

plRdf's predicate rdf_literal(?S, ?P, ?D, ?Value) solves this by allowing the user to directly match datatypes D and interpreted Prolog values. The user never sees the lexical form, which is only of interest when exporting or sharing the data.

3. Canonical representation

There is a tremendous benefit to knowing a dataset has all its literals canonically represented. Every RDF datatype comes with such a value-to-canonical-lexical mapping. Most of these mappings (the notable exceptions are XSD float and XSD double) have been implemented in plXsd.
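
To illustrate what such a mapping buys, here is a minimal sketch for xsd:integer only. It is not plXsd's implementation; integer_lexical_value/2 and canonical_integer_lex/2 are hypothetical names. The lexical space of xsd:integer is an optional sign followed by decimal digits, and the canonical form has no leading zeros and no plus sign.

```prolog
:- use_module(library(lists)).

% integer_lexical_value(+Lex, -Value)
% Hypothetical lexical-to-value mapping for xsd:integer:
% an optional sign followed by one or more ASCII decimal digits.
integer_lexical_value(Lex, Value) :-
    atom_codes(Lex, Codes),
    (   Codes = [0'+|Digits] -> Sign = 1
    ;   Codes = [0'-|Digits] -> Sign = -1
    ;   Digits = Codes, Sign = 1
    ),
    Digits = [_|_],                                   % at least one digit
    forall(member(C, Digits), between(0'0, 0'9, C)),  % digits only
    number_codes(Magnitude, Digits),
    Value is Sign * Magnitude.

% canonical_integer_lex(+Lex0, -Lex)
% Map any legal lexical form to the canonical one via the value.
canonical_integer_lex(Lex0, Lex) :-
    integer_lexical_value(Lex0, Value),
    atom_number(Lex, Value).
```

Here '+007', '07' and '7' all map to the canonical atom '7'. Once every literal in a dataset is stored canonically, two literals of the same datatype denote the same value exactly when their lexical forms are identical, so value comparison reduces to term comparison.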

wouterbeek commented 8 years ago

These issues are currently treated in a proposal for redesigning Semweb: https://github.com/SWI-Prolog/packages-semweb/wiki/Proposal-for-Semweb-library-redesign

wouterbeek commented 7 years ago

Almost all of this is currently implemented in library(semweb/rdf11). The only exception is decimal support, but this is a minor issue for most people.
