RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Literal should always have a datatype #1326

Open white-gecko opened 3 years ago

white-gecko commented 3 years ago

According to RDF 1.1 (https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal)

A literal in an RDF graph consists of two or three elements:

  • a lexical form, being a Unicode [UNICODE] string, which SHOULD be in Normal Form C [NFC],
  • a datatype IRI, being an IRI identifying a datatype that determines how the lexical form maps to a literal value, and
  • if and only if the datatype IRI is http://www.w3.org/1999/02/22-rdf-syntax-ns#langString, a non-empty language tag as defined by [BCP47]. The language tag MUST be well-formed according to section 2.2.9 of [BCP47].

A literal is a language-tagged string if the third element is present. Lexical representations of language tags MAY be converted to lower case. The value space of language tags is always in lower case.

Please note that concrete syntaxes MAY support simple literals consisting of only a lexical form without any datatype IRI or language tag. Simple literals are syntactic sugar for abstract syntax literals with the datatype IRI http://www.w3.org/2001/XMLSchema#string. Similarly, most concrete syntaxes represent language-tagged strings without the datatype IRI because it always equals http://www.w3.org/1999/02/22-rdf-syntax-ns#langString.

So the datatype of Literals should always be set.

If both a language and a datatype are specified, it must be ensured that the datatype is http://www.w3.org/1999/02/22-rdf-syntax-ns#langString
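The rule quoted above can be sketched in plain Python, independent of RDFLib. This is only an illustration of the RDF 1.1 abstract syntax, not RDFLib's implementation; `normalize_literal` and the tuple representation are hypothetical names introduced here.

```python
from typing import Optional, Tuple

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


def normalize_literal(
    lexical: str, datatype: Optional[str] = None, lang: Optional[str] = None
) -> Tuple[str, str, Optional[str]]:
    """Hypothetical helper: return (lexical form, datatype IRI, language tag)
    so that every literal carries a datatype, as RDF 1.1 describes."""
    if lang is not None:
        if lang == "":
            raise ValueError("language tag must be non-empty")
        # A language tag is only allowed together with rdf:langString.
        if datatype not in (None, RDF_LANGSTRING):
            raise ValueError("language-tagged literals must use rdf:langString")
        # The value space of language tags is lower case.
        return (lexical, RDF_LANGSTRING, lang.lower())
    if datatype == RDF_LANGSTRING:
        raise ValueError("rdf:langString requires a language tag")
    # A simple literal is syntactic sugar for an xsd:string literal.
    return (lexical, datatype or XSD_STRING, None)


implicit = normalize_literal("")
explicit = normalize_literal("", datatype=XSD_STRING)
print(implicit == explicit)       # True
print(len({implicit, explicit}))  # 1
```

Under this normalization, a simple literal and its explicit xsd:string counterpart compare equal and collapse to a single entry in a set.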

See also #670

kouralex commented 3 years ago

I presume that always setting the datatype would fix an issue I just encountered: duplicate literal entries (one implicit, one explicit) gathered from multiple files. Minimal example (RDFLib version: 5.0.0):

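import sys
from rdflib import Graph

# The same empty string appears twice: once with an explicit xsd:string datatype, once as a simple literal.
g = Graph().parse(format='ttl', data='<http://a> <http://b> ""^^<http://www.w3.org/2001/XMLSchema#string>,  "" .')
# Write the graph back out as Turtle.
g.serialize(format="ttl", destination=sys.stdout.buffer)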

Output:

@prefix ns1: <http://> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ns1:a ns1:b "",
        ""^^xsd:string .

Expected:

@prefix ns1: <http://> .

ns1:a ns1:b "" .

As a workaround for now, I normalized the data before parsing it with RDFLib by dropping the explicit ^^xsd:string part from all triples with sed.
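A comparable preprocessing step can be sketched in Python instead of sed. The exact sed command was not given, so the pattern below is an assumption, and `strip_explicit_xsd_string` is a hypothetical helper. It is a naive textual replacement: it would also rewrite a matching sequence that happens to occur inside a string literal, and it assumes the `xsd:` prefix is bound to the XML Schema namespace.

```python
import re

# Matches the explicit datatype suffix in full-IRI or prefixed form (assumed pattern).
_XSD_STRING_SUFFIX = re.compile(
    r"\^\^(?:<http://www\.w3\.org/2001/XMLSchema#string>|xsd:string\b)"
)


def strip_explicit_xsd_string(turtle_text: str) -> str:
    """Hypothetical helper: drop explicit ^^xsd:string annotations from Turtle text."""
    return _XSD_STRING_SUFFIX.sub("", turtle_text)


data = '<http://a> <http://b> ""^^<http://www.w3.org/2001/XMLSchema#string>,  "" .'
print(strip_explicit_xsd_string(data))
```

After this step both objects in the example have the same written form, so a parser that keeps simple literals without a datatype sees only one distinct literal.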

Since closing this issue will require touching those parts of the code anyway, I was wondering whether graph serialization should get a configurable parameter for choosing between the implicit and the explicit form. What do you think?
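To make the question concrete, here is a minimal sketch of what such a switch could look like for a single literal. This is not an existing RDFLib option; `literal_to_turtle` and its `explicit_string` parameter are hypothetical, and the escaping covers only a few common characters.

```python
from typing import Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


def _escape(lexical: str) -> str:
    # Minimal escaping for a double-quoted Turtle string (not exhaustive).
    return (
        lexical.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def literal_to_turtle(
    lexical: str,
    datatype: Optional[str] = None,
    lang: Optional[str] = None,
    explicit_string: bool = False,  # hypothetical switch, not an RDFLib parameter
) -> str:
    """Hypothetical helper: write one literal in Turtle syntax."""
    quoted = f'"{_escape(lexical)}"'
    if lang is not None:
        # Language-tagged strings are written without their datatype IRI.
        return f"{quoted}@{lang}"
    if datatype in (None, XSD_STRING):
        # Implicit and explicit xsd:string are treated as the same literal;
        # the switch only controls how it is written.
        return f"{quoted}^^<{XSD_STRING}>" if explicit_string else quoted
    return f"{quoted}^^<{datatype}>"


print(literal_to_turtle(""))
print(literal_to_turtle("", XSD_STRING, explicit_string=True))
```

The design choice here is that the switch affects only the output form; whether a literal was read with or without an explicit datatype makes no difference to what is written.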

aucampia commented 1 year ago

https://github.com/RDFLib/rdflib/issues/2123#issuecomment-1475448693

One option to solve this is to enforce that rdflib.term.Literal always has a datatype, but then we won't be able to support RDF 1.0 anymore. I'm somewhat okay with this: it would be nice to support both 1.0 and 1.1, but I think 1.1 support is more important.