RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License
2.18k stars 558 forks source link

It is possible to create invalid RDF with rdflib #895

Closed nbaksalyar closed 4 years ago

nbaksalyar commented 5 years ago

I was able to successfully create the following RDF graph:

from rdflib import Graph, URIRef, Literal
g = Graph()
g.add((Literal('A'), Literal('B'), Literal('C')))

It is serialised into Turtle as follows:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<...>
"Hi" "Hello" "Bye" .

And according to the RDFLib code it is a valid triple. However, according to the RDF specification it is not:

An RDF triple consists of three components:

*    the subject, which is an IRI or a blank node
*    the predicate, which is an IRI
*    the object, which is an IRI, a literal or a blank node

And indeed when the output Turtle is validated with the W3C validator, it produces this error:

Expected subject but got literal at line 6.

Hence, this bit in the README file is also incorrect:

The components of the triples are URIs (resources) or Literals (values),

I think this issue deserves a further discussion, and please let me know if I can fix this and raise a PR.

prince17080 commented 4 years ago

@nicholascar Is this issue still open? I'm thinking to solve this issue by adding assert statements in the the add function.

white-gecko commented 4 years ago

Adding assert statements in production code is no good idea. They do not work if optimization is enabled. https://docs.python.org/3/reference/simple_stmts.html

white-gecko commented 4 years ago

Maybe this is also a cool feature to use the RDF lib to work with generalised RDF (sorry can't find the reference now on my mobile, but see also https://www.w3.org/2001/sw/wiki/Literals_as_Subjects). However, the serializers should only create output valid for the respective format.

tgbugs commented 4 years ago

There isn't a good way to catch issues like this at runtime without taking a performance hit. If this is implemented in rdflib core, then maybe it could be as an add_safe function or similar, that could check for these kinds of things. There are even more insidious variants where some python type will sneak in to a literal and only cause issues when trying to sort entities somewhere else in the code base. Here is an example of I deal with this https://github.com/SciCrunch/sparc-curation/blob/f02a0cd1cbc27699a6525966b4d94b1111e14295/sparcur/export/triples.py#L90-L101 (this particular version doesn't catch the fact that literals are not valid subjects, but could be added).

One solution to these kinds of issues is a static type system. I'm not sure if python's type annotations would help here, but the might be a minimally obtrusive way to add this functionality for those who want it.

gromgull commented 4 years ago

My feeling is maybe this should be caught at serialisation time, like we do with non rdf/xml serialisable URIs etc.

nicholascar commented 4 years ago

@prince17080 obviously still open! Sorry, we didn't get to handle this for 5.0.0.

@gromgull not sure this should only be handles "on the way out" at serlialization time. I think this really should be handled "on the way in" at creation, my general feeling being, if the spec (RDF) doesn't allow something, we shouldn't allow it in the Python Graph object.

We don't tolerate this:

g.add((URIRef('A:'), URIRef('B:'), 25))

AssertionError: Object 25 must be an rdflib term

So we shouldn't tolerate:

g.add((Literal('A'), Literal('B'), Literal('C')))

Unless there's some very good code speed reason etc. that prevents us from not tolerating it.

Adding assert statements in production code is no good idea.

Clearly we have a bunch of them in there already...

white-gecko commented 4 years ago

A generalized RDF triple is a triple having a subject, a predicate, and object, where each can be an IRI, a blank node or a literal. A generalized RDF graph is a set of generalized RDF triples. A generalized RDF dataset comprises a distinguished generalized RDF graph, and zero or more pairs each associating an IRI, a blank node or a literal to a generalized RDF graph. (https://www.w3.org/TR/rdf11-concepts/#dfn-generalized-rdf-dataset)

But there is also a note:

Any users of generalized RDF triples, graphs or datasets need to be aware that these notions are non-standard extensions of RDF and their use may cause interoperability problems. There is no requirement on the part of any RDF tool to accept, process, or produce anything beyond standard RDF triples, graphs, and datasets.

In SPARQL 1.1, Literals are allowed in the subject position.

https://www.w3.org/TR/sparql11-query/#construct https://www.w3.org/TR/sparql11-query/#sparqlTriplePatterns

JSON-LD says:

However, JSON-LD also extends the RDF data model to optionally allow JSON-LD to serialize generalized RDF Datasets. (https://www.w3.org/TR/json-ld/#relationship-to-rdf)

They also put up a note:

The use of blank node identifiers to label properties is obsolete, and may be removed in a future version of JSON-LD, as is the support for generalized RDF Datasets.

As a conclusion we don't need to support it, but we can. As it is currently possible, whhat is a good reason to remove it from the Graph, if we can also ensure it for the serializers.

Adding assert statements in production code is no good idea.

Clearly we have a bunch of them in there already...

Maybe we should see if we can remove them in the future.

prince17080 commented 4 years ago

@nicholascar @white-gecko @tgbugs So, I guess the conclusion from the above discussion can be adding a new function add_safe is the best solution with assert statements.

nicholascar commented 4 years ago

@prince17080 indeed there is the suggestion to introduce add_safe and a supporter of that so you could perhaps create a PR for such a function.

Not sure I would use it much though - like @tgbugs I check my entries to a Graph() in my own application code rather than throwing inserts at add and then having it validate the values - but happy to have add_safe proposed if others have use cases motivating such a function.

oohlaf commented 4 years ago

There's check_statement in util. A wrapper would be nothing more than a call to check_statement.

nicholascar commented 4 years ago

Thanks @oohlaf, really useful tip!

white-gecko commented 4 years ago

I close this issue now. Maybe someone finds a place for the hint of @oohlaf in the documentation or someone should ask and answer it on https://stackoverflow.com/questions/tagged/rdflib .