Formats that don't support named graphs serialize `Dataset`s and `ConjunctiveGraph`s with non-default graphs without raising any errors

RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.

https://rdflib.readthedocs.org

BSD 3-Clause "New" or "Revised" License

2.17k stars 558 forks source link

Formats that don't support named graphs serialize `Dataset`s and `ConjunctiveGraph`s with non-default graphs without raising any errors #2393

Open aucampia opened 1 year ago

aucampia commented 1 year ago

If RDFLib is used to serialize a Dataset or ConjunctiveGraph that contains non-default graphs [ref] as a format that does not support named graphs (i.e. N-Triples or Turtle) no error is raised, and the output is wrong.

Given this data in test/data/variants/diverse_quads.nq (equivalent to this trig document):

https://github.com/RDFLib/rdflib/blob/ddcc4eb622a000cf991f9c530d55d62115484fca/test/data/variants/diverse_quads.nq#L1-L10

Using rdfpipe to convert it to turtle gives:

$ pipx run --spec git+https://github.com/RDFLib/rdflib.git@main rdfpipe -o turtle test/data/variants/diverse_quads.nq; echo status=$?
⚠️  rdfpipe is already on your PATH and installed at /home/iwana/.local/bin/rdfpipe. Downloading and running anyway.
@prefix ns1: <http://example.com/> .
@prefix ns2: <urn:example:> .
@prefix ns3: <example:> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ns1:subject ns1:predicate "typeless",
        "日本語の表記体系"@jpx .

ns2:subject ns3:predicate ns3:subject ;
    ns2:predicate ns2:object .

ns3:subject ns3:predicate ns3:object,
        12 ;
    ns1:predicate ns1:object,
        "XSD string" .

status=0

And using rdfpipe to convert it to ntriples gives:

$ pipx run --spec git+https://github.com/RDFLib/rdflib.git@main rdfpipe -o ntriples test/data/variants/diverse_quads.nq; echo status=$?
⚠️  rdfpipe is already on your PATH and installed at /home/iwana/.local/bin/rdfpipe. Downloading and running anyway.
/home/iwana/.local/pipx/.cache/bf79805433c3d94/lib64/python3.11/site-packages/rdflib/plugins/serializers/nt.py:40: UserWarning: NTSerializer always uses UTF-8 encoding. Given encoding was: None
  warnings.warn(
<http://example.com/subject> <http://example.com/predicate> "typeless" .
<example:subject> <http://example.com/predicate> <http://example.com/object> .
<urn:example:subject> <example:predicate> <example:subject> .
<example:subject> <example:predicate> "12"^^<http://www.w3.org/2001/XMLSchema#integer> .
<example:subject> <example:predicate> <example:object> .
<http://example.com/subject> <http://example.com/predicate> "日本語の表記体系"@jpx .
<urn:example:subject> <urn:example:predicate> <urn:example:object> .
<example:subject> <http://example.com/predicate> "XSD string" .
status=0

In both cases, the output is wrong, not just incomplete.

I would say the right behaviour is that an error is raised when a ConjunctiveGraph or Dataset with non-default graphs are serialized using a format that does not support named graphs.

I'm making this issue to get some feedback, but I will make a PR to fix it shortly.

There is a similar issue with riot from Jena [ref], but riots behaviour is not quite as bad

aucampia commented 1 year ago

Please let me know if anyone disagrees with the proposed behaviour:

I would say the right behaviour is that an error is raised when a ConjunctiveGraph or Dataset with non-default graphs are serialized using a format that does not support named graphs.

CC: @RDFLib/core-reviewers

WhiteGobo commented 1 year ago

Before wrong output is generated by ConjunctiveGraph.serialize(format=turtle) i would expect an error. Even if Graph.serialize(format=turtle) work.

What would be the correct output? Only the data from the default graph or no data "overwritten" from the default graph? Or is there no (simple) correct turtle output for this conjunctive graph? I ask because i would expect exactly the given output, but i havent worked much with conjunctive graphs.

I lack the correct word for what i mean with overwritten. I mean something like this:

@prefix ns1: <http://example.com/> .
@prefix ns2: <urn:example:> .
@prefix ns3: <example:> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ns1:subject ns1:predicate "typeless".
#The next line would "overwrite" information so it is ignored
#ns1:subject ns1:predicate  "日本語の表記体系"@jpx .

ns2:subject ns2:predicate ns2:object .

ns3:subject ns3:predicate ns3:object;
    ns1:predicate ns1:object,
        "XSD string" .

aucampia commented 1 year ago

What would be the correct output? Only the data from the default graph or no data "overwritten" from the default graph? Or is there no (simple) correct turtle output for this conjunctive graph?

When trying to serialize named graphs using a format that doesn't support named graphs (e.g. turtle) I think the only options are:

Error: I think the right option is to raise an Exception, if you are streaming output, there would still be some output, but at least the caller can tell something went wrong.
Incomplete output: This is what Jena's riot is doing [ref] - it only outputs the triples from the default graph. This is not wrong output, but incomplete output, as the output is a subset of quads.
Incorrect output: This is what RDFLib is doing. It serializes triples from named graphs without a graph label/name. Thus creating entirely new triples in the output that do not exist in the input.

Only the data from the default graph or no data "overwritten" from the default graph? ...
ns1:subject ns1:predicate "typeless".
#The next line would "overwrite" information so it is ignored
#ns1:subject ns1:predicate  "日本語の表記体系"@jpx .

RDF does not work like this, there is no ability for new triples to affect previous triples. If the same subject and predicate appears twice in an input document with a different object, then you just get another triple. If you want to prevent this, you would need to use SHACL or something in your ingestion pipeline, but just RDF itself just treats it as another triple.

https://www.w3.org/TR/rdf11-concepts/#section-rdf-graph

An RDF graph is a set of RDF triples.

So, as long as every part of the triple is unique in the set, it is a unique triple, and does not invalidate another triple with the same subject and predicate.

niklasl commented 1 year ago

Throwing an error is certainly most explicit. And while I presume it is backwards incompatible, if it is wrong right now, it seems better to error out than to change its behaviour.

It might be good to note how the RDF 1.1 Concepts defines content negotiation of RDF datasets:

If an RDF dataset is returned and the consumer is expecting an RDF graph, the consumer is expected to use the RDF dataset's default graph.

It is understandable why e.g Jena has made that decision, and it warrants consideration. But invoking serialization programmatically is not the same as content negotiation, but rather how you'd implement it, so if the user of the API expects an RDF graph, the user is to use the dataset_or_cg.default_context.

(The error should probably be helpful and suggest that. It might also be helpful to add an alias property named default_graph to Dataset, since "context" is non-standard naming for graphs; but that is a separate issue.)

To reason further,, the "role" of named graphs in a dataset may vary (some use the dataset as a union of graphs, some perhaps as versions of descriptions where only an explicitly chosen subset of them are considered "valid" or "active"). So it makes more sense to force an active choice. It could be useful to make it easy to serialize the union as a stream of triples, but that would be an additional feature.

aucampia commented 1 year ago

I posted to the public-rdf-dev mailing list also, https://lists.w3.org/Archives/Public/public-rdf-dev/2023May/0000.html

ghost commented 1 year ago

The RDF spec asserts:

RDF datasets MAY be used to express RDF content. When used in this way, a dataset SHOULD be understood to have at least the same content as its default graph.

If the user has specified a context-unaware serialization format, it’s not unreasonable to treat this as intentional and to return the serialized triples of the default graph because the Dataset default graph has no name¹ and can therefore be considered as selected by the specification of a context-unaware serialization format.

Is it perhaps more usefully viewed as an implementation-independent means of specifying serialization of a Dataset’s unnamed default graph?

¹ “a single unnamed, default RDF graph”

aucampia commented 1 year ago

If the user has specified a context-unaware serialization format, it’s not unreasonable to treat this as intentional and to return the serialized triples of the default graph because the Dataset default graph has no name¹ and can therefore be considered as selected by the specification of a context-unaware serialization format.

On the other hand, someone could have made a mistake in their code, and did not realize that the format at a specific point is or can be context-unaware. In this case, trying to guess what the user meant would be masking a bug in the users code, where as if they really did mean to serialize only the default context, there would have been an easy way for them to do it explicitly.

Is it perhaps more usefully viewed as an implementation-independent means of specifying serialization of a Dataset’s unnamed default graph?

There are already ways to do this much more explicitly and without ambiguity, which users should use instead:

dataset = Dataset()
## add named graphs and triples
dataset.default_context.serialize(format="turtle")

ghost commented 1 year ago

In this case, trying to guess what the user meant

Neither a guess nor an assumption, simply following the direction of the spec.

... an implementation-independent means of specifying serialization of a Dataset’s unnamed default graph?

There are already ways to do this much more explicitly and without ambiguity, which users should use instead:
dataset = Dataset()
## add named graphs and triples
dataset.default_context.serialize(format="turtle")

That's going to break in v7, dataset.default_context will become dataset.default_graph, my point is that the implementation details intrude, so “explicit” is also “brittle”.

aucampia commented 1 year ago

Neither a guess nor an assumption, simply following the direction of the spec.

If you are referring to this "RDF datasets may be used to express RDF content. When used in this way, a dataset should be understood to have at least the same content as its default graph." [ref] I'm not sure this applies to the snippet I shared. The "may" in the first sentence normative [ref], meaning it may also not be used to express RDF content. So even if you take the should in the second sentence as normative, you still are assuming the first "may" to be operative, at the very least it should be documented as such.

Is that using an RDF dataset to express RDF content? If that is, what would it look like when it is not being used to express RDF content.

That's going to break in v7, dataset.default_context will become dataset.default_graph, my point is that the implementation details intrude, so “explicit” is also “brittle”.

Version 7 is not released so what will break is undefined. But even if the V7 API does introduce breaking changes (as it should), that does not make the current API brittle, that just means we are using semantic versioning as intended.

The way I see it is: I don't like fragile software. If I ask software to do something, and it can't do it, it should error out, not do it half way, because I did not ask it to do it half way. That is why software interface should be as explicit as possible, I should explicitly be able to ask RDFLib, serialize the whole Dataset. To me, this is what I do when I call Dataset.serialize() - I ask it to serialize the whole dataset, but if there is really a good case why Dataset.serialize() should not mean "serialize the whole dataset" - then I think we should just make antoher method, Dataset.serialize_everything() - which does mean "serialize the whole dataset". I think ambiguities in the spec should not translate to ambiguities in our API.

aucampia commented 1 year ago

~~I made a poll, not that it necesarily matters but just to get a sense for preferences:~~

~~https://github.com/RDFLib/rdflib/discussions/2394~~

Never mind, I deleted it, it is a bit weird to have it separate from this, it will just split the conversation more, best that people just respond here with their preference.

aucampia commented 1 year ago

I think the right solution here is:

Change ConjunctiveGraph.serialize() to raise an exception if the selected format does not support quads.
Clarify the documentation of serialize() to indicate it is a request to serialize the whole Dataset or ConjunctiveGraph, and not a subset. This anyway seems like the reasonable thing, I don't see why calling serialize on an object should mean something other than serialize the object. If someone wants to serialize a subset of the Dataset or ConjunctiveGraph they should do so explicitly by selecting the exact subset they want, and then serializing it.

To me, this is the right solution because I think there should be a way to request to serialize the whole Dataset explicitly, and I don't see why it should not be Dataset.serialize(). If however someone can make a good argument why this should not be the way to serialize the whole Dataset I'm willing to consider it, but then we also need to select another way to request that the whole Dataset or ConjunctiveGraph be serialized, and we need to then write down what exactly serialize does if not serialize the object it was called on.