Open aucampia opened 1 year ago
Please let me know if anyone disagrees with the proposed behaviour:
I would say the right behaviour is that an error is raised when a ConjunctiveGraph or Dataset with non-default graphs is serialized using a format that does not support named graphs.
CC: @RDFLib/core-reviewers
Rather than wrong output being generated by ConjunctiveGraph.serialize(format="turtle"), I would expect an error, even if Graph.serialize(format="turtle") works.
What would be the correct output? Only the data from the default graph or no data "overwritten" from the default graph? Or is there no (simple) correct turtle output for this conjunctive graph? I ask because I would expect exactly the given output, but I haven't worked much with conjunctive graphs.
I lack the correct word for what I mean by "overwritten". I mean something like this:
@prefix ns1: <http://example.com/> .
@prefix ns2: <urn:example:> .
@prefix ns3: <example:> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
ns1:subject ns1:predicate "typeless".
#The next line would "overwrite" information so it is ignored
#ns1:subject ns1:predicate "日本語の表記体系"@jpx .
ns2:subject ns2:predicate ns2:object .
ns3:subject ns3:predicate ns3:object;
ns1:predicate ns1:object,
"XSD string" .
What would be the correct output? Only the data from the default graph or no data "overwritten" from the default graph? Or is there no (simple) correct turtle output for this conjunctive graph?
When trying to serialize named graphs using a format that doesn't support named graphs (e.g. turtle) I think the only options are:
Only the data from the default graph or no data "overwritten" from the default graph? ...
ns1:subject ns1:predicate "typeless".
#The next line would "overwrite" information so it is ignored
#ns1:subject ns1:predicate "日本語の表記体系"@jpx .
RDF does not work like this: there is no way for new triples to affect previous triples. If the same subject and predicate appear twice in an input document with different objects, you just get another triple. If you want to prevent this, you would need to use SHACL or something similar in your ingestion pipeline; RDF itself just treats it as another triple.
https://www.w3.org/TR/rdf11-concepts/#section-rdf-graph
An RDF graph is a set of RDF triples.
So, as long as a triple differs from every other triple in the set in at least one part, it is a distinct triple, and it does not invalidate another triple with the same subject and predicate.
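A minimal sketch of the set semantics, modelling a graph as a plain Python set of (subject, predicate, object) tuples. This is an illustration only and does not use RDFLib; the prefixed strings are stand-ins for the terms in the Turtle example.

```python
# Model an RDF graph as a plain set of (subject, predicate, object) tuples.
# This is an illustrative stand-in, not how RDFLib stores triples.
graph = set()

graph.add(("ns1:subject", "ns1:predicate", '"typeless"'))
# Same subject and predicate, different object: this is simply a second triple.
graph.add(("ns1:subject", "ns1:predicate", '"日本語の表記体系"@jpx'))
# Adding an identical triple again changes nothing, because a set has no duplicates.
graph.add(("ns1:subject", "ns1:predicate", '"typeless"'))

# Collect every object recorded for the shared subject and predicate.
objects = sorted(
    o for s, p, o in graph if (s, p) == ("ns1:subject", "ns1:predicate")
)
print(len(graph))
print(objects)
```

Both literals remain in the graph; nothing is overwritten or ignored.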
Throwing an error is certainly most explicit. And while I presume it is backwards incompatible, if it is wrong right now, it seems better to error out than to change its behaviour.
It might be good to note how the RDF 1.1 Concepts defines content negotiation of RDF datasets:
If an RDF dataset is returned and the consumer is expecting an RDF graph, the consumer is expected to use the RDF dataset's default graph.
It is understandable why e.g. Jena has made that decision, and it warrants consideration. But invoking serialization programmatically is not the same as content negotiation; it is rather how you'd implement it. So if the user of the API expects an RDF graph, the user is to use dataset_or_cg.default_context.
(The error should probably be helpful and suggest that. It might also be helpful to add an alias property named default_graph to Dataset, since "context" is non-standard naming for graphs; but that is a separate issue.)
To reason further, the "role" of named graphs in a dataset may vary (some use the dataset as a union of graphs, some perhaps as versions of descriptions where only an explicitly chosen subset of them are considered "valid" or "active"). So it makes more sense to force an active choice. It could be useful to make it easy to serialize the union as a stream of triples, but that would be an additional feature.
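One way the "union as a stream of triples" idea might look, as a rough sketch. The name union_triples is hypothetical and not part of RDFLib; the function only assumes the object it is given has a quads() method that accepts a four-element pattern and yields four-element tuples, as RDFLib's Dataset and ConjunctiveGraph do.

```python
from typing import Any, Iterator, Tuple

Triple = Tuple[Any, Any, Any]


def union_triples(dataset: Any) -> Iterator[Triple]:
    """Yield each distinct triple found in any graph of the dataset, once.

    Hypothetical helper: `dataset` is assumed to expose quads() like an
    RDFLib Dataset or ConjunctiveGraph.
    """
    seen = set()
    # quads() yields 4-tuples; the fourth element (the graph) is dropped here.
    for s, p, o, _graph in dataset.quads((None, None, None, None)):
        triple = (s, p, o)
        if triple not in seen:
            seen.add(triple)
            yield triple
```

The resulting triples could then be added to a plain Graph and serialized with a format that has no notion of named graphs. Note that this sketch keeps every triple it has seen in memory in order to drop repeats.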
I posted to the public-rdf-dev mailing list also, https://lists.w3.org/Archives/Public/public-rdf-dev/2023May/0000.html
The RDF spec asserts:
RDF datasets MAY be used to express RDF content. When used in this way, a dataset SHOULD be understood to have at least the same content as its default graph.
If the user has specified a context-unaware serialization format, it’s not unreasonable to treat this as intentional and to return the serialized triples of the default graph because the Dataset default graph has no name¹ and can therefore be considered as selected by the specification of a context-unaware serialization format.
Is it perhaps more usefully viewed as an implementation-independent means of specifying serialization of a Dataset’s unnamed default graph?
If the user has specified a context-unaware serialization format, it’s not unreasonable to treat this as intentional and to return the serialized triples of the default graph because the Dataset default graph has no name¹ and can therefore be considered as selected by the specification of a context-unaware serialization format.
On the other hand, someone could have made a mistake in their code, and not realized that the format at a specific point is or can be context-unaware. In this case, trying to guess what the user meant would be masking a bug in the user's code, whereas if they really did mean to serialize only the default context, there would have been an easy way for them to do it explicitly.
Is it perhaps more usefully viewed as an implementation-independent means of specifying serialization of a Dataset’s unnamed default graph?
There are already ways to do this much more explicitly and without ambiguity, which users should use instead:
from rdflib import Dataset

dataset = Dataset()
# add named graphs and triples
dataset.default_context.serialize(format="turtle")
In this case, trying to guess what the user meant
Neither a guess nor an assumption, simply following the direction of the spec.
... an implementation-independent means of specifying serialization of a Dataset’s unnamed default graph?
There are already ways to do this much more explicitly and without ambiguity, which users should use instead:
dataset = Dataset()
# add named graphs and triples
dataset.default_context.serialize(format="turtle")
That's going to break in v7, dataset.default_context will become dataset.default_graph, my point is that the implementation details intrude, so “explicit” is also “brittle”.
Neither a guess nor an assumption, simply following the direction of the spec.
If you are referring to this: "RDF datasets may be used to express RDF content. When used in this way, a dataset should be understood to have at least the same content as its default graph." [ref] I'm not sure this applies to the snippet I shared. The "may" in the first sentence is normative [ref], meaning a dataset may also not be used to express RDF content. So even if you take the "should" in the second sentence as normative, you are still assuming the first "may" to be operative; at the very least it should be documented as such.
Is that using an RDF dataset to express RDF content? If it is, what would it look like when it is not being used to express RDF content?
That's going to break in v7, dataset.default_context will become dataset.default_graph, my point is that the implementation details intrude, so “explicit” is also “brittle”.
Version 7 is not released so what will break is undefined. But even if the V7 API does introduce breaking changes (as it should), that does not make the current API brittle, that just means we are using semantic versioning as intended.
The way I see it is: I don't like fragile software. If I ask software to do something, and it can't do it, it should error out, not do it half way, because I did not ask it to do it half way. That is why software interfaces should be as explicit as possible; I should be able to explicitly ask RDFLib to serialize the whole Dataset. To me, this is what I do when I call Dataset.serialize(): I ask it to serialize the whole dataset. But if there is really a good case why Dataset.serialize() should not mean "serialize the whole dataset", then I think we should just make another method, Dataset.serialize_everything(), which does mean "serialize the whole dataset". I think ambiguities in the spec should not translate to ambiguities in our API.
I made a poll, not that it necessarily matters, but just to get a sense for preferences:
Never mind, I deleted it, it is a bit weird to have it separate from this, it will just split the conversation more, best that people just respond here with their preference.
I think the right solution here is:
ConjunctiveGraph.serialize() to raise an exception if the selected format does not support quads.
serialize() to indicate it is a request to serialize the whole Dataset or ConjunctiveGraph, and not a subset. This anyway seems like the reasonable thing, I don't see why calling serialize on an object should mean something other than serialize the object. If someone wants to serialize a subset of the Dataset or ConjunctiveGraph they should do so explicitly by selecting the exact subset they want, and then serializing it.
To me, this is the right solution because I think there should be a way to request to serialize the whole Dataset explicitly, and I don't see why it should not be Dataset.serialize(). If however someone can make a good argument why this should not be the way to serialize the whole Dataset I'm willing to consider it, but then we also need to select another way to request that the whole Dataset or ConjunctiveGraph be serialized, and we need to then write down what exactly serialize does if not serialize the object it was called on.
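The proposed behaviour could be approximated today with a small wrapper, sketched below under several assumptions. The name serialize_strict and the QUAD_FORMATS set are hypothetical and not part of RDFLib; the set is a non-exhaustive guess at the format names that can carry named graphs and ignores aliases and media types. The function only relies on the contexts(), default_context, identifier and serialize() members that Dataset and ConjunctiveGraph expose.

```python
from typing import Any

# Assumption: a non-exhaustive set of format names that can carry named graphs.
QUAD_FORMATS = frozenset({"nquads", "trig", "trix", "json-ld", "hext"})


def serialize_strict(dataset: Any, format: str) -> str:
    """Serialize the whole dataset, or raise if the format cannot hold it.

    Hypothetical helper: `dataset` is assumed to be an RDFLib Dataset or
    ConjunctiveGraph (or anything with the same members).
    """
    if format in QUAD_FORMATS:
        return dataset.serialize(format=format)

    default = dataset.default_context
    # Named graphs that actually contain triples cannot be represented
    # in a format without named-graph support.
    populated = [
        g.identifier
        for g in dataset.contexts()
        if g.identifier != default.identifier and len(g) > 0
    ]
    if populated:
        raise ValueError(
            f"format {format!r} does not support named graphs, but the dataset "
            f"has {len(populated)} non-empty named graph(s); serialize "
            "dataset.default_context explicitly if only the default graph is wanted"
        )
    # Only the default graph has content, so serializing it loses nothing.
    return default.serialize(format=format)
```

With this sketch, asking for "turtle" on a dataset with a populated named graph raises a ValueError whose message points at default_context, while "trig" or "nquads" pass straight through to the dataset's own serialize().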
If RDFLib is used to serialize a Dataset or ConjunctiveGraph that contains non-default graphs [ref] as a format that does not support named graphs (e.g. N-Triples or Turtle), no error is raised, and the output is wrong.
Given this data in test/data/variants/diverse_quads.nq (equivalent to this trig document):
https://github.com/RDFLib/rdflib/blob/ddcc4eb622a000cf991f9c530d55d62115484fca/test/data/variants/diverse_quads.nq#L1-L10
Using rdfpipe to convert it to turtle gives:
And using rdfpipe to convert it to ntriples gives:
In both cases, the output is wrong, not just incomplete.
I would say the right behaviour is that an error is raised when a ConjunctiveGraph or Dataset with non-default graphs is serialized using a format that does not support named graphs.
I'm making this issue to get some feedback, but I will make a PR to fix it shortly.
There is a similar issue with riot from Jena [ref], but riot's behaviour is not quite as bad