RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Bug: Unexpected namespace creation during turtle file serialization #2779

Open mickremedi opened 2 months ago

mickremedi commented 2 months ago

Hello! I've noticed some unexpected behavior when serializing a graph to Turtle: if I add a triple to an empty graph with no bound namespaces and then serialize it, a generated prefix shows up in the Turtle output and is also bound on the graph itself:

import rdflib

TRIPLE = (
    rdflib.URIRef("http://example1.com/s"),
    rdflib.URIRef("http://example2.com/p"),
    rdflib.Literal("some literal"),
)

g = rdflib.Graph(bind_namespaces="none")
g.add(TRIPLE)

print("Namespaces Before:", list(g.namespaces()))

x = g.serialize(format="turtle")

print(x)
print("Namespaces After:", list(g.namespaces()))

Results in:

Namespaces Before: []
@prefix ns1: <http://example2.com/> .

<http://example1.com/s> ns1:p "some literal" .

Namespaces After: [('ns1', rdflib.term.URIRef('http://example2.com/'))]

Whereas I would expect:

Namespaces Before: []
<http://example1.com/s> <http://example2.com/p> "some literal" .

Namespaces After: []

I've boiled it down to the following line: https://github.com/RDFLib/rdflib/blob/fb43b7afe80175aedd87506899dff2ccdb312c66/rdflib/plugins/serializers/turtle.py#L270

Here the serializer creates a new prefix whenever it looks at the predicate of a triple during serialization. I couldn't trace this change through git blame, and I can't find any docs explaining that serialize modifies the graph. Does anyone know why this was put there, and whether it could be changed to self.getQName(node, gen_prefix=False)? This seems to have already been done for TriG files (#2467).

sardormajano commented 1 month ago

Running into the same issue

seo-chang commented 1 month ago

Please solve this!

mickremedi commented 1 month ago

Quick Note: I've been able to patch this bug for now by overriding the getQName() method:

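from rdflib.plugins.serializers.turtle import TurtleSerializer

class FixedTurtleSerializer(TurtleSerializer):
    # Ignore the caller's gen_prefix so that no new prefix is ever generated.
    def getQName(self, uri, gen_prefix=True):
        return super().getQName(uri, gen_prefix=False)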

This works because several places in this serializer call the method, and the override covers all of them. I'm considering opening a PR to change serialize so that it does not generate namespaces by default.

A possible API could look like:

g = rdflib.Graph(bind_namespaces="none")
serialized_without_prefixes = g.serialize(format="turtle", generate_prefixes=False)
serialized_with_prefixes = g.serialize(format="turtle", generate_prefixes=True)

Any thoughts? I could also go with the reverse approach where the default behavior remains the same and an optional param is added to disable the predicate prefix generation. This would be non-breaking, but could be less intuitive for new users.

nicholascar commented 1 month ago

Having the two options - to generate and to not generate prefixes - with a documented default sounds great, please do make a PR!