RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Rename prefix #831

Closed Davidswinkels closed 6 years ago

Davidswinkels commented 6 years ago

Love the Python library to make RDF! Keep it going.

One thing I don't know how to do, and couldn't find explicitly in the RDFLib 5.0.0 documentation, is renaming a prefix. For example, I have been given a graph from an external source and don't like the prefixes it uses. I therefore want to rename a prefix both in its declaration at the top and everywhere it is used in the graph. This is what I parsed from a link:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix n80: <http://www.opengis.net/ont/geosparql#>.
n80:Geometry rdf:type <http://www.w3.org/2002/07/owl#Class>;
    rdfs:isDefinedBy <http://www.opengis.net/ont/geosparql#>
.
n80:hasGeometry rdf:type <http://www.w3.org/2002/07/owl#ObjectProperty>;
    rdfs:isDefinedBy <http://www.opengis.net/ont/geosparql#>
.

This is where I want to go:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix geo: <http://www.opengis.net/ont/geosparql#>.
geo:Geometry rdf:type <http://www.w3.org/2002/07/owl#Class>;
    rdfs:isDefinedBy <http://www.opengis.net/ont/geosparql#>
.
geo:hasGeometry rdf:type <http://www.w3.org/2002/07/owl#ObjectProperty>;
    rdfs:isDefinedBy <http://www.opengis.net/ont/geosparql#>
.

This is how I tried to solve it, but it didn't work:

from rdflib import Namespace, Graph

buildings_ontology = 'https://bag.basisregistraties.overheid.nl/query/model?format=ttl'
g = Graph()
g.parse(buildings_ontology, format='n3')
geo = Namespace('http://www.opengis.net/ont/geosparql#')
g.namespace_manager.bind('n80', geo, override=True, replace=True)

How would you rename the prefix at the start and in the graph?

Since I think there is no method or function that does this, I opened this issue on GitHub. For now I work around it by downloading the file as text and replacing 'n80:' with 'geo:'.

mwatts15 commented 6 years ago

I used the example you gave above. This seems to do what's intended. It seems like a deficiency in the interface that you can't "unbind" a prefix, but you can bind it to something unexpected.

from rdflib import Namespace, Graph

# buildings_ontology = 'https://bag.basisregistraties.overheid.nl/query/model?format=ttl'
buildings_ontology = '831.n3'
g = Graph()
g.parse(buildings_ontology, format='n3')
geo = Namespace('http://www.opengis.net/ont/geosparql#')
not_geo = Namespace('_NotANamespace')
print(list(g.namespace_manager.namespaces()))
g.namespace_manager.bind('n80', not_geo, override=True, replace=True)
g.namespace_manager.bind('geo', geo, override=True, replace=True)
print(list(g.namespace_manager.namespaces()))
g.serialize('831-1.n3', format='n3')

Davidswinkels commented 6 years ago

Hey mwatts. Thanks for helping out. Your code changes the prefix in the namespace declaration at the top. However, it does not change the prefix consistently within the graph: sometimes it is changed correctly, and sometimes a 'p' is added in front of the old prefix.

This is the resulting output:

@prefix gov: <https://www.rijksoverheid.nl/documenten/kamerstukken/2011/01/10/> .
@prefix n69: <_NotANamespace> .
@prefix pn69: <https://www.rijksoverheid.nl/documenten/kamerstukken/2011/01/10/> .
pn69:memorie-van-toelichting-wet-basisregistraties-adressen-en-gebouwen a <http://purl.org/dc/dcmitype/Text> ;
    rdfs:label "Memorie van toelichting Wet basisregistraties adressen en gebouwen"@nl ;
    rdfs:seeAlso pn69:memorie-van-toelichting-wet-basisregistraties-adressen-en-gebouwen .

This is the preferred outcome, keeping only the new prefix:

@prefix gov: <https://www.rijksoverheid.nl/documenten/kamerstukken/2011/01/10/> .
gov:memorie-van-toelichting-wet-basisregistraties-adressen-en-gebouwen a <http://purl.org/dc/dcmitype/Text> ;
    rdfs:label "Memorie van toelichting Wet basisregistraties adressen en gebouwen"@nl ;
    rdfs:seeAlso gov:memorie-van-toelichting-wet-basisregistraties-adressen-en-gebouwen .

This is the code used:

from rdflib import Namespace, Graph

# buildings_ontology = 'https://bag.basisregistraties.overheid.nl/query/model?format=ttl'
buildings_ontology = '831.n3'
g = Graph()
g.parse(buildings_ontology, format='n3')

print('Unchanged input graph ------------------------------------------------------------ ')
print(g.serialize(format='n3').decode())

gov = Namespace('https://www.rijksoverheid.nl/documenten/kamerstukken/2011/01/10/')
notgov = Namespace('_NotANamespace')
g.namespace_manager.bind('n69', notgov, override=True, replace=True)
g.namespace_manager.bind('gov', gov, override=True, replace=True)

print('Changed output graph ------------------------------------------------------------')
print(g.serialize(format='n3').decode())

So this functionality would be a great addition to RDFLib.

When I have more time this week, I'll try to test why the prefix '_p_foo' is added.

mwatts15 commented 6 years ago

I notice that calls to serializer.serialize aren't totally independent -- if you remove the first call in your snippet, the expected output is produced. There's also the fact that unused @prefix lines are produced in the output. I have a pull request out that prevents that. I haven't tested my change specifically although the existing tests pass. YMMV.

Davidswinkels commented 6 years ago

Good that you solved the unused lines. On my side there is still some weird behaviour with namespaces and prefixes. I made a smaller, more reproducible example. Let's solve one issue at a time and focus on the duplicate prefix 'pn80'. The unwanted prefix 'pn80' is created when a new namespace overrides the namespace currently bound to the prefix.

This is the result of replacing the namespace:

@prefix n80: <_NotANamespace> .
@prefix pn80: <http://www.opengis.net/ont/geosparql#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

pn80:Geometry a <http://www.w3.org/2002/07/owl#Class> ;
    rdfs:isDefinedBy <http://www.opengis.net/ont/geosparql#/test> .

This is the preferred outcome:

@prefix n80: <_NotANamespace> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

n80:Geometry a <http://www.w3.org/2002/07/owl#Class> ;
    rdfs:isDefinedBy <http://www.opengis.net/ont/geosparql#/test> .

This is the small reproducible script that creates the bug:

from rdflib import URIRef, Namespace, Graph
from rdflib.namespace import RDF, OWL, RDFS

g = Graph()

## Example graph parsed from another source
n80 = Namespace('http://www.opengis.net/ont/geosparql#')
g.bind(prefix='n80', namespace=n80)
g.add((n80.Geometry, RDF.type, OWL.Class))
g.add((n80.Geometry, RDFS.isDefinedBy, URIRef('http://www.opengis.net/ont/geosparql#/test')))
print('External graph with undefined namespace')
print(g.serialize(format='turtle').decode())

## Replace namespace with '_NotANamespace'
notgeo = Namespace('_NotANamespace')
geo = Namespace('http://www.opengis.net/ont/geosparql#')
g.namespace_manager.bind(prefix='n80', namespace=notgeo, override=True, replace=True)
#g.namespace_manager.bind(prefix='geo', namespace=geo, override=True, replace=True)

print('External graph with defined namespace')
print(g.serialize(format='turtle').decode())

I was looking at the source code of the bind function (see below) and found that the call self.store.bind(prefix, namespace) somehow leads to a duplicate prefix 'pn80'. Beyond that, I'm not sure what happens inside self.store.bind:

    def bind(self, prefix, namespace, override=True, replace=False):

        """bind a given namespace to the prefix

        if override, rebind, even if the given namespace is already
        bound to another prefix.

        if replace, replace any existing prefix with the new namespace

        """

        namespace = URIRef(str(namespace))
        # When documenting explain that override only applies in what cases
        if prefix is None:
            prefix = ''
        bound_namespace = self.store.namespace(prefix)
        # Check if the bound_namespace contains a URI
        # and if so convert it into a URIRef for comparison
        # This is to prevent duplicate namespaces with the
        # same URI
        if bound_namespace:
            bound_namespace = URIRef(bound_namespace)
        if bound_namespace and bound_namespace != namespace:

            if replace:
                self.store.bind(prefix, namespace)  # <-- here the duplicate prefix is somehow created
                return

            # prefix already in use for different namespace
            #
            # append number to end of prefix until we find one
            # that's not in use.
            if not prefix:
                prefix = "default"
            num = 1
            while 1:
                new_prefix = "%s%s" % (prefix, num)
                tnamespace = self.store.namespace(new_prefix)
                if tnamespace and namespace == URIRef(tnamespace):
                    # the prefix is already bound to the correct
                    # namespace
                    return
                if not self.store.namespace(new_prefix):
                    break
                num += 1
            self.store.bind(new_prefix, namespace)
        else:
            bound_prefix = self.store.prefix(namespace)
            if bound_prefix is None:
                self.store.bind(prefix, namespace)
            elif bound_prefix == prefix:
                pass  # already bound
            else:
                if override or bound_prefix.startswith("_"):  # or a generated
                                                              # prefix
                    self.store.bind(prefix, namespace)
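
One way to reason about where 'pn80' could come from is a toy model. This is a guess at the mechanism, not RDFLib's actual code. It assumes the store keeps one dictionary per lookup direction (matching the `namespace(prefix)` and `prefix(namespace)` calls above), that its `bind` overwrites both without removing the old reverse entry, and that the serializer renames a clashing prefix by prepending 'p'. `ToyStore` and `serializer_prefixes` are hypothetical names.

```python
class ToyStore:
    """Simplified stand-in for a namespace store (NOT an rdflib class)."""

    def __init__(self):
        self.prefix_to_ns = {}
        self.ns_to_prefix = {}

    def bind(self, prefix, namespace):
        # Assumed defect: the namespace previously bound to `prefix`
        # keeps its reverse entry in ns_to_prefix.
        self.prefix_to_ns[prefix] = namespace
        self.ns_to_prefix[namespace] = prefix

    def namespace(self, prefix):
        return self.prefix_to_ns.get(prefix)

    def prefix(self, namespace):
        return self.ns_to_prefix.get(namespace)


def serializer_prefixes(store, used_namespaces):
    """Assumed serializer behaviour: start from the declared bindings,
    then look up a prefix for each namespace used in the graph and
    rename it with a leading 'p' if that prefix is already taken."""
    chosen = dict(store.prefix_to_ns)
    for ns in used_namespaces:
        prefix = store.prefix(ns)
        if prefix is None:
            continue
        if chosen.get(prefix, ns) != ns:
            renamed = 'p' + prefix
            while renamed in chosen:
                renamed = 'p' + renamed
            prefix = renamed
        chosen[prefix] = ns
    return chosen


GEO = 'http://www.opengis.net/ont/geosparql#'
store = ToyStore()
store.bind('n80', GEO)               # binding from the external source
store.bind('n80', '_NotANamespace')  # the "replace" step
result = serializer_prefixes(store, [GEO])
print(result)
```

Under these assumptions the stale reverse entry still maps the geosparql namespace to 'n80', so the toy serializer has to invent 'pn80', which matches the output shown earlier. If that is what happens in the real store, clearing the old reverse entry when a prefix is replaced would be the fix to look at.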