RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License
2.17k stars 555 forks source link

Incongruity in type signature of NamespaceManager #2004

Open ajnelson-nist opened 2 years ago

ajnelson-nist commented 2 years ago

Is there a reason rdflib.namespace.NamespaceManager.namespaces has the return signature Iterable[Tuple[str, URIRef]] - specifically, URIRef instead of Namespace?

aucampia commented 2 years ago

@ajnelson-nist that is in primarily because it will return URIRef objects and not Namespace objects:

From monkeytype:

20220729T042949 iwana@teekai.zoic.eu.org:~/sw/d/github.com/iafork/rdflib.monkeytype
$ task run -- monkeytype stub rdflib.namespace | tee rdflib/namespace/__init__.pyi
task: [run] 
VIRTUAL_ENV="/home/iwana/sw/d/github.com/iafork/rdflib.monkeytype/.venv" PATH="/home/iwana/sw/d/github.com/iafork/rdflib.monkeytype/.venv/bin${PATH:+:${PATH}}" monkeytype stub rdflib.namespace

55 traces failed to decode; use -v for details
from rdflib.term import URIRef
from typing import (
    Any,
    Dict,
    Iterable,
    List,
    Optional,
    Tuple,
    Type,
    Union,
)

# ...elided...

class NamespaceManager:
    def __init__(self, graph: Graph, bind_namespaces: _NamespaceSetString = ...) -> None: ...
    def _store_bind(self, prefix: str, namespace: URIRef, override: bool) -> None: ...
    def absolutize(self, uri: str, defrag: int = ...) -> URIRef: ...
    def bind(self, prefix: Optional[str], namespace: Any, override: bool = ..., replace: bool = ...) -> None: ...
    def compute_qname(self, uri: str, generate: bool = ...) -> Tuple[str, URIRef, str]: ...
    def compute_qname_strict(self, uri: str, generate: bool = ...) -> Tuple[str, str, str]: ...
    def expand_curie(self, curie: str) -> Optional[URIRef]: ...
    def namespaces(self) -> Iterable[Tuple[str, URIRef]]: ...
    def normalizeUri(self, rdfTerm: str) -> str: ...
    def qname(self, uri: str) -> str: ...
    def qname_strict(self, uri: str) -> str: ...
    def reset(self) -> None: ...
    @property
    def store(self) -> Store: ...

# ...elided...

And from pytest-annotate

    {
        "path": "rdflib/namespace/__init__.py",
        "line": 707,
        "func_name": "NamespaceManager.namespaces",                                                                                                                          
        "type_comments": [
            "() -> Iterator",
            "() -> Iterator[Tuple[str, rdflib.term.URIRef]]"
        ],
        "samples": 2051
    },
ajnelson-nist commented 2 years ago

@aucampia That is an informative view of the inertia of that type signature. However, I believe there are good reasons to upset it.

In particular, I'm concerned with the interface between the .namespaces method and Graph.bind(). .bind hasn't received type review yet. Store.bind() types namespace as a URIRef. As I noted on another thread this morning, I'm not so familiar with the Store code tree. However, I do make frequent use of Graph.bind.

I personally expect .namespaces to give me objects that feed into Graph.bind, and I further expect the objects from .namespaces to be Namespaces because I would like to be able to use the [] or . operators to reference terms within a namespace.

I've seen significant confusion arise from these concepts being muddied, some back since the pre-RDF days of XML:

Each of these is still a distinct concept. Some current technologies make confusions of them all come data-serialization time. FOAF ends its ontology IRI with a slash, which OWL permits and which allows a handy and inoffensive confusion of ontology IRI, namespace, and string prefix. But I'm particularly thinking of JSON-LD's context dictionary that uses string prefixes, which are not necessarily namespaces. I encountered this when dealing with the differences in forward-slash parsing between SPARQL, Turtle, and JSON-LD to handle IANA Media types and their slashes. (IANA Media Types don't have an IANA-provide ontology to my knowledge, but Dublin Core Terms provides http://purl.org/dc/terms/IMT, and a "namespace for MIME media types vocabulary" from the Publishing Metadata documnt: <http://purl.org/NET/mediatypes/>.)

So, I wish to argue for precision in the type designation for .namespaces(), and let the type effects bubble out from there.

If you agree with this way forward, I will put some effort into the type designation with a PR.

aucampia commented 2 years ago

In particular, I'm concerned with the interface between the .namespaces method and Graph.bind(). .bind hasn't received type review yet. Store.bind() types namespace as a URIRef. As I noted on another thread this morning, I'm not so familiar with the Store code tree. However, I do make frequent use of Graph.bind.

I personally expect .namespaces to give me objects that feed into Graph.bind, and I further expect the objects from .namespaces to be Namespaces because I would like to be able to use the [] or . operators to reference terms within a namespace.

Store, NamespaceManager and Graph will have more or less complete type signatures with https://github.com/RDFLib/rdflib/pull/2080 - the type signature for Graph.bind() is:

https://github.com/RDFLib/rdflib/blob/97f88d3200bedcc22ec100bffd5d17b1d57fd122/rdflib/graph.py#L1191-L1197

In here, the namespace has type Any, but for a very good reason which is more or less related to your concern here.

Currently we have namespaces occurring with the following types:

The class diagram for these types is:

class diagram

If not for DefinedNamespace, all classes used to represent namespaces would derive from str, but because DefinedNamespace does not there is no common superclass, other than object which is the base class for all new style classes.

However, all these classes do have __str__, and all values of namespace passed to Graph.bind() gets converted to URIRef by first calling str() on them (which calls __str__) and then by using that to construct a URIRef:

https://github.com/RDFLib/rdflib/blob/97f88d3200bedcc22ec100bffd5d17b1d57fd122/rdflib/namespace/__init__.py#L639-L655

So really, anything can be passed as the namespace argument to Graph.bind() and it will work, maybe there are some corner cases where we will raise an exception for invalid URIs, but I think currently we just warn for invalid URIs.

But this is also why the the following three methods return URIRef, and not Namespace:

If you pass a Namespace object to Graph.bind(), it won't be stored as a Namespace, but as a URIRef, and if we change the typing of Graph.namespaces() to be Namespace instead of URIRef, then the outcome will be that code will pass mypy validation but cause exceptions at runtime, there is no way under normal operation that you will get a it to return a Namespace object. You could maybe get something if you add it to Store implementations directly, but that will only work for some store implementations, and won't work for example if you use berkelydb:

https://github.com/RDFLib/rdflib/blob/97f88d3200bedcc22ec100bffd5d17b1d57fd122/rdflib/plugins/stores/berkeleydb.py#L587-L598

It would be better if Graph.namespaces() did return a Namespace, but this would require changes in more than type hints, it would require that we change

https://github.com/RDFLib/rdflib/blob/97f88d3200bedcc22ec100bffd5d17b1d57fd122/rdflib/namespace/__init__.py#L655

to something like

        namespace = Namespace(str(namespace))

(and some additional changes)

We should look at doing this in a future version of RDFLib, however we need quite a big overhaul of our public interfaces in general, as shared in some other places, I think the right approach here is that we should design a new Graph interface addressing most of the problems that we know of in rdflib._experimental, and then once we are happy with the interface redesign we can proceed to release version 7. The alternative will be that every release becomes a major version release, as there is many things in the interface that needs fixing and doing it all in one PR will make for one very hard to review PR, and it is we will likely have some fluctuations in each version as we try and stick everything together.

We should also really have architectural decision records for interface breakages, and ideally there should be a deprecation notice or warning in at least one version with some option for adopting the new behaviour.

ajnelson-nist commented 2 years ago

I had a sneaking feeling there would be some reason due to technology breadth.

Thanks to your thorough explanation---and thanks for it!---I agree with this needing to be in a 7.0.0 release.

Is there a development branch where such effects are being pooled, or is there some new module _experimental.py where these will be pooled in the master branch?

aucampia commented 2 years ago

Is there a development branch where such effects are being pooled, or is there some new module _experimental.py where these will be pooled in the master branch?

I will try to find time to make an ADR this weekend to define _experimental and the approach to interface breaking changes in general, once that is in place I will make an ADR for new interfaces. In general we should try and approach the problem similar to collections.abc with fairly minimal base base classes that provide very specific functionality so we can easily mix and match them.

But even with this we should be as conservative as possible with interface breakages, one other option here is to change Namespace and DefinedNamespace to inherit from URIRef, and then we can actually change the code just fine without breaking interfaces.

Ideally there should be issues for all problems with the existing interfaces, and most of them should be labelled as backwards incompatible and enhancement, I did not label this with backwards incompatible though as there may be ways to fix this without breaking backwards compatibility.