RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License
2.18k stars 558 forks source link

[DISCUSSION] Reasons to keep URIRef, BlankNode, and Literal as subclasses of `str`? #2866

Closed ashleysommer closed 1 month ago

ashleysommer commented 3 months ago

There is a particular design choice in RDFLib that has existed since the original version in 2001.

URIRef, BlankNode, and Literal are subclasses of Identifier. Identifer is a combination class, a subclass of both str and the abstract class Node.

It has been this way for a long time. In the Py2->3 transition days, Identifier used to subclass six.text_type. And before that it was a subclass of Python 2's unicode type, right back until the first tracked commit. (Note, python2 unicode type was renamed to str for python3, it is literally the same type. Python2 str became bytes in python3).

Its common to see very old python libraries using a similar pattern. There was good reason. Subclasses of str were treated as "real strings" by the system, that enabled better memory management and provided built-in text features like startswith(), lower(), etc, and comparisons like eq, gt, lt, gte, lte that the subclass gets for free and work exactly like a native string.

Additionally in Python prior to v2.1 "string interning" worked on subclasses of strings. That means that short constant strings, and strings managed using sys.intern() were managed using a global string table, were deduplicated in memory and had shortcuts like pre-calculated hashs. But for Python 2.1+, that was considered a bug and it was fixed so subclasses of str could no longer be interned.

One thing to note however, in the Python2 days, Identifier wasn't a subclass of str, but unicode which I don't think had an intern table, so RDFLib never benefited from that.

That pattern in RDFLib has been maintained for 21 years without question, because "its always been that way", but I suggest it is hurting RDFLib's usability and performance. The main cause being an enormous amount of needless copying strings that happens when creating URIRef, BlankNode, and Literal instances, and reading them back out.

Examples:

>>> from rdflib import URIRef
>>> mystring = "https://example.org/example"
>>> id(mystring)
126135362697728
>>> myuri = URIRef(mystring)
>>> id(myuri)
126135340674256
>>> innerstring1 = str(myuri)
>>> id(innerstring1)
126135358248864
>>> innerstring1 is mystring
False
>>> innerstring2 = str.__str__(myuri)
>>> id(innerstring2)
126135353645280
>>> innerstring2 is mystring
False
>>> innerstring3 = super(Identifier, myuri).__str__()
>>> id(innerstring3)
126135350065744
>>> innerstring3 is mystring
False

Firstly, when you create a URIRef (or any subclass of Identifier) and pass in a str, the __new__ constructor will hand that directly into the internal string constructor (return str.__new__(input)) however python knows its really part of a subclass of str, so it doesn't use the copy-on-write (CoW) deduplication optimisation, instead it takes a COPY of the string passed in, and saves it.

But then there is now no way of getting back that string it copied and saved. If you want to serialize that URIRef or treat it like a string you can read from, you need to call str(myuri) on it. But that doesn't give you back the original string, or even the first copy it made. It makes a new COPY of the internal string. Even if you try to treat the variable as a real string, calling str.__str__(myuri) does the same thing, it makes another new copy. Finally, if you try to tap into the innerstring using super().__str__() it gives you the same result. There is no way to read the inner contents of a subclass of string without copying it.

Python3 can still do intern strings. It is a major memory management and performance speedup. All static constant strings in your code that are 20 characters or less will be saved by the parser as interned strings, and will always refer to the same object.

>>> "myconstant"  #short constants are interened automatically
'myconstant'
>>> myvar1 = "myconstant"
>>> myvar2 = str("myconstant")
>>> myvar1 is myvar2
True

and for a long string:

>>> myconstant = sys.intern("https://example.org/example")
>>> myvar1 = str(myconstant)
>>> myvar2 = str(myconstant)
>>> myconstant is myvar1 and myvar1 is myvar2
True

You can see each of the different variables and new strings that were created with the interned string are instances of the exact same string in memory.

RDFLib cannot take advantage of this however, because string interning DOES NOT work on subclasses of str.

>>> myconstant = sys.intern("https://example.org/example")
>>> myvar1 = URIRef(myconstant)
>>> myvar2 = URIRef(myconstant)
>>> myconstant is myvar1
False
>>> myvar1 is myvar2
False

The reason is to do with immutability. Python treats all real strings as immutable. Any operation you do on a string will leave the original string untouched and give you back a new string. The guarantee of string immutability is something provided by the stringlib type interface in stdlib. That is why strings can be interned, identical interned strings will return the same object because they are immutable. However Python can't make those same guarantees about immutability on subclasses of strings. A subclass can change its internal contents while identifying as the same instance. That is why python must COPY the source string. Passing a real str into URIRef() constructor takes a copy of the original string, it can't use an internal reference to the original string because the copy is mutable, it needs to retain the original as immutable. Similarly when reading it back out as a str, you can't simply read out the internal state as a string itself, because python needs to make an immutable snapshot of the subclass, it does that by taking a COPY.

I was half way through making a PR to improve the performance and memory usage of URIRef creation and serialization by taking advantage of sys.intern() for frequently used URIRefs, when I started putting all the pieces of this issue together.

It looks like RDFLib will need a fundamental shift in architecture of its most used foundation classes if we want to resolve this, so I think it will be a good issue to tackle for v8.0.

ashleysommer commented 3 months ago

Another thing to point out that didn't need to be in the initial writeup: Python has a special optimisation for dicts that are "all-string-keys", meaning where every key in the dict is a str. That allows the dict backend to do more efficient lookups where it knows all keys are strings, and even more-so when those keys are interned strings.

RDFLib's Memory store works surprisingly well using just a series of nested dicts as indexes. The keys for those dicts are instances of URIRef or BNode, and the values are instances of URIRef, Bnode or Literal. However these index dicts cannot take advantage of the all-string-keys optimization, so there could be big performance left on the table.

Memory usage could also be reduced if we could be interning the most commonly used strings. Think of rdf:type, its full form is http://www.w3.org/1999/02/22-rdf-syntax-ns#type. Every time you call RDF.type lookup from the DefinedNamespace, it gives you a NEW COPY of the URIRef. Every single instance of URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type') in the MemoryStore indexes (there could be thousands) is an individual COPY.

ashleysommer commented 3 months ago

I have plans to write an update to URIRef (or all Identifier classes) that does not subclass from str. Instead it could be a wrapper, with a private property slot to hold a reference to a string it was passed in the constructor. That way no copy of the source string is made, and it allows the source string to be an interned string if applicable too. Then the __str__ method would simply return unchanged the wrapped string, again avoiding the additional copy.

After big update is in place, I plan to add a new feature in DefinedNamespace for the ability for some popular Namespaces to have their most popular members available as pre-interned instances, So it doesn't need to make a copy of the string, and doesn't even need to make a copy of the URIRef instance, it can simply always return the same instance.

And finally, I plan to create a big update to MemoryStore (or a new Store backend) that leverages the all-string-keys python dict optimisation by unwrapping URIRef and BNode instances into real strs before pushing them into the store index.

Technically none of this is a breaking change. The fact URIRef, BNode and Literal are subclass of str is not necessarily part of the public RDFLib API surface, and changing it wouldn't typically have any user-facing changes, but its one of the things that has been that way for so long there is certainly going to be software out there that relies on it to work exactly a certain way. So I think it makes sense to delay it to the "big-changes" version 8.0, where ConjunctiveGraph is being removed too. This might even be a good opportunity to rename URIRef to IRI as suggested here or even NamedNode as is commonly used in modern projects like Rdf-JS, N3.js, and Oxigraph.

ashleysommer commented 3 months ago

I did a quick sanity check, to see if we would lose anything (performance, compatibility, interoperability, memory usage) by moving away from Identifier being a subclass of str.

Simply moving from str base type to using an inner wrapped string ("inner_") show approx 0.02% memory usage reduction (this is good) however around 10% execution time penalty due to the new indirect call to Identifier.__hash__() (this is bad).

I'm not sure if any potential gains from the ability to use pre-hashed system-interned strings would offset this 10% performance hit here.

ashleysommer commented 3 months ago

Further testing confirms my earlier findings. It seems there is good reason to keep Literal, Identifier, BNode etc a subclass of str. That is because we can directly call str.__hash__() on the variable. And rdflib does this automatically by setting the __hash__ class function function of Identifier to be __hash__ = str.__hash__.

The benefit is that str.__hash__ is a C-function, it is defined in the CPython source code, meaning hash lookups are called directly from the internal dict implementation without touching any python code.

Replacing that with the new class equivalent results in 10% slowdown:

__hash__ = str.__hash__  # old
#---------
def __hash__(self) -> int:  # new
    return str.__hash__(self.inner_)

I tested with just a minimal hash function on Identifier.:

def __hash__(self) -> int:
    return 0

And even this non-implementation is 10% slower than rdflib's current method. So that means the slowdown is simply the result of calling back into a python function, opposed to the C function.

I also tested other benefits I mentioned in the original post, eg, pre-calculated hashes and using system.intern strings, and these features help, however none of them make up for the 10% overhead incurred by the new __hash__ call.

Maybe this thread can be closed.

nicholascar commented 3 months ago

Good 'ol things are as good as they can be because we used C in 2003...

nicholascar commented 1 month ago

Keeping as is due to faster string hashing