Closed ashleysommer closed 1 month ago
Another thing to point out that didn't need to be in the initial writeup:
Python has a special optimisation for dicts that are "all-string-keys", meaning where every key in the dict is a str
. That allows the dict backend to do more efficient lookups where it knows all keys are strings, and even more-so when those keys are interned strings.
RDFLib's Memory
store works surprisingly well using just a series of nested dicts as indexes. The keys for those dicts are instances of URIRef
or BNode
, and the values are instances of URIRef
, Bnode
or Literal
. However these index dicts cannot take advantage of the all-string-keys
optimization, so there could be big performance left on the table.
Memory usage could also be reduced if we could be interning the most commonly used strings.
Think of rdf:type
, its full form is http://www.w3.org/1999/02/22-rdf-syntax-ns#type
.
Every time you call RDF.type
lookup from the DefinedNamespace, it gives you a NEW COPY of the URIRef.
Every single instance of URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type')
in the MemoryStore indexes (there could be thousands) is an individual COPY.
I have plans to write an update to URIRef
(or all Identifier
classes) that does not subclass from str
. Instead it could be a wrapper, with a private property slot to hold a reference to a string it was passed in the constructor. That way no copy of the source string is made, and it allows the source string to be an interned string if applicable too. Then the __str__
method would simply return unchanged the wrapped string, again avoiding the additional copy.
After big update is in place, I plan to add a new feature in DefinedNamespace for the ability for some popular Namespaces to have their most popular members available as pre-interned instances, So it doesn't need to make a copy of the string, and doesn't even need to make a copy of the URIRef
instance, it can simply always return the same instance.
And finally, I plan to create a big update to MemoryStore (or a new Store backend) that leverages the all-string-keys
python dict optimisation by unwrapping URIRef
and BNode
instances into real strs
before pushing them into the store index.
Technically none of this is a breaking change. The fact URIRef
, BNode
and Literal
are subclass of str
is not necessarily part of the public RDFLib API surface, and changing it wouldn't typically have any user-facing changes, but its one of the things that has been that way for so long there is certainly going to be software out there that relies on it to work exactly a certain way. So I think it makes sense to delay it to the "big-changes" version 8.0, where ConjunctiveGraph
is being removed too. This might even be a good opportunity to rename URIRef
to IRI
as suggested here or even NamedNode
as is commonly used in modern projects like Rdf-JS, N3.js, and Oxigraph.
I did a quick sanity check, to see if we would lose anything (performance, compatibility, interoperability, memory usage) by moving away from Identifier
being a subclass of str
.
Simply moving from str
base type to using an inner wrapped string ("inner_") show approx 0.02% memory usage reduction (this is good) however around 10% execution time penalty due to the new indirect call to Identifier.__hash__()
(this is bad).
I'm not sure if any potential gains from the ability to use pre-hashed system-interned strings would offset this 10% performance hit here.
Further testing confirms my earlier findings.
It seems there is good reason to keep Literal
, Identifier
, BNode
etc a subclass of str
. That is because we can directly call str.__hash__()
on the variable. And rdflib does this automatically by setting the __hash__
class function function of Identifier
to be __hash__ = str.__hash__
.
The benefit is that str.__hash__
is a C-function, it is defined in the CPython source code, meaning hash lookups are called directly from the internal dict implementation without touching any python code.
Replacing that with the new class equivalent results in 10% slowdown:
__hash__ = str.__hash__ # old
#---------
def __hash__(self) -> int: # new
return str.__hash__(self.inner_)
I tested with just a minimal hash function on Identifier
.:
def __hash__(self) -> int:
return 0
And even this non-implementation is 10% slower than rdflib's current method. So that means the slowdown is simply the result of calling back into a python function, opposed to the C function.
I also tested other benefits I mentioned in the original post, eg, pre-calculated hashes and using system.intern strings, and these features help, however none of them make up for the 10% overhead incurred by the new __hash__
call.
Maybe this thread can be closed.
Good 'ol things are as good as they can be because we used C in 2003...
Keeping as is due to faster string hashing
There is a particular design choice in RDFLib that has existed since the original version in 2001.
URIRef, BlankNode, and Literal are subclasses of
Identifier
.Identifer
is a combination class, a subclass of bothstr
and the abstract classNode
.It has been this way for a long time. In the Py2->3 transition days,
Identifier
used to subclasssix.text_type
. And before that it was a subclass of Python 2'sunicode
type, right back until the first tracked commit. (Note, python2unicode
type was renamed tostr
for python3, it is literally the same type. Python2str
becamebytes
in python3).Its common to see very old python libraries using a similar pattern. There was good reason. Subclasses of
str
were treated as "real strings" by the system, that enabled better memory management and provided built-in text features likestartswith()
,lower()
, etc, and comparisons likeeq, gt, lt, gte, lte
that the subclass gets for free and work exactly like a native string.Additionally in Python prior to v2.1 "string interning" worked on subclasses of strings. That means that short constant strings, and strings managed using
sys.intern()
were managed using a global string table, were deduplicated in memory and had shortcuts like pre-calculated hashs. But for Python 2.1+, that was considered a bug and it was fixed so subclasses ofstr
could no longer be interned.One thing to note however, in the Python2 days,
Identifier
wasn't a subclass ofstr
, butunicode
which I don't think had an intern table, so RDFLib never benefited from that.That pattern in RDFLib has been maintained for 21 years without question, because "its always been that way", but I suggest it is hurting RDFLib's usability and performance. The main cause being an enormous amount of needless copying strings that happens when creating URIRef, BlankNode, and Literal instances, and reading them back out.
Examples:
Firstly, when you create a URIRef (or any subclass of Identifier) and pass in a
str
, the__new__
constructor will hand that directly into the internal string constructor (return str.__new__(input)
) however python knows its really part of a subclass of str, so it doesn't use the copy-on-write (CoW) deduplication optimisation, instead it takes a COPY of the string passed in, and saves it.But then there is now no way of getting back that string it copied and saved. If you want to serialize that
URIRef
or treat it like a string you can read from, you need to callstr(myuri)
on it. But that doesn't give you back the original string, or even the first copy it made. It makes a new COPY of the internal string. Even if you try to treat the variable as a real string, callingstr.__str__(myuri)
does the same thing, it makes another new copy. Finally, if you try to tap into the innerstring usingsuper().__str__()
it gives you the same result. There is no way to read the inner contents of a subclass of string without copying it.Python3 can still do intern strings. It is a major memory management and performance speedup. All static constant strings in your code that are 20 characters or less will be saved by the parser as interned strings, and will always refer to the same object.
and for a long string:
You can see each of the different variables and new strings that were created with the interned string are instances of the exact same string in memory.
RDFLib cannot take advantage of this however, because string interning DOES NOT work on subclasses of
str
.The reason is to do with immutability. Python treats all real strings as immutable. Any operation you do on a string will leave the original string untouched and give you back a new string. The guarantee of string immutability is something provided by the stringlib type interface in stdlib. That is why strings can be interned, identical interned strings will return the same object because they are immutable. However Python can't make those same guarantees about immutability on subclasses of strings. A subclass can change its internal contents while identifying as the same instance. That is why python must COPY the source string. Passing a real
str
intoURIRef()
constructor takes a copy of the original string, it can't use an internal reference to the original string because the copy is mutable, it needs to retain the original as immutable. Similarly when reading it back out as astr
, you can't simply read out the internal state as a string itself, because python needs to make an immutable snapshot of the subclass, it does that by taking a COPY.I was half way through making a PR to improve the performance and memory usage of URIRef creation and serialization by taking advantage of
sys.intern()
for frequently used URIRefs, when I started putting all the pieces of this issue together.It looks like RDFLib will need a fundamental shift in architecture of its most used foundation classes if we want to resolve this, so I think it will be a good issue to tackle for v8.0.