RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License
2.17k stars 557 forks source link

Explicit sys.intern for Identifier: Faster comparisons and dict lookup, less duplicated strings. #1302

Closed rchateauneu closed 2 years ago

rchateauneu commented 3 years ago

Python offers an interesting feature, sys.intern(string), which inserts string in the table of “interned” strings, useful to gain a little performance on dictionary lookup. If the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare.

https://stackoverflow.com/questions/1136826/what-does-sys-intern-do-and-when-should-it-be-used

"Instead of creating a new copy of string every time, this optimization method dictates to keep just one copy of string for every appropriate immutable distinct value and use the pointer reference wherever referred" https://arpitbhayani.me/blogs/string-interning

Only short and compile-time strings are automatically interned: http://guilload.com/python-string-interning/

It seems that it fits the use case of reading long strings from a file and inserting them in several dicts. The method Memory.add() used to insert a triple in a memory store might used up to 40 % of CPU when deserializing a NT text file. This also applies to SimpleMemory.add()

Its structure is made of three dicts of dicts:

Each triple read from a file is split into its individual strings, inserted in the three dicts. The same strings are found in many different triples, different copies are created, and compared with hash and ultimately string comparisons.

Keys of any dictionary object are interned, which is the case here (So I do not expect a fundamental change), but it might be worth to do that in advance, so the dictionary would not need this extra-lookup.

A possible implementation would be to "intern" all identifiers when they are created:

class Identifier(Node, str):  # allow Identifiers to be Nodes in the Graph
    def __new__(cls, value):
        value = sys.intern(value) # New code. Maybe the string is in the interned cache.
        return str.__new__(cls, value)    

Theoretically, it could divide by three the number of dict lookup done in Memory.add. Therefore, a more local implementation would be to "intern" the strings only in Memory.add(), but the benefit would not be global.

It seems to be already suggested in 2010, notation3.RDFSink.intern(), but it does not do anything as far as I can see. This would be removed of course.

Any opinion please ?

ghost commented 2 years ago

Nice idea but for the above implementation, run_tests reveals an Exception in the ntriples parser:

    value = sys.intern(
TypeError: can't intern BNode