RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

rdflib 5.0.0 to 6.2.0: pickled graph size increased #2184

Open cbartz opened 1 year ago

cbartz commented 1 year ago

We currently use rdflib to process our thesaurus STW. We use the Graph.parse method to parse the corresponding RDF/XML file, and the resulting graph is pickled (either with pickle from the stdlib or with joblib) when we save different types of machine learning models/objects.

I noticed that the size of the models increased a lot when updating from 5.0.0 to 6.2.0. I could recreate this behavior with a simple script:

from pathlib import Path
from pickle import dump

from rdflib import Graph

STW_PATH = Path("/path/to/stw_9.12.rdf")
OUTPUT = Path("/tmp/output_x.y.z.pickle")

if __name__ == '__main__':
    g = Graph()
    g.parse(str(STW_PATH))

    with OUTPUT.open("wb") as f:
        dump(g, f)

$ ll -h /tmp/*.pickle
-rw-rw-r-- 1 1000 1000 7,2M Dez 20 09:33 /tmp/output_5.0.0.pickle
-rw-rw-r-- 1 1000 1000  21M Dez 20 09:35 /tmp/output_6.2.0.pickle

The STW RDF file itself is 15 MB.
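To narrow down where the extra bytes come from, it may help to compare pickle protocols and to check how much of the size is plain redundancy that compression removes. The following is only a rough sketch using the standard library; `pickled_sizes` is a hypothetical helper name, and the list of string tuples is a stand-in for a parsed graph, not real STW data.

```python
import gzip
import pickle


def pickled_sizes(obj):
    """Return the byte size of `obj` pickled in several ways (hypothetical helper)."""
    sizes = {}
    # Protocols 2 and up are binary; newer ones are often more compact.
    for protocol in range(2, pickle.HIGHEST_PROTOCOL + 1):
        sizes[f"protocol {protocol}"] = len(pickle.dumps(obj, protocol=protocol))
    # Compress the highest-protocol pickle to see how redundant the payload is.
    raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    sizes["highest protocol + gzip"] = len(gzip.compress(raw))
    return sizes


if __name__ == "__main__":
    # Stand-in for a parsed graph: many tuples of repetitive URI-like strings.
    sample = [
        (f"http://example.org/s{i}", "http://example.org/p", f"value {i}")
        for i in range(1000)
    ]
    for label, size in pickled_sizes(sample).items():
        print(f"{label}: {size} bytes")
```

Passing the parsed `Graph` instead of `sample` under both rdflib versions should show whether the increase is tied to the pickle protocol or to what the graph object itself stores.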

I wanted to report this behavior because it is clearly a step backwards in terms of disk space used and may be relevant to others as well. I'm not sure, though, whether serialization by pickling is something you support: I could only find one chapter in your docs, about saving RDF in human-readable formats. However, loading a pickled graph is much faster than parsing one, which is relevant when the graph is used in a production system where launch times matter.

mielvds commented 1 year ago

We also used to pickle Graph objects indirectly through the Prefect workflow framework, and it failed to serialize some graphs. We never found out why it worked for some graphs and not for others; we now use the standard RDF serialization formats, which are indeed slow. I would be interested in contributing an optimized pickle implementation, but I'm not sure what's needed...