RDFLib / rdflib-sqlalchemy

RDFLib store using SQLAlchemy dbapi as back-end

size of sqlite database #66

Open rokroskar opened 4 years ago

rokroskar commented 4 years ago

Hi, thanks for this very useful rdflib plugin! I am running some tests and comparisons and have noticed that using SQLite results in very large database files. I have a graph of ~10k triples that serializes to ~2MB on disk as RDF/XML, but the SQLite database is almost 14MB. Is this expected, or is there a setup step I'm missing that would make the database size more reasonable? Thanks!

mwatts15 commented 4 years ago

Hard to say for sure without your source data why your database file is as large as it is. I did a test of adding exactly 10,000 triples with 10,000 distinct subjects, 100 distinct predicates, and 1000 distinct objects and got a DB file size of 4.7MB. Repeated with all distinct sub, pred, obj and that only increased to 4.8 MB. I'm able to increase that to 19MB+ just by using longer URIs.

sqlite3 version: 3.32.2
Python version: 3.7.4
rdflib-sqlalchemy version: 0.4.0
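
For anyone who wants to see the URI-length effect in isolation, here is a rough sketch using only Python's standard `sqlite3` module. It is not the test described above and it does not use rdflib-sqlalchemy: the single `triples` table, its indexes, the `db_size_for` helper, and the example URIs are all assumptions made for illustration, so the absolute sizes will not match the numbers quoted in this thread.

```python
import os
import sqlite3
import tempfile


def db_size_for(uri_prefix, n_triples=10_000):
    """Write n_triples rows of full-text URIs to a fresh SQLite file; return its size in bytes."""
    path = os.path.join(tempfile.mkdtemp(), "triples.db")
    conn = sqlite3.connect(path)
    # Hypothetical, simplified schema: NOT the tables rdflib-sqlalchemy creates.
    conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
    # An index on a text column stores its own copy of every indexed string.
    for col in ("subject", "predicate", "object"):
        conn.execute(f"CREATE INDEX idx_{col} ON triples ({col})")
    # Same shape as the test above: 10,000 subjects, 100 predicates, 1000 objects.
    rows = (
        (f"{uri_prefix}s/{i}", f"{uri_prefix}p/{i % 100}", f"{uri_prefix}o/{i % 1000}")
        for i in range(n_triples)
    )
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()
    return os.path.getsize(path)


short_size = db_size_for("http://ex.org/")
long_size = db_size_for("http://example.org/" + "very/long/path/segment/" * 8)
print(f"short URIs: {short_size / 1e6:.1f} MB, long URIs: {long_size / 1e6:.1f} MB")
```

Because every row stores the full strings, and each index stores them again, the file grows roughly in proportion to URI length even when the number of triples stays the same.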

rokroskar commented 4 years ago

Thanks for the quick response @mwatts15 - there certainly might be some long(ish) URIs in my data. I'm wondering if there are any indexing options available to mitigate this problem?

mwatts15 commented 4 years ago

If by options you mean a flag you can specify that will create a table mapping strings to more compact identifiers, there is no such thing in rdflib-sqlalchemy, nor, as far as I have seen, is there any sqlite extension that does something similar. If you would like to implement such a feature, I would certainly be open to merging it.
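
To make the idea concrete, a table mapping strings to compact identifiers could look something like the sketch below. This is plain `sqlite3` from the Python standard library, not rdflib-sqlalchemy code, and the `terms` and `triples` tables and the `term_id`, `add_triple`, and `all_triples` helpers are hypothetical names chosen for illustration. Each distinct URI is stored once in `terms`, and `triples` holds only integer ids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per distinct string, and triples made of integer ids.
conn.executescript("""
    CREATE TABLE terms (id INTEGER PRIMARY KEY, value TEXT UNIQUE NOT NULL);
    CREATE TABLE triples (
        subject   INTEGER NOT NULL REFERENCES terms(id),
        predicate INTEGER NOT NULL REFERENCES terms(id),
        object    INTEGER NOT NULL REFERENCES terms(id)
    );
""")


def term_id(conn, value):
    """Return the integer id for a string, inserting it on first use."""
    conn.execute("INSERT OR IGNORE INTO terms (value) VALUES (?)", (value,))
    row = conn.execute("SELECT id FROM terms WHERE value = ?", (value,)).fetchone()
    return row[0]


def add_triple(conn, s, p, o):
    """Store a triple as three integer ids instead of three strings."""
    conn.execute(
        "INSERT INTO triples VALUES (?, ?, ?)",
        (term_id(conn, s), term_id(conn, p), term_id(conn, o)),
    )


def all_triples(conn):
    """Join the ids back to their strings, in insertion order."""
    return conn.execute("""
        SELECT s.value, p.value, o.value
        FROM triples t
        JOIN terms s ON s.id = t.subject
        JOIN terms p ON p.id = t.predicate
        JOIN terms o ON o.id = t.object
        ORDER BY t.rowid
    """).fetchall()


add_triple(conn, "http://example.org/alice", "http://example.org/knows", "http://example.org/bob")
add_triple(conn, "http://example.org/bob", "http://example.org/knows", "http://example.org/alice")
n_terms = conn.execute("SELECT COUNT(*) FROM terms").fetchone()[0]
stored = all_triples(conn)
```

The trade-off is that every read needs joins back to `terms`, and every write needs a lookup first, so this exchanges some query cost for less repeated string storage.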