According to this SQLite page, the following SQL can be used to estimate how 'fragmented' the DB is and whether you need to call `VACUUM`:
```sql
CREATE TEMP TABLE s(rowid INTEGER PRIMARY KEY, pageno INT);
INSERT INTO s(pageno) SELECT pageno FROM dbstat ORDER BY path;
SELECT sum(s1.pageno+1==s2.pageno)*1.0/count(*)
FROM s AS s1, s AS s2
WHERE s1.rowid+1=s2.rowid;
DROP TABLE s;
```
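For convenience, here is a minimal sketch running the same query from Python via the standard `sqlite3` module. The path `packs.idx` is just an example, and note that the `dbstat` virtual table is only available if SQLite was compiled with `SQLITE_ENABLE_DBSTAT_VTAB`:

```python
import sqlite3

def fragmentation_metric(db_path):
    """Return the fraction of page pairs that are stored sequentially.

    Values close to 1.0 mean the file is mostly defragmented.
    Requires SQLite compiled with SQLITE_ENABLE_DBSTAT_VTAB.
    """
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.cursor()
        cur.execute("CREATE TEMP TABLE s(rowid INTEGER PRIMARY KEY, pageno INT)")
        cur.execute("INSERT INTO s(pageno) SELECT pageno FROM dbstat ORDER BY path")
        cur.execute(
            "SELECT sum(s1.pageno+1==s2.pageno)*1.0/count(*) "
            "FROM s AS s1, s AS s2 WHERE s1.rowid+1=s2.rowid"
        )
        (value,) = cur.fetchone()
        cur.execute("DROP TABLE s")
        return value
    finally:
        conn.close()

print(fragmentation_metric("packs.idx"))  # hypothetical path to the pack index
```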
As a reference, here is the value on the big SDB DB (~6.8M entries, 1.2 GB) before VACUUMing:
To vacuum a `Container` `c`, this can be done:

```python
# Get the underlying SQLAlchemy engine from the container's cached session
engine = c._get_cached_session().get_bind()
# Run VACUUM to defragment the SQLite file (and reclaim free pages)
engine.execute("VACUUM")
```
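As a side note, `Engine.execute` was removed in SQLAlchemy 2.0, so on newer versions something like the following sketch should be equivalent; `VACUUM` cannot run inside a transaction, hence the autocommit isolation level:

```python
from sqlalchemy import text

engine = c._get_cached_session().get_bind()
# VACUUM must run outside a transaction; use driver-level autocommit
with engine.connect().execution_options(isolation_level="AUTOCOMMIT") as conn:
    conn.execute(text("VACUUM"))
```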
When creating a DB a bit at a time, the index ends up scattered across the file. On a big DB, if the DB file is not in the OS disk cache, this means a huge performance hit (~1000 it/s instead of ~600000 it/s) on any listing that uses the covering index on the hash key, e.g. `SELECT hashkey from db_object ORDER BY hashkey`. As a test, perform the query above right after flushing the caches with `sudo su -c "echo 3 > /proc/sys/vm/drop_caches"`.
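A minimal sketch of how such it/s figures can be measured (the path `packs.idx` is an assumption); run it once on a cold cache and once on a warm one to see the difference:

```python
import sqlite3
import time

conn = sqlite3.connect("packs.idx")  # hypothetical path to the pack index
cursor = conn.execute("SELECT hashkey FROM db_object ORDER BY hashkey")

start = time.monotonic()
count = 0
for _ in cursor:  # iterate over all rows, simulating a full listing
    count += 1
elapsed = time.monotonic() - start
print(f"{count} rows in {elapsed:.1f} s -> {count / elapsed:,.0f} it/s")
```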
Two things restore the speed:

- `cat packs.idx > /dev/null` will make all the rest of the operations (like listing ordered by `hashkey`) fast again (~600000 it/s vs ~1000 it/s);
- `VACUUM` on the DB, as this will defragment the DB and the indexes (note: the content of the SQLite file, not how it is written on the filesystem). This is also needed to reclaim space after deleting entries.

Note that the first iteration of `SELECT hashkey from db_object ORDER BY hashkey` will run at ~270000 it/s instead of the ~600000 it/s of later (cached) runs, but that is already fast enough. Also, looking at how much data ends up in the OS disk cache, it seems that only about 550 MB of the 1.2 GB SQLite file actually needs to be read and kept in the cache (probably the size of the index on `hashkey`).
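For completeness, a sketch of the same pre-fetch done from Python rather than with `cat` (path again assumed to be `packs.idx`):

```python
# Warm the OS page cache by reading the index file sequentially,
# the equivalent of `cat packs.idx > /dev/null`
with open("packs.idx", "rb") as handle:
    while handle.read(16 * 1024 * 1024):  # read 16 MB chunks until EOF
        pass
```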
Moreover, `SELECT count(*) from db_object` decides to use the index on the hash key, which (if the DB is 'fragmented') is very slow, as above; it is much faster to fetch `SELECT id from db_object ORDER BY id` and count the results in Python. Therefore we can `cat` the whole file to pre-fetch the data (but only if it all fits in RAM!).

Mentioning also #92, as the performance when looping over results sorted by hash key will depend strongly on either caching the file first or VACUUMing it (which, however, is slow).
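A sketch of the two counting strategies side by side (assuming the index file is `packs.idx`); the Python-side count walks the sequential `id` primary key instead of the possibly scattered hash-key index:

```python
import sqlite3

conn = sqlite3.connect("packs.idx")  # hypothetical path

# May be slow on a fragmented DB: SQLite can choose the hashkey index
(count_sql,) = conn.execute("SELECT count(*) FROM db_object").fetchone()

# Usually fast even when fragmented: scan the integer primary key in order
count_py = sum(1 for _ in conn.execute("SELECT id FROM db_object ORDER BY id"))

assert count_sql == count_py
```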