This speeds up scanning by building an in-memory prefix tree, then generating a single iterable that is passed to executemany so that all rows are inserted in a single transaction. The difference isn't very noticeable on smaller sets, but larger sets become dramatically faster.
A side effect of this is that entries that may have previously been missed under an unsplittable root are now correctly marked as such.
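The general shape of the approach described above can be sketched as follows. This is only an illustration, not the code from this change: it assumes Python's standard sqlite3 module as the database, and the files table, its columns, and the names Node, build_tree, iter_rows, and store are all hypothetical.

```python
import sqlite3
from typing import Dict, Iterable, Iterator, Optional, Tuple


class Node:
    """One path component in the in-memory prefix tree."""

    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.size: Optional[int] = None  # set only on nodes that are files


def build_tree(entries: Iterable[Tuple[str, int]]) -> Node:
    """Insert every (path, size) pair into a tree keyed by path component."""
    root = Node()
    for path, size in entries:
        node = root
        for part in path.strip("/").split("/"):
            node = node.children.setdefault(part, Node())
        node.size = size
    return root


def iter_rows(node: Node, prefix: str = "") -> Iterator[Tuple[str, int]]:
    """Lazily walk the tree, yielding one (path, size) row per file."""
    for name, child in sorted(node.children.items()):
        path = f"{prefix}/{name}"
        if child.size is not None:
            yield (path, child.size)
        yield from iter_rows(child, path)


def store(conn: sqlite3.Connection, root: Node) -> int:
    """Insert all rows with one executemany call inside one transaction."""
    # Hypothetical schema, used only for this sketch.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, size INTEGER)"
    )
    with conn:  # commits once on success, rolls back on error
        conn.executemany(
            "INSERT INTO files (path, size) VALUES (?, ?)", iter_rows(root)
        )
    return conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]
```

Because iter_rows is a generator, the rows are produced on demand while executemany consumes them, so no second full list of rows is built alongside the tree. Committing once at the end avoids paying the per-transaction cost for every row, which is typically where large scans lose most of their time.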
For some rough performance testing, I used two real file sets, a small one (18k files) and a large one (500k files), and measured the wall time and the max RSS of each run via time -v. The find command mentioned below is find <directory> -depth -type f -printf %s:%p\\n>/dev/null, which serves as a reference for the "ideal" baseline. Every scan was run three times and only the best values were kept, to account for caching.