Improve scan performance and correctness with a prefix tree

kannibalox commented 1 year ago

This speeds up scanning by building an in-memory prefix tree, then generating a single iterable that is passed to `executemany`, so all the rows are inserted in a single transaction. The difference is barely noticeable on smaller sets, but larger sets become dramatically faster.
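As a rough illustration of the idea (not the code in this PR), the sketch below builds a nested-dict prefix tree from `(path, size)` pairs and feeds a lazy row generator to `sqlite3`'s `executemany` inside one transaction. The `files` table, its columns, and the helpers `build_tree`, `iter_rows`, and `insert_all` are hypothetical names chosen for the example.

```python
import sqlite3
from pathlib import PurePosixPath


def build_tree(entries):
    """Build a nested-dict prefix tree keyed by path component.

    `entries` is an iterable of (path, size) pairs. A file is stored as an
    int (its size); a directory is a dict of its children.
    """
    root = {}
    for path, size in entries:
        parts = PurePosixPath(path).parts
        node = root
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = size
    return root


def iter_rows(node, prefix=()):
    """Lazily yield (directory, name, size) rows by walking the tree."""
    for name, child in node.items():
        if isinstance(child, dict):
            yield from iter_rows(child, prefix + (name,))
        else:
            yield ("/".join(prefix), name, child)


def insert_all(conn, tree):
    """Insert every row with one executemany call in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO files (directory, name, size) VALUES (?, ?, ?)",
            iter_rows(tree),
        )


# Hypothetical schema, used only for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (directory TEXT, name TEXT, size INTEGER)")
tree = build_tree([("a/b/one.bin", 10), ("a/b/two.bin", 20), ("a/three.bin", 30)])
insert_all(conn, tree)
```

Because `iter_rows` is a generator, the rows are never materialized as a separate list; the only large structure held in memory is the tree itself.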

A side effect is that entries under an unsplittable root that may previously have been missed are now correctly marked as such.
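One way a tree walk can make this marking reliable is to carry the unsplittable root down to every descendant, sketched below. This is an assumption about the mechanism, not the PR's implementation: the rule in `is_unsplittable` (a directory containing a `VIDEO_TS` child) is a placeholder, and `iter_marked` is a hypothetical helper.

```python
def is_unsplittable(node):
    """Placeholder rule (an assumption): a directory with a VIDEO_TS child."""
    return "VIDEO_TS" in node


def iter_marked(node, prefix=(), root=None):
    """Yield (path, size, unsplittable_root) for every file in the tree.

    Once a directory is flagged, the same root is passed to every
    descendant, so deeply nested files cannot be skipped.
    """
    if root is None and prefix and is_unsplittable(node):
        root = "/".join(prefix)
    for name, child in node.items():
        path = prefix + (name,)
        if isinstance(child, dict):
            yield from iter_marked(child, path, root)
        else:
            yield ("/".join(path), child, root)


# Nested dicts are directories, ints are file sizes.
tree = {
    "movie": {
        "VIDEO_TS": {"VTS_01_1.VOB": 100, "extras": {"a.ifo": 1}},
        "cover.jpg": 5,
    },
    "loose.txt": 2,
}
marked = list(iter_marked(tree))
```

In this example every file below `movie`, including the nested `extras/a.ifo`, carries `movie` as its root, while `loose.txt` carries `None`.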

For some rough performance testing, I used two real file sets, a small one (18k files) and a large one (500k files), and measured the wall time of each run and the max RSS usage via `time -v`. The `find` command mentioned below is `find <directory> -depth -type f -printf %s:%p\\n>/dev/null`, which provides a reference for the "ideal" baseline. Each scan was run three times and only the best values were kept, to account for caching.

| branch | time (small) | time (large) | max RSS (small) | max RSS (large) |
| --- | --- | --- | --- | --- |
| `find` | 0m 0.06s | 0m 1.85s | N/A | N/A |
| master | 0m 1.30s | 7m 45.35s | 151 MiB | 1217 MiB |
| scan-performance | 0m 1.15s | 0m 30.29s | 158 MiB | 741 MiB |

newadventure079 commented 1 year ago

@JohnDoee Can we get this merged soon?

awinnpii commented 9 months ago

@kannibalox I've implemented this and #45 locally and it's working great!

JohnDoee / autotorrent2

Improve scan performance and correctness with a prefix tree #41