exponential-decay / demystify

Engine for analysis of Siegfried export files and DROID CSV. The tool has three purposes: to break the export into its components and store them within a SQLite database; to create additional columns to augment the output where useful; and to query the SQLite database, outputting results in a readable form useful for analysis by researchers and archivists within digital preservation departments in memory institutions. The tool will find duplicates, unidentified files, blacklisted objects, character encoding issues, and more.
http://www.openplanetsfoundation.org/blogs/2014-06-03-analysis-engine-droid-csv-export
zlib License
23 stars, 5 forks

create_id_breakdown is a bottleneck and needs rewriting #77

Open ross-spencer opened 2 years ago

ross-spencer commented 2 years ago

Given an 8 million line SF YAML (a 631,286-row database), create_id_breakdown is taking too long. It is largely unoptimized and not brilliantly written. I believe any rewrite should bring pretty decent efficiency gains. Let's have a look at what we can do.

Edit: For reference, with this function alone removed, the script is over an hour quicker and completes in 77 seconds. There may be other bottlenecks along the way, as much relies on the output here, but one step at a time.

NB. The rewrite could focus on making better use of SQLite queries, which do not seem to be a bottleneck at all, or on improving the data structures we're using.
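
To illustrate the first option, here is a minimal sketch of moving a per-row Python loop into a single aggregate query. It does not reproduce what create_id_breakdown actually does: the table `file_ids`, its columns `file_id` and `method`, and both function names are hypothetical, chosen only to show the pattern.

```python
import sqlite3
from collections import defaultdict


def build_demo_db():
    """Create a tiny in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE file_ids (file_id INTEGER, method TEXT)")
    conn.executemany(
        "INSERT INTO file_ids VALUES (?, ?)",
        [
            (1, "Signature"),
            (2, "Signature"),
            (3, "Extension"),
            (4, "Extension"),
            (5, "None"),
        ],
    )
    return conn


def breakdown_per_row(conn):
    """Slow pattern: one query per file, with the counting done in Python."""
    counts = defaultdict(int)
    file_ids = [row[0] for row in conn.execute("SELECT DISTINCT file_id FROM file_ids")]
    for file_id in file_ids:
        rows = conn.execute("SELECT method FROM file_ids WHERE file_id = ?", (file_id,))
        for (method,) in rows:
            counts[method] += 1
    return dict(counts)


def breakdown_grouped(conn):
    """Same result from a single query: SQLite does the grouping and counting."""
    return dict(conn.execute("SELECT method, COUNT(*) FROM file_ids GROUP BY method"))


if __name__ == "__main__":
    conn = build_demo_db()
    print(breakdown_per_row(conn))
    print(breakdown_grouped(conn))
```

Both functions return the same mapping of identification method to count. The grouped version issues one query regardless of how many rows there are, which is where the gain would come from on a database of this size, assuming the real function follows a similar per-row pattern.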

ross-spencer commented 2 years ago

There's a spare index described here: https://github.com/exponential-decay/sqlitefid/issues/9 that might be worth looking into for performance.
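
As a rough way to check whether an index would help, the sketch below compares SQLite's query plan before and after creating one. The table `file_ids`, the column `file_id`, the index name, and the `query_plan` helper are all hypothetical and are not taken from the sqlitefid schema or the linked issue.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params)
    return [row[-1] for row in rows]


# Hypothetical table standing in for whatever the real lookup targets.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file_ids (file_id INTEGER, method TEXT)")
conn.executemany(
    "INSERT INTO file_ids VALUES (?, ?)",
    [(n, "Signature") for n in range(1000)],
)

lookup = "SELECT method FROM file_ids WHERE file_id = ?"

# Without an index SQLite has to scan the whole table for each lookup.
before = query_plan(conn, lookup, (1,))

# With an index on the lookup column it can search the index instead.
conn.execute("CREATE INDEX IF NOT EXISTS idx_file_ids_file_id ON file_ids (file_id)")
after = query_plan(conn, lookup, (1,))

print(before)
print(after)
```

If the plan after the change names the index, repeated lookups on that column no longer need a full table scan. Whether that applies here depends on which queries create_id_breakdown actually runs.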