create_id_breakdown is a bottleneck and needs rewriting

exponential-decay / demystify

Engine for analysis of Siegfried export files and DROID CSV. The tool has three purposes: to break the export into its components and store them within a SQLite database; to create additional columns to augment the output where useful; and to query the SQLite database, outputting results in a readable form useful for analysis by researchers and archivists within digital preservation departments in memory institutions. The tool will find duplicates, unidentified files, blacklisted objects, character encoding issues, and more.

zlib License


Given an 8 million line SF YAML export (a 631,286 row database), create_id_breakdown is taking too long. It is largely unoptimized and not brilliantly written, so I believe any rewrite should bring pretty decent efficiency gains. Let's have a look at what we can do.

Edit: For reference, with this function alone removed, the script is quicker by over an hour and completes in 77 seconds. There may be other bottlenecks along the way, as much relies on the output here, but one step at a time.
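To confirm where the time goes before rewriting anything, the function can be profiled in isolation. The sketch below is only an illustration: `create_id_breakdown_stub` is a hypothetical stand-in with a deliberately slow list-membership check, not the real function, and `profile_call` is a helper name made up for this example.

```python
import cProfile
import io
import pstats


def create_id_breakdown_stub(rows):
    # Hypothetical stand-in for the real function: de-duplicates rows
    # using a list, so each membership test is a linear scan.
    seen = []
    for row in rows:
        if row not in seen:
            seen.append(row)
    return seen


def profile_call(func, *args):
    # Run func under cProfile and return its result together with a
    # text report of the five most expensive calls by cumulative time.
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


result, report = profile_call(
    create_id_breakdown_stub, [i % 50 for i in range(5000)]
)
print(report)
```

Running the same wrapper around the real function on a smaller export should show whether the time is spent in Python loops or in database calls.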

NB. A rewrite could focus on making better use of SQLite queries, which do not seem to be a bottleneck at all, or it could focus on improving the data structures we're using.
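As a rough illustration of those two options, here is a minimal sketch that counts identifications per method, once in Python with a dictionary and once with a `GROUP BY` in SQLite. The table `id_results`, its columns, and both function names are assumptions made up for this example; they are not the actual demystify schema or code.

```python
import sqlite3
from collections import defaultdict

# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE id_results (file_id INTEGER, method TEXT)")
rows = [
    (1, "Signature"),
    (1, "Extension"),
    (2, "Extension"),
    (3, "None"),
    (4, "Signature"),
]
conn.executemany("INSERT INTO id_results VALUES (?, ?)", rows)


def breakdown_in_python(conn):
    # Option 1: pull the rows once and count with a dict, so each
    # update is a constant-time lookup rather than a list scan.
    counts = defaultdict(int)
    for (method,) in conn.execute("SELECT method FROM id_results"):
        counts[method] += 1
    return dict(counts)


def breakdown_in_sql(conn):
    # Option 2: let SQLite do the aggregation and return only the totals.
    query = "SELECT method, COUNT(*) FROM id_results GROUP BY method"
    return dict(conn.execute(query).fetchall())


print(breakdown_in_python(conn))
print(breakdown_in_sql(conn))
```

Either approach touches each row once, so which one is faster on the real database would need measuring.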


create_id_breakdown is a bottleneck and needs rewriting #77