GSS-Cogs / databaker

Command line tool to convert spreadsheets to databases, made for the UK's Office for National Statistics.

Issue 12/refactor for performance #15

Closed mikeAdamss closed 3 years ago

mikeAdamss commented 3 years ago

Rewrites the databaker lookup functionality.

We're replacing the generic (and poorly scaling) cell lookups with much faster "engines", one per lookup type (a Directly Engine, a Closest Engine, a Constant Engine), where each engine is a class optimised for performance and scaling within its own niche.
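As a rough illustration of the idea (not the code in this PR), a per-lookup-type engine can build its index once and then answer each lookup cheaply. Everything named here (`Cell`, `ConstantEngine`, `DirectlyEngine`, `ClosestEngine`, `lookup`) is hypothetical and only assumes that a "directly" lookup means "header in the same column" and a "closest" lookup means "nearest header at or above this row".

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Hypothetical stand-in for a spreadsheet cell (not databaker's cell type)."""
    x: int  # column index
    y: int  # row index
    value: str


class ConstantEngine:
    """Every observation resolves to the same fixed value."""

    def __init__(self, value):
        self.value = value

    def lookup(self, ob):
        return self.value


class DirectlyEngine:
    """The header sits in the same column as the observation."""

    def __init__(self, headers):
        # Index once by column so each lookup is a single dict access.
        self._by_column = {h.x: h for h in headers}

    def lookup(self, ob):
        return self._by_column[ob.x].value


class ClosestEngine:
    """The header is the nearest one at or above the observation's row."""

    def __init__(self, headers):
        # Sort once; each lookup is then a binary search rather than a scan.
        self._headers = sorted(headers, key=lambda h: h.y)
        self._rows = [h.y for h in self._headers]

    def lookup(self, ob):
        i = bisect_right(self._rows, ob.y) - 1
        if i < 0:
            raise LookupError(f"no header at or above row {ob.y}")
        return self._headers[i].value


# Usage: one engine per dimension, each queried with the same observation cell.
years = DirectlyEngine([Cell(1, 0, "2019"), Cell(2, 0, "2020")])
regions = ClosestEngine([Cell(0, 1, "North"), Cell(0, 4, "South")])
unit = ConstantEngine("GBP")

ob = Cell(2, 5, "42")
result = (years.lookup(ob), regions.lookup(ob), unit.lookup(ob))
# result == ("2020", "South", "GBP")
```

The point of the split is that each class only has to be fast for one access pattern, so the per-observation cost stays roughly constant (or logarithmic for the closest case) as the sheet grows.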

Rough performance comparisons follow (rough because the real numbers will vary with the structure of the sheet, but the comparative gains should be along these lines).

| Rows in sheet | Our current branch | Refactor (this PR) | Loading the "tabs" into databaker |
| --- | --- | --- | --- |
| 60,000 | 00:02:02 | 00:00:26 | 00:00:30 |
| 125,000 | 00:06:43 | 00:00:51 | 00:01:05 |
| 250,000 | 00:26:27 | 00:01:48 | 00:02:07 |
| 500,000 | 02:17:41 | 00:03:54 | 00:04:20 |
| 1,000,000 | 19:00:00 (gave up at) | 00:08:18 | 00:09:40 |
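To put the comparison in relative terms, the timings above can be turned into speedup factors. This is a small sketch that assumes the first time on each row is the current branch and the second is the refactor, read as hh:mm:ss; `to_seconds`, `timings` and `speedups` are names made up for the illustration.

```python
def to_seconds(hms):
    """Convert an hh:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# (rows in sheet, current branch, refactor), copied from the figures above.
# The 1,000,000-row run is left out because the current branch never finished.
timings = [
    (60_000, "00:02:02", "00:00:26"),
    (125_000, "00:06:43", "00:00:51"),
    (250_000, "00:26:27", "00:01:48"),
    (500_000, "02:17:41", "00:03:54"),
]

speedups = {
    rows: round(to_seconds(old) / to_seconds(new), 1)
    for rows, old, new in timings
}
# speedups == {60000: 4.7, 125000: 7.9, 250000: 14.7, 500000: 35.3}
```

The refactor's time roughly doubles each time the row count doubles, while the current branch's time grows several-fold per doubling, which is why the gap keeps widening.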

Also added a bunch of tests, took out some cruft and added a few friendlier exceptions.