etsap-TIMES / xl2times

Open source tool to convert TIMES models specified in Excel
https://xl2times.readthedocs.io/
MIT License

Cache extracted EmbeddedXlTables based on xlsx file hashes #196

Closed · siddharth-krishna closed 7 months ago

siddharth-krishna commented 7 months ago

This PR adds a feature to cache the extracted EmbeddedXlTables from input XLSX files by default (and removes the --use_pkl flag). This is because on some bigger benchmarks openpyxl takes ages to extract the tables, and the xl2times developer workflow rarely involves changing the XLSX files.

The cache is stored in xl2times/.cache/ and keys files on the hash of the XLSX file. It also checks the filename (and could be extended to check modification time) to avoid hash collisions. A future improvement could be to cap the size of the cache directory.
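For illustration, here is a minimal sketch of the idea, assuming a pickle-based cache keyed on content hash plus filename (the function and variable names are hypothetical, not the actual xl2times code):

```python
import hashlib
import pickle
from pathlib import Path

# Illustrative cache location; the PR stores the cache under xl2times/.cache/
CACHE_DIR = Path("xl2times") / ".cache"

def file_hash(path: Path) -> str:
    """SHA-256 hex digest of the file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def extract_tables_cached(xlsx_path: Path, extract_tables):
    """Return extracted tables, using the cache when the file is unchanged.

    The cache key includes the filename as well as the content hash,
    as an extra guard against hash collisions.
    """
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_file = CACHE_DIR / f"{xlsx_path.name}-{file_hash(xlsx_path)}.pkl"
    if cache_file.exists():
        with open(cache_file, "rb") as f:
            return pickle.load(f)
    tables = extract_tables(xlsx_path)  # the slow openpyxl extraction
    with open(cache_file, "wb") as f:
        pickle.dump(tables, f)
    return tables
```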

~~Change in runtime appears to be about 50% on CI (note that for this PR's CI, the cache is created when the tool is first run on the PR branch, and then the cache is used when it is run on main).~~ (See comment below for more accurate performance implications.)


There's also a new flag, --no_cache, to ignore the cache directory and read the files from scratch.

@SamRWest just tagging you so you are aware of this (potentially confusing) behaviour change.

siddharth-krishna commented 7 months ago

CI is failing because there is no cache present yet, and the regression tests run first on the PR branch and then on the main branch -- the main branch uses the cache created by the PR branch, so it is much faster!

When I get time, I'm planning to stop treating time-based regressions as a CI failure, and instead have CI comment on the PR with the difference in time if it is significant (time measurements are not very accurate anyway, since we run benchmarks in parallel). That way we can see the difference in time and decide whether to ignore or investigate as appropriate.

olejandro commented 7 months ago

Cool, thanks! Do I just merge it then?

siddharth-krishna commented 7 months ago

Not yet, I need to fix the cache in the GitHub Actions

siddharth-krishna commented 7 months ago

It looks like I was mistaken about the performance implications. Here are the total runtimes of running all the benchmarks on my laptop:

```
main            812s
PR, first run   871s
PR, second run  658s
```

So the act of hashing and creating the cache adds 59s (7%) to the first run, but once the cache exists, subsequent runs are 154s (19%) faster than main. I'm going to merge this, assuming that tradeoff is useful for now, but happy to discuss whether we want caching to be the default behaviour and iterate on this further.

olejandro commented 7 months ago

Thanks @siddharth-krishna! I believe this will definitely be useful, also when using the tool for scenario analysis. I guess, later on, we could develop this further to skip some of the transforms depending on what's changed?

@SamRWest how long does it take to read AusTIMES?

olejandro commented 7 months ago

In the case of the TIMES-US Model (TUSM), reading the files takes about 6 minutes (ca. 25% of the total processing time). Although process_wildcards takes most of the time (63%), caching gives the overall processing time a nice boost!

SamRWest commented 7 months ago

Nice - you beat me to it! I was considering adding basically this feature to speed up my AusTIMES runs :) Thanks!

It might be worth trying caching with parquet files instead of pickles in future; they're usually lightning fast to read/write compared to pickles (although 0.8s is probably fast enough for me :) ), and the file sizes are much smaller. The downside is that you'd have to save one file per table.
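A sketch of that one-parquet-file-per-table variant, assuming each extracted table is a pandas DataFrame and pyarrow is installed (function names are hypothetical):

```python
from pathlib import Path

import pandas as pd

def save_tables_parquet(tables: dict[str, pd.DataFrame], cache_dir: Path) -> None:
    """Write each extracted table to its own parquet file."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    for name, df in tables.items():
        df.to_parquet(cache_dir / f"{name}.parquet")

def load_tables_parquet(cache_dir: Path) -> dict[str, pd.DataFrame]:
    """Read all cached tables back, one file per table."""
    return {p.stem: pd.read_parquet(p) for p in cache_dir.glob("*.parquet")}
```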

Another potential issue is that the .cache/ dir gets created inside the xl2times module (i.e. beside xl2times/__init__.py). This seems (to me) a bit unusual. Would it be better to create it in the user's home dir (e.g. pathlib.Path.home() / '.xl2times/cache')? That seems to be where most apps create their caches now - at least I seem to have lots of caches there.
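A short sketch of that per-user layout (purely illustrative; the PR as written keeps the cache inside the package directory):

```python
from pathlib import Path

# Hypothetical per-user cache directory, following the common
# ~/.appname convention mentioned above.
cache_dir = Path.home() / ".xl2times" / "cache"
cache_dir.mkdir(parents=True, exist_ok=True)
```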

Speeds for AusTIMES are:

With caching, first run: Extracted (potentially cached) 555 tables, 34360 rows in 0:01:32.051419
With caching, second run: Extracted (potentially cached) 555 tables, 34360 rows in 0:00:00.842077

Nice speedup :)

siddharth-krishna commented 7 months ago

Great to hear this is useful! Thanks for the suggestions, I've moved the discussion to an issue so we don't lose track of it: #199