etsap-TIMES / xl2times

Open source tool to convert TIMES models specified in Excel
https://xl2times.readthedocs.io/
MIT License

XLSX cache improvements #199

Open siddharth-krishna opened 4 months ago

siddharth-krishna commented 4 months ago

#196 introduced a cache of the EmbeddedXlTables extracted from XLSX files to work around openpyxl's slow reading of certain Excel files. There are some ways it can be improved:

@SamRWest you're right, I wasn't sure where to put the cache directory. I'm also used to caches living in ~/.cache/ but then I wasn't sure how that translated to Windows.

SamRWest commented 4 months ago

> I'm also used to caches living in ~/.cache/ but then I wasn't sure how that translated to Windows.

This has been standard on Windows for a while now too, which is handy. ~ ends up being c:\Users\<username> and pathlib.Path.home() is a cross-platform equivalent to ~.
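As a rough illustration of the `pathlib.Path.home()` suggestion, a cache location could be resolved like this. This is only a sketch: the helper name `cache_dir_for`, its parameters, and the `xl2times` subdirectory name are assumptions made for the example, not what the tool currently does.

```python
from pathlib import Path
from typing import Optional


def cache_dir_for(app_name: str, home: Optional[Path] = None, create: bool = False) -> Path:
    """Return <home>/.cache/<app_name>, optionally creating it.

    `home` defaults to Path.home(), which resolves to the user's home
    directory on Linux, macOS and Windows alike.
    """
    base = Path.home() if home is None else home
    cache_dir = base / ".cache" / app_name
    if create:
        # parents=True also creates ~/.cache itself if it is missing
        cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir


# Hypothetical usage: the subdirectory name is an assumption
print(cache_dir_for("xl2times"))
```

Keeping directory creation behind a flag means the path can be computed (and logged or tested) without touching the filesystem.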

> Also check file modification time to be ultra sure there are no hash collisions?

If the file has been modified but its contents (and thus its content-only hash) haven't changed, the cached result is still valid, so this is probably unnecessary. The chance of an accidental collision of the file contents is infinitesimal.
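To make the point concrete, here is a minimal sketch of a content-only hash. The function name `content_hash` and the choice of SHA-256 are assumptions for illustration; the thread does not say which hash the cache actually uses.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash only the bytes of the file, ignoring name and timestamps."""
    digest = hashlib.sha256()  # assumption: any strong hash would do here
    with open(path, "rb") as f:
        # Read in chunks so large workbooks are not loaded into memory at once
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo: changing the modification time alone leaves the hash unchanged
with tempfile.TemporaryDirectory() as tmp:
    f = Path(tmp) / "model.xlsx"  # hypothetical file name
    f.write_bytes(b"same bytes")
    before = content_hash(f)
    os.utime(f, (0, 0))  # reset access/modification times
    after = content_hash(f)
    print(before == after)
```

Because the key depends only on the bytes, a file that is re-saved or copied without changes still hits the cache, and adding the modification time to the key would only cause extra misses.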

Re: parquet - your pickle mechanism is actually very fast, and now that I've used it for a bit, the cache is small enough that I doubt this will really become a problem, as long as we clean up old files at some point.
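One possible shape for the "clean up old files" step is an age-based prune, sketched below. The helper name `prune_cache`, the 30-day default, and the use of modification time as the age signal are all assumptions for illustration, not an agreed design.

```python
import time
from pathlib import Path
from typing import List, Optional


def prune_cache(cache_dir: Path, max_age_days: float = 30.0,
                now: Optional[float] = None) -> List[Path]:
    """Delete regular files in cache_dir older than max_age_days.

    Returns the list of removed paths. `now` can be injected for testing.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    for entry in cache_dir.iterdir():
        # Only top-level regular files are considered; subdirectories are left alone
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            entry.unlink()
            removed.append(entry)
    return removed
```

Running something like this at startup would keep the cache bounded without needing a separate index of entries.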