catalyst-cooperative / pudl-catalog

An Intake catalog for distributing open energy system data liberated by Catalyst Cooperative.
https://catalyst.coop/pudl/
MIT License

Automatically disable caching of local data catalog sources #3

Open zaneselvans opened 2 years ago

zaneselvans commented 2 years ago

Reading Parquet files stored on the local filesystem through the current PUDL catalog still triggers caching. This slows things down dramatically and quickly uses an enormous amount of disk space. Especially in development, when we've just generated data locally, it would be nice to work with it through the same mechanism as remote data (the data catalog), but not if that means a bunch of unnecessary caching happening continuously in the background.

Identify a way to disable caching when we're working with local data. Ideally this would be done automatically without the user having to think about it. Maybe it's as simple as making the simplecache:: prefix to urlpath conditional based on the value of PUDL_INTAKE_PATH using Jinja templating features?
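As a rough illustration of the conditional-prefix idea, here is a minimal Python sketch that only adds simplecache:: when the base path looks remote. The helper names (is_local, build_urlpath), the example bucket, and the file name are hypothetical and not part of the catalog; whether the same check can be expressed in the catalog's Jinja templating is still the open question.

```python
from urllib.parse import urlsplit

# Schemes treated as "local" in this sketch (an assumption, not catalog behavior).
LOCAL_SCHEMES = {"", "file"}


def is_local(path: str) -> bool:
    """Return True if the path has no URL scheme or uses file://."""
    scheme = urlsplit(path).scheme
    # A single-letter "scheme" is most likely a Windows drive letter (C:\...).
    return scheme in LOCAL_SCHEMES or len(scheme) == 1


def build_urlpath(base_path: str, filename: str, cache_prefix: str = "simplecache::") -> str:
    """Prepend the caching prefix only when base_path points at remote storage."""
    prefix = "" if is_local(base_path) else cache_prefix
    return f"{prefix}{base_path.rstrip('/')}/{filename}"


# Hypothetical values, standing in for whatever PUDL_INTAKE_PATH is set to.
local_url = build_urlpath("/home/user/pudl-data", "example.parquet")
remote_url = build_urlpath("gs://example-bucket/v1", "example.parquet")
```

With these inputs the local path comes back unchanged apart from the joined file name, while the remote path gains the simplecache:: prefix.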

If that's not possible then maybe caching can be turned off with an argument that's passed to the data source by the user.

zaneselvans commented 2 years ago

Unclear if or how we can do this, and allowing the user to specify cache_method="" is working okay, so I'm going to toss it in the icebox.
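For reference, the effect of passing cache_method="" can be pictured with a small Python sketch. It assumes the catalog builds urlpath by concatenating a cache_method parameter with the value of PUDL_INTAKE_PATH; the render_urlpath function and the file name are hypothetical stand-ins for that templating, not the Intake API.

```python
import os


def render_urlpath(filename, cache_method="simplecache::", env=os.environ):
    """Mimic a template of the form <cache_method><PUDL_INTAKE_PATH>/<filename>."""
    base = env["PUDL_INTAKE_PATH"]
    return f"{cache_method}{base.rstrip('/')}/{filename}"


# Hypothetical local setup: the environment variable points at a local directory.
fake_env = {"PUDL_INTAKE_PATH": "/home/user/pudl-data"}

# Default behavior: the caching prefix is applied even though the data is local.
cached = render_urlpath("example.parquet", env=fake_env)

# User override: an empty cache_method drops the prefix, so nothing is cached.
uncached = render_urlpath("example.parquet", cache_method="", env=fake_env)
```

The override works, but it relies on the user remembering to pass it every time they read local data.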