catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Question about multiple SQLite databases #551

Closed briannacote closed 4 years ago

briannacote commented 4 years ago

Hi there,

I'm working on creating an SQLite database that has CEMS data. I already have a different SQLite database with EIA 860 and 923 data. Is there anyway to direct my python scripts to a different "pudl_db". It looks like it currently just pulls the default one which is set from running the ETL process. At least that's my understanding. Thanks for the help!

All the Best, Bri

cmgosnell commented 4 years ago

Hi brianna! Most of our system defaults to using one sqlite db, so we haven't built in functionality to interact with multiple dbs, but depending on your use case, you can certainly use multiple db's. Are you looking to generate multiple db's or use the PudlTabl output functionality w/ multiple db's?

The latter isn't too hard. You just give pudl.output.pudltabl.PudlTabl a pudl_engine using a non-standard path. Our standard way to generate a sqlite engine is: pudl_engine = sa.create_engine(pudl_settings["pudl_db"]), but you can point this to another sqlite db.

If you want to generate multiple dbs, we generally recommend establishing multiple "pudl_work" spaces. Here is the documentation for setting up a work space. The simplest way to toggle between multiple default workspaces is to edit the pudl_in and pudl_out in the .pudl.yml file on your home directory. You can also set pudl_in and pudl_out with several of the scripts - the defaults are generated from the .pudl.yml but if you don't want to continually go back and forth the the yml file the arguments are there for you. I often have two pudl work spaces and I just comment one out when I am using the other.

zaneselvans commented 4 years ago

PUDL is set up to operate within a particular "workspace" -- a hierarchy of directories that includes sqlite, data, datapkg, parquet, settings etc. If you want to work with a new and different set of data, you can create a whole new workspace, by running pudl_setup in a new directory somewhere else on your computer. In practice, if I'm playing with more than one set of things (like, testing ETL outputs) I'll often just rename the sqlite/pudl.sqlite file to something like pudl-2020-03-04.sqlite so I know when it came from, and then all the same automatically generated paths still work.

Or like Christina said, depending on what you're wanting to do with the new database you can potentially just pass its new path in to the appropriate function.

How much CEMS data are you looking to work with? If it's a substantial amount, you may want to use the parquet file output instead (see the epacems_to_parquet script). I tried to load all ~1 billion rows of EPA CEMS into SQLite once, and it took 30+ hours (at which point I killed the process). If you're loading a modest amount of CEMS data, you'll still need to also re-load the EIA 860/923 data (which will be generated in the same datapackage output).

Generally at this point we're recommending folks use the archived data packages from Zenodo if possible -- they include the full collection of possible outputs.

briannacote commented 4 years ago

For now I was just using one year of CEMS data so I will go ahead and just combine it into a single SQLite database with the EIA860 and EIA 923. I could eventually be using more which at that time I'll look into the parquet file output.

I am curious though how I would direct to a different database in my script. Currently, these are the two lines in my script hooking up the the "pudl_db":

pudl_settings = pudl.workspace.setup.get_defaults()
pudl_engine = sa.create_engine(pudl_settings["pudl_db"])

What path would need to change?

Thank you both for the help!

zaneselvans commented 4 years ago

If you look at the contents of the pudl_settings dictionary, most of it is just paths. The sa.create_engine() function takes a database URL as an argument, and for SQLite that's just a path to the local file. So you can swap in whatever path you want there (as long as it actually goes to a PUDL DB file).

briannacote commented 4 years ago

I am set for now and will close this. Much appreciated on the quick responses.