catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

convert testing to data packages #356

Closed cmgosnell closed 4 years ago

cmgosnell commented 4 years ago

I already got started on this with a preliminary transition of the Travis tests, but it will probably need tweaking, and many of the init_db tests will need to be removed.

cmgosnell commented 4 years ago

Are the data validation tests working, @alanawlsn @zaneselvans? If I've converted the PudlTabl functions to point to the new SQLite DB, do either of you know what else needs to be done to make the validations functional, or whether there are any special considerations I should keep in mind for what y'all have planned with the visualization notebooks?

zaneselvans commented 4 years ago

Right now on the python-packaging branch the data validation uses the output objects, so if the SQLite DB supports those objects (and we want the validation to keep using the output objects), then I don't think there's much, if anything, that has to be done to support the validation.

We spent some time on Friday trying to get Alana's system updated to work with the python-packaging branch, since master is now quite different, especially in the testing realm. I'm not sure if that was ultimately successful or not.

How does one indicate which SQLite DB the PudlTabl output object should refer to?

cmgosnell commented 4 years ago

Right now, I just converted the old pudl_engine to pudl_engine = pudl.output.export.connect_db(testing=testing), which shows up in basically every output function. I also had to pull pt into each function, rather than referencing it once at the beginning of each module with pt = pudl.output.pudltabl.get_table_meta(). The export.connect_db() function uses pudl_settings, so that is where the engine info is actually defined.

That is probably not the best end state for those to be in, but it avoids needing to pass the engine and metadata around to each function. Also, export.connect_db() might not be the right place for that function to live. I'm still thinking through any rearranging that may need to be done.
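For illustration, here's roughly what that interim pattern looks like inside a single output function. The function and call names follow the comment above, but the table name and the shape of get_table_meta()'s return value are assumptions on my part, not the actual code:

```python
import pandas as pd
import pudl


def plants_eia860(testing=False):
    """Illustrative sketch of an output function under the interim approach."""
    # The engine gets resolved inside the function via pudl_settings rather
    # than being passed in.
    pudl_engine = pudl.output.export.connect_db(testing=testing)
    # The table metadata likewise gets pulled in per-function instead of once
    # at module import time.
    pt = pudl.output.pudltabl.get_table_meta()
    # Assuming here that pt maps table names to SQLAlchemy Table objects; the
    # exact return type is an assumption for illustration.
    return pd.read_sql(pt["plants_eia860"].select(), pudl_engine)
```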

zaneselvans commented 4 years ago

I think we want the database connection information to be stored inside the PudlTabl output object -- any outputs that are being generated need to come from the same database, and that setting is persistent. So this would mean the engine would need to be one of the parameters of PudlTabl.__init__(). Then, instead of passing testing through to each of the external function calls (since we're getting rid of testing entirely), you'd pass in the engine.
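A minimal sketch of that idea, assuming an SQLAlchemy engine gets handed in at construction time (this is a hypothetical skeleton, not the actual PudlTabl implementation, and the plants_eia860 method body is just a stand-in for a real output function):

```python
import pandas as pd
import sqlalchemy as sa


class PudlTabl:
    """Hypothetical skeleton: the output object owns its DB connection."""

    def __init__(self, pudl_engine: sa.engine.Engine):
        # Store the engine so every derived output comes from the same DB.
        self.pudl_engine = pudl_engine

    def plants_eia860(self) -> pd.DataFrame:
        # Each output method hands the stored engine to whatever does the
        # real work, rather than taking a testing flag. Reading the table
        # directly is just a stand-in for the actual output function.
        return pd.read_sql("plants_eia860", self.pudl_engine)
```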

Generally I think we should probably be getting rid of all the default connect_db() style functions, since what we are connecting to is more dynamic now.

cmgosnell commented 4 years ago

This is what I had previously suggested.

And on connect_db(), I was thinking the same thing. We can end up using the same function to generate the FERC mirror database and the SQLite PUDL DB. It is a slightly different process for Postgres. Sometime soon we should think through how we want to enable the data packages -> whatever form conversion.
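A hedged sketch of what creating those engines explicitly (rather than through a default connect_db() helper) could look like; the pudl_settings keys and paths here are made up for illustration and are not the actual settings schema:

```python
import sqlalchemy as sa

# Hypothetical settings dict; these key names are illustrative only.
pudl_settings = {
    "ferc1_sqlite_path": "/path/to/ferc1.sqlite",
    "pudl_sqlite_path": "/path/to/pudl.sqlite",
}

# One explicit engine per database, created where it's needed, instead of a
# shared default connect_db(testing=...) helper.
ferc1_engine = sa.create_engine(f"sqlite:///{pudl_settings['ferc1_sqlite_path']}")
pudl_engine = sa.create_engine(f"sqlite:///{pudl_settings['pudl_sqlite_path']}")
```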

cmgosnell commented 4 years ago

Here are what I believe to be the remaining tasks for this issue: