catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Do a Zenodo data release that corresponds to 0.4.0 #697

Closed cmgosnell closed 2 years ago

cmgosnell commented 4 years ago

Use the databeta.sh script to generate a Dockerized data release that provides access to the fully processed datasets, plus a software environment that can be used to work with them, via the image we specify for the pudl-examples repository.

zaneselvans commented 2 years ago

The main blocking issue here is that something has changed, either in how Dask works or in the nature of our Parquet dataset, possibly because of changes to the PyArrow / Apache Arrow libraries. The pudl-examples notebook that deals with EPA CEMS no longer works. The following code used to read in all of the EPA CEMS data and summarize annual emissions efficiently, reading one state/year combo at a time. But now it attempts to read in all of the data at once and runs out of memory:

import pathlib

import dask.dataframe as dd

epacems_cols = [
    "year",
    "plant_id_eia",
    "co2_mass_tons",
    "so2_mass_lbs",
    "nox_mass_lbs",
]

# pudl_settings is the PUDL settings dictionary defined earlier in the notebook:
epacems_path = pathlib.Path(pudl_settings["parquet_dir"]) / "epacems"

epacems_ddf = (
    dd.read_parquet(
        epacems_path,
        columns=epacems_cols,
        engine="pyarrow",
    )
    .astype({"year": int})
    .groupby(by=["year", "plant_id_eia"])
    .sum()
)
epacems_annual_emissions = epacems_ddf.compute()

We need to figure out what changed and make it work again. Looking at the Parquet dataset prior to aggregation, it has npartitions == 1223, so the dataset is partitioned as expected, which I think means that Dask should be smart enough to do one small chunk of work at a time.
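Until the regression is tracked down, one possible workaround is to do the partition-at-a-time aggregation by hand: aggregate each chunk separately with pandas, then combine the partial group sums. This is a minimal sketch with toy in-memory partitions standing in for the per-state/year Parquet files (in the real dataset each frame would come from pd.read_parquet on one partition); aggregate_partitions and the toy frames are hypothetical illustrations, not part of PUDL.

```python
import pandas as pd

def aggregate_partitions(frames):
    """Sum emissions per (year, plant_id_eia), one partition at a time.

    Because sum is associative, summing per-partition partial results and
    then summing those partials gives the same answer as a global groupby,
    without ever holding the full dataset in memory.
    """
    partials = []
    for df in frames:
        partials.append(
            df.astype({"year": int})
            .groupby(["year", "plant_id_eia"])
            .sum()
        )
    # Combine the per-partition partial sums into a single result.
    return pd.concat(partials).groupby(level=["year", "plant_id_eia"]).sum()

# Stand-in partitions; in practice these would be lazily loaded, e.g.
# (pd.read_parquet(p) for p in epacems_path.glob("*.parquet")).
part1 = pd.DataFrame(
    {"year": [2019, 2019], "plant_id_eia": [1, 2], "co2_mass_tons": [10.0, 20.0]}
)
part2 = pd.DataFrame(
    {"year": [2019], "plant_id_eia": [1], "co2_mass_tons": [5.0]}
)

result = aggregate_partitions([part1, part2])
```

Passing a generator of frames instead of a list keeps peak memory bounded by one partition plus the accumulated partials.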

zaneselvans commented 2 years ago

Another weird problem I'm having is that "Run All Cells" sometimes does not actually run all cells: it will skip a block in the middle of a notebook and then resume running cells toward the end. This doesn't seem to be related to our release in particular; it also happens when I run the pudl-examples notebooks on my own computer without Docker. It doesn't happen if I run them in the "classic" notebook interface rather than JupyterLab, though.