The main blocking issue here is that something has changed, either about how Dask works or about the nature of our Parquet dataset, maybe because of changes to the PyArrow / Apache Arrow libraries. The pudl-examples notebook that deals with EPA CEMS no longer works. The following code used to efficiently serialize reading in all of the EPA CEMS data and summarizing annual emissions, reading one state/year combo at a time. But now it attempts to read in all of the data at once, and runs out of memory:
```python
import pathlib

import dask.dataframe as dd

# pudl_settings is defined earlier in the notebook by the PUDL workspace setup.
epacems_cols = [
    "year",
    "plant_id_eia",
    "co2_mass_tons",
    "so2_mass_lbs",
    "nox_mass_lbs",
]
epacems_path = pathlib.Path(pudl_settings["parquet_dir"]) / "epacems"
epacems_ddf = (
    dd.read_parquet(
        epacems_path,
        columns=epacems_cols,
        engine="pyarrow",
    )
    .astype({"year": int})
    .groupby(by=["year", "plant_id_eia"])
    .sum()
)
epacems_annual_emissions = epacems_ddf.compute()
```
We need to figure out what changed and make it work again. Looking at the Parquet dataset prior to aggregation, it has `npartitions == 1223`, so the dataset is partitioned from the start, which I think means that Dask should be smart enough to just do one small chunk of work at a time.
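In the meantime, one possible workaround is to force the one-chunk-at-a-time behavior by hand: aggregate each Dask partition separately and combine the partial sums in pandas. This is only a sketch using the public `dd.read_parquet` / `.partitions` API, with the same columns and path as above; whether it actually sidesteps whatever changed upstream is untested:

```python
import pathlib

import dask.dataframe as dd
import pandas as pd

epacems_cols = ["year", "plant_id_eia", "co2_mass_tons", "so2_mass_lbs", "nox_mass_lbs"]
epacems_path = pathlib.Path(pudl_settings["parquet_dir"]) / "epacems"

ddf = dd.read_parquet(epacems_path, columns=epacems_cols, engine="pyarrow")

# Aggregate one partition at a time, so only a single state/year chunk
# ever has to be held in memory at once.
partials = [
    ddf.partitions[i]
    .astype({"year": int})
    .groupby(["year", "plant_id_eia"])
    .sum()
    .compute()
    for i in range(ddf.npartitions)
]

# Combine the per-partition sums into the final annual totals.
epacems_annual_emissions = (
    pd.concat(partials).groupby(level=["year", "plant_id_eia"]).sum()
)
```

This is obviously much clumsier than letting Dask schedule the work itself, but it would at least tell us whether the individual per-partition reads still behave.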
Another weird problem I'm having is that "Run All Cells" sometimes does not actually run all cells. It will skip a block in the middle of a notebook and then run more cells further down toward the end. This doesn't seem to be related to our release in particular: it also happens when I run the pudl-examples notebooks on my own computer without using Docker. If I run them in the "classic" notebook rather than Jupyterlab it doesn't happen, though.
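One way to take the Jupyterlab UI out of the equation while chasing the skipped-cells problem would be to execute a notebook headlessly with nbclient, which runs every cell in order and raises if any cell fails. A minimal sketch, assuming nbclient is installed; the notebook path here is hypothetical:

```python
import nbformat
from nbclient import NotebookClient

# Execute every cell in order, outside of any notebook UI, so skipped
# cells or errors surface deterministically.
nb = nbformat.read("notebooks/epacems_demo.ipynb", as_version=4)  # hypothetical path
client = NotebookClient(nb, timeout=600)
client.execute()
nbformat.write(nb, "notebooks/epacems_demo_executed.ipynb")
```

If the headless run completes every cell, that would point at the Jupyterlab "Run All Cells" machinery rather than at the notebooks themselves.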
- Use the databeta.sh script to generate a Docker-ized data release that provides access to fully processed datasets and a software environment that can be used to work with them, via the image we specify for the pudl-examples repository.
- Update the environment.yml file in the pudl-examples repo to use the newly released PUDL.
- Run the pudl-example notebooks on 2i2c JupyterHub to verify that the new image + data + hub work.