The main blocking issue here is that something has changed, either about how Dask works or about the nature of our Parquet dataset, maybe because of changes to the PyArrow / Apache Arrow libraries. The pudl-examples notebook that deals with EPA CEMS no longer works. The following code used to efficiently serialize reading in all of the EPA CEMS data and summarizing annual emissions, reading one state/year combo at a time. But now it attempts to read in all of the data at once, and runs out of memory:
```python
import pathlib

import dask.dataframe as dd

# pudl_settings is defined earlier in the notebook by the PUDL workspace setup.
epacems_cols = [
    "year",
    "plant_id_eia",
    "co2_mass_tons",
    "so2_mass_lbs",
    "nox_mass_lbs",
]
epacems_path = pathlib.Path(pudl_settings["parquet_dir"]) / "epacems"
epacems_ddf = (
    dd.read_parquet(
        epacems_path,
        columns=epacems_cols,
        engine="pyarrow",
    )
    .astype({"year": int})
    .groupby(by=["year", "plant_id_eia"])
    .sum()
)
epacems_annual_emissions = epacems_ddf.compute()
```
We need to figure out what changed and make it work again. Looking at the Parquet dataset prior to aggregation, it has `npartitions == 1223`, so the dataset is partitioned from the start, which I think means that Dask should be smart enough to just do one small chunk of work at a time.
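In the meantime, one possible workaround is to force the one-chunk-at-a-time behavior by hand: aggregate each Dask partition separately and combine the partial sums in pandas. This is only a sketch using the public `dd.read_parquet` / `.partitions` API, with the same columns and path as above; whether it actually sidesteps whatever changed upstream is untested:

```python
import pathlib

import dask.dataframe as dd
import pandas as pd

epacems_cols = ["year", "plant_id_eia", "co2_mass_tons", "so2_mass_lbs", "nox_mass_lbs"]
epacems_path = pathlib.Path(pudl_settings["parquet_dir"]) / "epacems"

ddf = dd.read_parquet(epacems_path, columns=epacems_cols, engine="pyarrow")

# Aggregate one partition at a time, so only a single state/year chunk
# ever has to be held in memory at once.
partials = [
    ddf.partitions[i]
    .astype({"year": int})
    .groupby(["year", "plant_id_eia"])
    .sum()
    .compute()
    for i in range(ddf.npartitions)
]

# Combine the per-partition sums into the final annual totals.
epacems_annual_emissions = (
    pd.concat(partials).groupby(level=["year", "plant_id_eia"]).sum()
)
```

This is obviously much clumsier than letting Dask schedule the work itself, but it would at least tell us whether the individual per-partition reads still behave.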
Another weird problem I'm having is that "Run All Cells" sometimes does not actually run all cells. It will skip a block in the middle of a notebook and then run more cells further down toward the end. This doesn't seem to be related to our release in particular: it also happens when I run the pudl-examples notebooks on my own computer without using Docker. If I run them in the "classic" notebook rather than Jupyterlab it doesn't happen, though.
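One way to take the Jupyterlab UI out of the equation while chasing the skipped-cells problem would be to execute a notebook headlessly with nbclient, which runs every cell in order and raises if any cell fails. A minimal sketch, assuming nbclient is installed; the notebook path here is hypothetical:

```python
import nbformat
from nbclient import NotebookClient

# Execute every cell in order, outside of any notebook UI, so skipped
# cells or errors surface deterministically.
nb = nbformat.read("notebooks/epacems_demo.ipynb", as_version=4)  # hypothetical path
client = NotebookClient(nb, timeout=600)
client.execute()
nbformat.write(nb, "notebooks/epacems_demo_executed.ipynb")
```

If the headless run completes every cell, that would point at the Jupyterlab "Run All Cells" machinery rather than at the notebooks themselves.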
- Use the databeta.sh script to generate a Docker-ized data release that provides access to fully processed datasets and a software environment that can be used to work with them, via the image we specify for the pudl-examples repository.
- Update the environment.yml file in the pudl-examples repo to use the newly released PUDL.
- Run the pudl-example notebooks on 2i2c JupyterHub to verify that the new image + data + hub work.