catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Add CEMS tests #260

Open · karldw opened 5 years ago

karldw commented 5 years ago

Currently, most of the EPA CEMS "tests" are database constraints. Here are some tests we might want to make explicit, particularly if we move away from the database (#258).

Text in italics goes beyond current testing. Are there others you'd like to see?

These tests could be run on Travis for small portions of the CEMS data, or outside Travis for the full dataset.

zaneselvans commented 4 years ago

Hey @karldw, I guess this isn't going to make it into the first data release, but I'm curious whether you have any thoughts on how we might implement these tests in a system where the CEMS isn't loaded into a DB and is only used via the partitioned parquet dataset. In particular, uniqueness of the composite primary key seems like it might be hard, since it has to take the entire dataset into account. Have you been using parquet + dask for your work with CEMS? Is there an easy way to run all of these tests as part of the data validation process we have set up to run with Tox, without reading the data more than once? Or should we be doing this stuff as the data passes through the ETL process somehow?

karldw commented 4 years ago

Testing in Tox should be pretty doable!

Dask tries to minimize duplicated work, particularly if you compute all the results at the same time with dask.compute(...). Probably more important is reading only the required columns. For example, the "Measurement codes are in a small, known set of strings" test could read only the measurement code columns, and would be the only test that reads those columns.
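For instance, here's a rough sketch of how a couple of value checks could share one pass over the data. The column names and the set of valid codes here are assumptions for illustration, not the real CEMS schema:

import dask
import dask.dataframe as dd

# Hypothetical column names and code set, purely for illustration.
KNOWN_CODES = {"Measured", "Calculated", "Substitute", "Imputed"}

def run_cems_value_checks(cems_files):
    # Read only the columns these checks need; other columns are never touched.
    df = dd.read_parquet(
        cems_files,
        columns=["so2_mass_measurement_code", "gross_load_mw"],
    )
    codes_ok = df["so2_mass_measurement_code"].isin(KNOWN_CODES).all()
    load_nonnegative = (df["gross_load_mw"].dropna() >= 0).all()
    # One compute() call lets dask share the underlying reads between checks.
    codes_ok, load_nonnegative = dask.compute(codes_ok, load_nonnegative)
    assert codes_ok, "unexpected measurement codes"
    assert load_nonnegative, "negative gross load values"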

The uniqueness of the primary key is probably the most resource-intensive check, but it should still be possible with code like the following (untested). You could also limit the check to a single state to cut down on processing time; there's a sketch of that after the code block.


import dask
import dask.dataframe as dd

def test_cems_primary_keys():
    # Read only the composite primary key columns from every partition.
    cems_files = "my_data_pkg/parquet/epacems/year=*/state=*/*"
    key_columns = ["plant_id_eia", "unitid", "operating_datetime_utc"]
    df = dd.read_parquet(cems_files, columns=key_columns)
    # Could also check for NAs in here.
    # A single dask.compute() call shares work between the two counts.
    row_count, unique_row_count = dask.compute(
        df.shape[0], df.drop_duplicates().shape[0]
    )
    assert row_count == unique_row_count
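
To take up the single-state suggestion, the glob in the sketch above could be narrowed before reading (the partition layout is assumed from the path above):

# Restrict to one state's partitions for a faster run, e.g. on Travis.
cems_files = "my_data_pkg/parquet/epacems/year=*/state=CO/*"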

Some of these checks, like missingness or measurement codes, would be straightforward to test as part of the ETL. On the other hand, it might be conceptually cleaner to run all the tests on the final output parquet files. Up to you all!
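
If the in-ETL route sounds appealing, a minimal sketch of a per-chunk check might look something like this; the function, column name, and code set are all hypothetical:

import pandas as pd

# Hypothetical set of valid measurement codes, for illustration only.
KNOWN_CODES = {"Measured", "Calculated", "Substitute", "Imputed"}

def validate_cems_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Raise if a chunk moving through the ETL contains unexpected codes."""
    bad = ~chunk["so2_mass_measurement_code"].isin(KNOWN_CODES)
    if bad.any():
        raise ValueError(f"{int(bad.sum())} rows have unrecognized measurement codes")
    return chunk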