catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Add CEMS tests #260

Open · karldw opened 5 years ago

karldw commented 5 years ago

Currently, most of the EPA CEMS "tests" are database constraints. Here are some tests we might want to make explicit, particularly if we move away from the database (#258).

Text in italics goes beyond current testing. Are there others you'd like to see?

These tests could be run on Travis for small portions of the CEMS data, or outside Travis for the full dataset.

zaneselvans commented 4 years ago

Hey @karldw, I guess this isn't going to make it into the first data release, but I'm curious whether you have any thoughts on how we might implement these tests in a system where the CEMS isn't loaded into a DB and is only used via the partitioned parquet dataset. In particular, uniqueness of the composite primary key seems like it might be hard, since it has to take the entire dataset into account. Have you been using parquet + dask for your work with CEMS? Is there an easy way to run all of these tests as part of the data validation process we have set up to run with Tox, without reading the data more than once? Or should we be doing this stuff as the data passes through the ETL process somehow?

karldw commented 4 years ago

Testing in Tox should be pretty doable!

Dask tries to minimize duplicated work, particularly if you compute all the results at the same time with dask.compute(...). Probably more important is reading only the required columns. For example, the "Measurement codes are in a small, known set of strings" test could read only the measurement code columns, and would be the only test that reads those columns.
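For instance, here's a rough sketch of how a couple of value checks could share one pass over the data. The column names and the set of valid codes here are assumptions for illustration, not the real CEMS schema:

import dask
import dask.dataframe as dd

# Hypothetical column names and code set, purely for illustration.
KNOWN_CODES = {"Measured", "Calculated", "Substitute", "Imputed"}

def run_cems_value_checks(cems_files):
    # Read only the columns these checks need; other columns are never touched.
    df = dd.read_parquet(
        cems_files,
        columns=["so2_mass_measurement_code", "gross_load_mw"],
    )
    codes_ok = df["so2_mass_measurement_code"].isin(KNOWN_CODES).all()
    load_nonnegative = (df["gross_load_mw"].dropna() >= 0).all()
    # One compute() call lets dask share the underlying reads between checks.
    codes_ok, load_nonnegative = dask.compute(codes_ok, load_nonnegative)
    assert codes_ok, "unexpected measurement codes"
    assert load_nonnegative, "negative gross load values"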

The uniqueness of the primary key is probably the most resource-intensive check, but it should still be possible with code like the following (untested). You could also limit the check to a single state to cut down on processing time; there's a sketch of that after the code block.


import dask
import dask.dataframe as dd

def test_cems_primary_keys():
    # Read only the composite primary key columns from every partition.
    cems_files = "my_data_pkg/parquet/epacems/year=*/state=*/*"
    key_columns = ["plant_id_eia", "unitid", "operating_datetime_utc"]
    df = dd.read_parquet(cems_files, columns=key_columns)
    # Could also check for NAs in here.
    # A single dask.compute() call shares work between the two counts.
    row_count, unique_row_count = dask.compute(
        df.shape[0], df.drop_duplicates().shape[0]
    )
    assert row_count == unique_row_count
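
To take up the single-state suggestion, the glob in the sketch above could be narrowed before reading (the partition layout is assumed from the path above):

# Restrict to one state's partitions for a faster run, e.g. on Travis.
cems_files = "my_data_pkg/parquet/epacems/year=*/state=CO/*"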

Some of these checks, like missingness or measurement codes, would be straightforward to test as part of the ETL. On the other hand, it might be conceptually cleaner to run all the tests on the final output parquet files. Up to you all!
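
If the in-ETL route sounds appealing, a minimal sketch of a per-chunk check might look something like this; the function, column name, and code set are all hypothetical:

import pandas as pd

# Hypothetical set of valid measurement codes, for illustration only.
KNOWN_CODES = {"Measured", "Calculated", "Substitute", "Imputed"}

def validate_cems_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Raise if a chunk moving through the ETL contains unexpected codes."""
    bad = ~chunk["so2_mass_measurement_code"].isin(KNOWN_CODES)
    if bad.any():
        raise ValueError(f"{int(bad.sum())} rows have unrecognized measurement codes")
    return chunk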