Open karldw opened 5 years ago
Hey @karldw, I guess this isn't going to make it into the first data release, but I'm curious if you have any thoughts on how we might implement these tests in a system where the CEMS isn't getting loaded into a DB, but is being used only via the partitioned parquet dataset. In particular the uniqueness of composite primary keys seems like it might be hard since it has to take into account the entire dataset. Have you been using parquet + dask for your work with CEMS? Is there an easy way to have it do all of these tests as part of the data validation process we have set up to run with Tox, without needing to read the data in more than once? Or should we be doing this stuff as the data passes through the ETL process somehow?
Testing in Tox should be pretty doable!
Dask tries to minimize the amount of duplicated work, particularly if you compute all the results at the same time with dask.compute(...).
Probably more important is only working with the required columns. For example, "Measurement codes are in a small, known set of strings" could read only the measurement code columns (and would be the only test that reads those columns).
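Here's a rough, untested sketch of what a column-limited measurement-code check could look like. The function names, the column names you'd pass in, and the set of allowed codes are all assumptions for illustration, not anything that exists in the project today. The dask imports live inside the function so the small comparison helper can be used (and tested) on its own.

```python
def unexpected_codes(observed, allowed):
    """Return the sorted codes that appear in the data but not in the allowed set."""
    return sorted(set(observed) - set(allowed))


def find_bad_measurement_codes(cems_files, code_columns, allowed_codes):
    """Read only the code columns and report unexpected values per column.

    All three arguments are hypothetical inputs: a parquet glob, a list of
    measurement-code column names, and the known set of valid strings.
    """
    # Imported here so the pure helper above works without dask installed.
    import dask
    import dask.dataframe as dd

    df = dd.read_parquet(cems_files, columns=list(code_columns))
    # Build one lazy unique() per column, then evaluate them in a single
    # compute() call so dask can share the parquet reads between them.
    lazy_uniques = [df[col].dropna().unique() for col in code_columns]
    uniques = dask.compute(*lazy_uniques)
    return {
        col: unexpected_codes(values, allowed_codes)
        for col, values in zip(code_columns, uniques)
    }
```

A test would then assert that every list in the returned dict is empty, and the failure message can show exactly which stray codes turned up in which column.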
The uniqueness of primary keys is probably the most intensive, but should still be possible with code like this (untested). You could also limit to only one state to cut down on processing time.
import dask
import dask.dataframe as dd


def test_cems_primary_keys():
    cems_files = "my_data_pkg/parquet/epacems/year=*/state=*/*"
    key_columns = ["plant_id_eia", "unitid", "operating_datetime_utc"]
    df = dd.read_parquet(cems_files, columns=key_columns)
    # Could also check for NAs in here
    row_count, unique_row_count = dask.compute(df.shape[0], df.drop_duplicates().shape[0])
    assert row_count == unique_row_count
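If you want the NA check folded into the same pass, something like the following might work (also untested against real CEMS data). It asks dask for the row count, the unique row count, and the per-column NA counts in one dask.compute(...) call. The helper names are made up for this sketch, and the reporting is split into a plain-Python function so it can be exercised without any parquet files.

```python
def summarize_key_problems(row_count, unique_row_count, na_counts):
    """Turn the computed counts into a list of human-readable problems."""
    problems = []
    n_dupes = row_count - unique_row_count
    if n_dupes:
        problems.append(f"{n_dupes} rows duplicate an existing primary key")
    for col, n_missing in na_counts.items():
        if n_missing:
            problems.append(f"{n_missing} missing values in {col}")
    return problems


def check_cems_primary_keys(cems_files, key_columns):
    """Hypothetical wrapper: read only the key columns and check them once."""
    # Imported here so summarize_key_problems() has no dask dependency.
    import dask
    import dask.dataframe as dd

    df = dd.read_parquet(cems_files, columns=list(key_columns))
    # Request all three results in a single compute() call.
    row_count, unique_row_count, na_counts = dask.compute(
        df.shape[0],
        df.drop_duplicates().shape[0],
        df.isna().sum(),
    )
    return summarize_key_problems(
        int(row_count), int(unique_row_count), na_counts.to_dict()
    )
```

The test body then shrinks to asserting that check_cems_primary_keys(...) returns an empty list, which gives a more useful failure message than a bare count mismatch.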
Some of these checks, like missingness or measurement codes, would be straightforward to test as part of the ETL. On the other hand, it might be conceptually cleaner to run all the tests on the final output parquet files. Up to you all!
Currently, most of the EPA CEMS "tests" are database constraints. Here are some tests we might want to make explicit, particularly if we move away from the database (#258).
plants_entity_eia table (this is enforced by the timezone calculation anyway), and gives an informative error on fixing the problem if not
plants_entity_eia doesn't exist
Text in italics goes beyond current testing. Are there others you'd like to see?
These tests could be run on travis for small portions of the CEMS data, or outside travis for the whole set.
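One possible way to switch between a small slice and the full dataset is to build the parquet glob from environment variables. This is only a sketch: the variable names CEMS_TEST_STATE and CEMS_TEST_YEAR and the cems_glob helper are invented here, and the year=/state= directory layout is assumed to match the partitioned output.

```python
import os


def cems_glob(base_dir, environ=None):
    """Build a parquet glob, optionally narrowed to one state and/or year.

    CEMS_TEST_STATE and CEMS_TEST_YEAR are hypothetical variable names.
    When unset or blank, each falls back to "*" and matches every partition.
    """
    env = os.environ if environ is None else environ
    state = env.get("CEMS_TEST_STATE", "*").strip() or "*"
    year = env.get("CEMS_TEST_YEAR", "*").strip() or "*"
    return f"{base_dir}/year={year}/state={state}/*"
```

On the CI side you would set, say, CEMS_TEST_STATE=ID to keep the run short, and leave both variables unset for a full local run. If the tests run under Tox, the variables would probably need to be listed under passenv so they reach the test environment.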