edgi-govdata-archiving / ECHO_modules

ECHO_modules is a Python package for analyzing a copy of the US Environmental Protection Agency's (EPA) Enforcement and Compliance History Online (ECHO) database
GNU General Public License v3.0
3 stars 6 forks source link

Add Clean Water Act Discharge Monitoring Reports for FY2020 as a preset DataSet object #14

Closed ericnost closed 3 years ago

ericnost commented 3 years ago

make_data_sets.py creates objects for storing important info from database queries. We have objects for CWA inspections, violations, greenhouse gas emissions, etc. For the watershed notebook, we need an object to handle the Discharge Monitoring Reports (DMRs) for FY 2020. This will allow us to report trends in actual water pollutant discharges, as reported by regulated facilities (not just whether facilities were in violation or how much they exceeded their permit....we will be able to report ALL discharges). Note: this is maybe the biggest table in our database, so performance may be an issue. Also I think we may need to create a materialized view for it.

Here is an example of an existing DataSet object:

"Greenhouse Gas Emissions": dict(
        echo_type="GHG",
        base_table="POLL_RPT_COMBINED_EMISSIONS",
        table_name="GREENHOUSE_GASES_MVIEW",
        idx_field="REGISTRY_ID",
        date_field="REPORTING_YEAR",
        date_format="%Y", 
        agg_type="sum",
        agg_col="ANNUAL_EMISSION", 
        unit="metric tons of CO2 equivalent"
    )

I suspect the object for the DMRs will look something like this:

"DMRs": dict(
        echo_type="NPDES",
        base_table="NPDES_DMRS_FY2020",
        table_name="XXXXXXX", #we don't have a materialized view of this table, I don't think
        idx_field="EXTERNAL_PERMIT_NMBR",
        date_field="MONITORING_PERIOD_END_DATE",
        date_format="%m/%d/%Y", 
        agg_type="sum",
        agg_col=XXXXXXX, #we need to take a closer look and think through how to summarize this info, since it addresses a vast array of chemicals and differing units of measure
        unit="units" #differing units of measure, which can be found in the LIMIT_UNIT_DESC field
    )

To see what the DMR table looks like, you can view output from this notebook: https://colab.research.google.com/drive/1GmvDCpbDmBERymLlxaVrBkgLzG7t4M90#scrollTo=cfV3jjvmbt7n

As noted, the biggest challenge is how to implement standard aggregation of the info in the table. I suspect what we'll want to do is roll-up by PARAMETER_CODE and then sum DMR_VALUE_NMBR to report on specific pollutants. But I haven't investigated the NPDES_DMRS_FY2020 table enough to know if that's right. Plus, it will depend on what the team decides as the goal of the notebook. Perhaps we want to show the total amount of pollutants permitted in the watershed (which, looking at the spottiness of these DMR records, may be more feasible than showing the total actually released).

shansen5 commented 3 years ago

For the agg_type, agg_col and unit we might have a dictionary for different columns so we can specify different columns as needed. I'll look at the other notebooks and see how that would work for them.

ericnost commented 3 years ago

I think that's a fantastic idea though it may only apply to this DataSet object and it may require changes to the class (to handle a dictionary instead of self.agg_type as string) and to the notebooks?