edgi-govdata-archiving / Environmental_Enforcement_Watch

General planning for EEW events
GNU General Public License v3.0
2 stars 3 forks source link

Cataloguing data errors #96

Open Frijol opened 4 years ago

Frijol commented 4 years ago

We've mentioned cataloguing data errors before but I don't think we have a place to list them. For now, collecting here:

┆Issue is synchronized with this Trello card

shapironick commented 4 years ago

The industry-specific data: sic and NAICS does are often missing and sometimes wrong. so looking at in enforcement in a single industry is very very labor intensive.

We've found upwards of 12 FRS id numbers for a single prison. they aren't cross referenced, we just had to find them all and compile them --is that what you mean by disaggregations?

once we get the prison data retrieval from the SBU database working we'll be able to have validated lat/long next to EPA lat/long, so we could quantify the inaccuracies.

Frijol commented 4 years ago

"1" is a valid zip code according to the database

Would be interesting to create a notebook just for identifying data errors

ericnost commented 4 years ago

Adding this, from Leif:

"However, while increased scrutiny has been given to the quality of data in the national systems in recent years, the data quality before November 2000 has not been assessed and should be considered unknown." https://echo.epa.gov/tools/data-downloads <- We should really consider modifying the notebooks to start in 2001 ->

"State/Local penalties were not tracked until Docket was replaced by ICIS in 2002." https://echo.epa.gov/tools/data-downloads/icis-fec-download-summary

ericnost commented 4 years ago

This is an important resource: https://echo.epa.gov/resources/echo-data/about-the-data

ericnost commented 4 years ago

From the above ^^^ "Enforcement actions that have not been concluded are not displayed in ECHO"

ericnost commented 4 years ago

See also: https://echo.epa.gov/resources/echo-data/known-data-problems Perhaps we are trying to add to this/make it more accessible.

ericnost commented 4 years ago

Relevant to the CD LA-2 report: "In some circumstances, Clean Air Act enforcement actions taken by the state of Louisiana have been reported to EPA in duplicate (meaning that one enforcement action and penalty will appear twice on ECHO reports). Louisiana is working to correct this information. (Updated February 2018)" @calderEDJ

ericnost commented 4 years ago

More on geocoding errors....

so I ran the facility-all-programs notebook that <@U4C7QT800> linked above for LA D2 and got some really weird results in the map: including one facility in texas and another in kentucky? View in Slack

ericnost commented 4 years ago

There are 52 EPA facilities with FAC_COUNTY listed as METROPOLITAN BOSTON and tracked in the ECHO database, but there is no METROPOLITAN BOSTON county in reality.

shapironick commented 4 years ago

This one made me LOL

Frijol commented 4 years ago

cc @mraisle @leiffredrickson that this collection of data errors exists. There's also this document Principles for Open Environmental Data and I think some things are tracked in the EEW Tracker

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

Frijol commented 3 years ago

Still want to do this I think!

shapironick commented 3 years ago

Would this be a place to list discrepancies between SBU Data and ECHO web interface or would that be somewhere else?

ericnost commented 3 years ago

@shapironick I think here would be the best place - not sure there's anywhere else!

shapironick commented 3 years ago

okie great! this is just the one i mentioned in meeting last week: FRS ID 110013272680. The formal action that is listed on the Detailed Facility Report (https://echo.epa.gov/detailed-facility-report?fid=110013272680) isn't listed in the enforcement actions that we were able to query via SBU database. I'm not sure why it's not listed.

ericnost commented 3 years ago

@shapironick Hmm...interesting that that facility has no address too...

Ok, so I think I know what the problem is. Here we go:

First, I can confirm that when I do my own query of the SBU database, with a different notebook, I don't see this formal action either.

Note that the formal action was led by US EPA and is being pulled from the ICIS data system, not the SDWIS data system. So that's why it's not showing up in the notebook/database. Because it's not in the SDWA_Enforcements table in the SBU database, which is based on SDWIS.

Instead, I suspect the formal enforcement action record resides in the ICIS FEC tables: https://echo.epa.gov/tools/data-downloads/icis-fec-download-summary We have these in the database, but haven't done much with them.

I will check this table to confirm.

Another great example of the fragmented data infrastructure we have to work with.

ericnost commented 3 years ago

Part of the challenge with working with the ICIS FEC tables - which are kind of like, a bunch of random other enforcement actions not described elsewhere - is that they are not organized by facility. There is no way to look at the CASE_ENFORCEMENTS table by facility. You have to query another table to get the enforcement action IDs associated with the facility, then ping the actual CASE_ENFORCEMENTS table for the detailed records.

ericnost commented 3 years ago

Indeed, I found the formal action in the CASE_ENFORCEMENTS table:

ACTIVITY_ID ACTIVITY_NAME STATE_CODE REGION_CODE FISCAL_YEAR CASE_NUMBER CASE_NAME ACTIVITY_TYPE_CODE ACTIVITY_TYPE_DESC ACTIVITY_STATUS_CODE ACTIVITY_STATUS_DESC ACTIVITY_STATUS_DATE LEAD CASE_STATUS_DATE DOJ_DOCKET_NMBR ENF_OUTCOME_CODE ENF_OUTCOME_DESC TOTAL_PENALTY_ASSESSED_AMT TOTAL_COST_RECOVERY_AMT TOTAL_COMP_ACTION_AMT HQ_DIVISION BRANCH VOLUNTARY_SELF_DISCLOSURE_FLAG MULTIMEDIA_FLAG ENF_SUMMARY_TEXT
3600738616 Texas Department of Criminal Justice 6 2016 06-2016-1212 Texas Department of Criminal Justice AFR Administrative - Formal CLS Closed 10/25/17 EPA 10/25/17   ECN Final Order No Penalty       6EN-W N   On May 26, 2016, EPA Region 6 issued an Administrative Order(AO) for Plan Implementation (AO Plan Implementation) to the Texas Deparment of Criminal Justice (Respondent). Respondent has requested additional time to complete the construction schedule. The AO incorporates the new construction schedule deadline extension request submitted by the Respondent in their compliance plan. On July 19, 2011, EPA Region 6 issued the original AO under its authority pursuant to the Safe Drinking Water Act to Respondent for violations at its water system in Harlingen, TX. The Original AO was issued in response to violations of the arsenic maximum contaminant level (MCL). The Original AO required Respondent to comply immediately with the referenced MCL or to submit to EPA 1) a detailed plan to meet the arsenic MCL; 2) a cost analysis of the proposed plan; and 3) a construction schedule. This AO extends the compliance deadline to June 13, 2017 and supersedes the AO issued on July 19, 2011.

Seems like we need to figure out ways of querying this table. I think it would involve pinging the CASE_FACILITIES table for the REGISTRY_ID field (which links to the main ECHO_EXPORTER table), and then grabbing the ACTIVITY_ID (s) or CASE_NUMBER (s)

I tested this approach:

sql = 'select * from "CASE_FACILITIES" where "REGISTRY_ID" = \'110013272680\''

This returns TWO actions, the one from 2016 and also one from 2011 (remember that ECHO.EPA.GOV only displays enforcement actions from the previous 5 years....our database has the full archive)

To get the full record then we grab the ACTIVITY_ID (s)

sql = 'select * from "CASE_ENFORCEMENTS" where "ACTIVITY_ID" = \'2600045351\'' #the one from 2011

  ACTIVITY_ID ACTIVITY_NAME STATE_CODE REGION_CODE FISCAL_YEAR CASE_NUMBER CASE_NAME ACTIVITY_TYPE_CODE ACTIVITY_TYPE_DESC ACTIVITY_STATUS_CODE ACTIVITY_STATUS_DESC ACTIVITY_STATUS_DATE LEAD CASE_STATUS_DATE DOJ_DOCKET_NMBR ENF_OUTCOME_CODE ENF_OUTCOME_DESC TOTAL_PENALTY_ASSESSED_AMT TOTAL_COST_RECOVERY_AMT TOTAL_COMP_ACTION_AMT HQ_DIVISION BRANCH VOLUNTARY_SELF_DISCLOSURE_FLAG MULTIMEDIA_FLAG ENF_SUMMARY_TEXT
0 2600045351 TDCJW Pack Unit 6 2011 06-2011-1355 TDCJW Pack Unit AFR Administrative - Formal CSU Closed Superseded 5/26/16 EPA 5/26/16   ESU Superseded by Another Enforcement Action   6EN-W N   Administrative order issued for non compliance with the requirements of the Act and implementing regualtions including the maximum Contaminant Levels (MCLs) for arsenic.
ericnost commented 3 years ago

@shapironick - I've lost the link to the notebook Steve wrote for you, but if you share it with me, I could try adding in a section that looks for these other enforcement actions.

shapironick commented 3 years ago

Amazing!! thank you, Eric! This is so helpful.

Here is shane's notebook https://colab.research.google.com/github/shansen5/CA_Prisons/blob/master/CA_Prisons2.ipynb#scrollTo=MhTd2hLBpUJT

do you think that the ICIS FEC table is mutually exclusive with other violations? or would there be the possibility of duplication?

ericnost commented 3 years ago

@shapironick Yes, I think there could be duplicates. According to the metadata: "ICIS may contain duplicates of actions and penalties shown in the program databases (i.e., ICIS-Air, ICIS-NPDES, and RCRAInfo)." https://echo.epa.gov/tools/data-downloads/icis-fec-download-summary

ericnost commented 3 years ago

Ok, I've made some edits to @shansen5's notebook - the principles are there and it'll work for a small number of facilities at this point, but hopefully Steve can adjust for looking at a greater number

shapironick commented 3 years ago

Thank you so much, Eric!

On Wed, Feb 3, 2021 at 2:44 PM Eric Nost notifications@github.com wrote:

Ok, I've made some edits to @shansen5 https://github.com/shansen5's notebook - the principles are there and it'll work for a small number of facilities at this point, but hopefully Steve can adjust for looking at a greater number

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/edgi-govdata-archiving/EEW_Planning/issues/96#issuecomment-772877288, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZH33ZRB6NAX4TWD4LVSB3S5HGQVANCNFSM4NFHFRIA .

-- Nicholas Shapiro Assistant Professor UCLA Institute for Society and Genetics Office: (310) 206-2366 Book my office hours here https://calendly.com/n-s/office-hours-monday-1-30-3-30?

ericnost commented 3 years ago

One thing I have tried to do is to look at the percent of facilities that have no available information (blank, null or NaN values in our database). I tested this on the summary Clean Water Act information available in the main ECHO Exporter table. I compared facilities in majority white neighbourhoods with those in majority minority neighborhoods: https://colab.research.google.com/drive/1Ej7d2ovZZx-mQz3wuPzrKWlPa5ZBY-l7 This shows that, across the US, the vast majority of CWA-regulated facilities have no recorded information about # of inspections, violations, enforcement actions, etc. It could be that's because they have no violations / enforcement actions. But reporting null is different from reporting a verified 0! This also shows that for the CWA, facilities located in majority minority neighbourhoods tend to have less certain information. This is especially true for inspections: Percent of facilities with no info on CWA_INSPECTION_COUNT 87.68 (minority) 80.79 (white) Percent of facilities with no info on CWA_DAYS_LAST_INSPECTION 77.87 (minority) 68.86 (white) (edited)

shapironick commented 3 years ago

Incredible work, Eric!

ericnost commented 3 years ago

Watershed errors:

The Niagara River HUC8 watershed is listed as 04270101 see here: https://www.sciencebase.gov/catalog/item/5d6f6573e4b0c4f70cf90848 It is also the 427 one here https://www.arcgis.com/home/webmap/viewer.html?useExisting=1&layers=4c08f2e2b13741da96ad4a8f6aa5e36a

This gets rendered as 4270101 in our database. But looking at the database, the only HUCs listed for facilities in Buffalo are 4120103, 4120104, nan, 4140101, and 2010008. For Tonawanda, it's just 4120104 and nan

The Niagara River HUC8 is listed as 04120104 here https://water.usgs.gov/wsc/cat/04120104.html

So, unfortunately, it appears there are at least two HUC8 classification schemes in circulation. Seems worth digging into....I think the 427 is more recent and EPA has failed to update yet.

Frijol commented 3 years ago

Linking in @ericnost 's ideas for visualizing missing reports https://edgi.slack.com/archives/CKN6FMX42/p1615479261188300

ericnost commented 3 years ago

https://github.com/edgi-govdata-archiving/ECHO-Watershed/issues/14