edgi-govdata-archiving / ECHO-Cross-Program

Jupyter Notebooks for ECHO that use data from multiple EPA programs
https://colab.research.google.com/github/edgi-govdata-archiving/ECHO-Cross-Program/blob/master/ECHO-Cross-Programs.ipynb
GNU General Public License v3.0
9 stars 5 forks source link

Add state vs federal actions #25

Open ericnost opened 4 years ago

ericnost commented 4 years ago

See here for relevant fields: https://docs.google.com/document/d/1EWVy52R1Eqy5dc5mjXK7m5EOdVyBVXbI9hkL-ps4BlI/edit

Related to edgi-govdata-archiving/EEW_Planning/issues/98

ericnost commented 4 years ago

Here is one way to do it:

Add an "agency_field" property to the data class. Add the appropriate agency_field for each of the tables (using the above doc). Then, for the aggregation and charts:

program_data['State'] = np.where(program_data[program.agency_field]=="S", 1,0)
program_data['Federal']= np.where(program_data[program.agency_field]=="E", 1,0)

g = program_data.groupby([pd.to_datetime(program_data[program.date_field], format=program.date_format)])[["State", "Federal"]].sum() #for each date, take the sum of state inspections and federal inspections
g = g.resample("Y").sum() #sum of inspections for the year
g.index = g.index.strftime('%Y') #pretty format
g

ax = g.plot(kind='bar', stacked=True, title = program.name, figsize=(20, 10), fontsize=16)
ax

Example output: image

These feels rather hacky, creating new columns to count State inspections and Federal inspections. One thing to note here - and to always pay attention to - is using sum vs count to aggregate. In this case, I think we want to sum? This shows us the number of inspections, NOT the number of facilities inspected. If we wanted that, I think we would use count.

ericnost commented 4 years ago

@shansen5 One thing I just noticed in the Cross-Program notebook. An error I probably made!

For the charts, we currently do this:

d = program_data.groupby(pd.to_datetime(program_data[program.date_field], format=program.date_format))[[program.date_field]].count()
d = d.resample("Y").count()
d.index = d.index.strftime('%Y')

ax = d.plot(kind='bar', title = chart_title, figsize=(20, 10), legend=False, fontsize=16)
 ax

But I think d = d.resample("Y").count() should be d = d.resample("Y").sum() We want to sum the number of inspections, enforcements, violations that occurred within the year.

Might have some relation to #22 though I think there we are seeing the problem before grouping and charting?

^ I believe this separate issue has been addressed

ericnost commented 4 years ago

Challenges: for some things state/federal actions make sense to chart (inspections, enforcement actions, penalties), but for others (violations), not so much. This nuance just adds another layer of logic required for the notebook (if inspections/actions/penalties, chart state vs federal...else don't).

ericnost commented 4 years ago

With respect to not all the info being in the same place for each program, it would be helpful for someone to list out for each of the following nine data tables:

Some if not all of this is here: https://docs.google.com/document/d/1EWVy52R1Eqy5dc5mjXK7m5EOdVyBVXbI9hkL-ps4BlI/edit But it would be helpful to compile it based on my bullet points above.

Another thing to keep in mind is that it's more than just state vs federal - there are often local actions and tribal actions listed. How do we handle those? Do we just not count them?

What do we do about SDWA? That's another 3 or 4 tables. Maybe we should disable that because we don't have capacity to support it??

ericnost commented 4 years ago

Specifically, we would need to modify at least the show_chart() function here: https://github.com/edgi-govdata-archiving/ECHO_modules/blob/b0a288e4cca2d3a46e5378fb51986c034d455c2c/DataSetResults.py#L22

If we wanted to be able to export CSVs that distinguish between state and federal actions, that would require something else....