edgi-govdata-archiving / ECHO_modules

ECHO_modules is a Python package for analyzing a copy of the US Environmental Protection Agency's (EPA) Enforcement and Compliance History Online (ECHO) database
GNU General Public License v3.0
3 stars 6 forks source link

Change approach to ECHO_EXPORTER and program data join? #3

Closed ericnost closed 4 years ago

ericnost commented 4 years ago

I think the biggest challenge when we try to get the program data and join it with select ECHO_EXPORTER columns. For something like Greenhouse Gases, where there are only a few thousand records, it doesn't take too long. But some people were trying Combined Air Emissions, with about 10,000 program records for MA and it was slow. There are about 33,000 matching records for Water Quality Violations and that join takes many minutes.

Currently, in both the Sunrise notebook and Cross-Program, the join works like this going back and forth between ECHO_EXPORTER and the program specific table:

So, we're adding a lot of redundant information, which is why it takes so long. We add ECHO_EXPORTER info like name, lat, long, etc. to each record in the program-specific data, when really we only need to add it for each unique facility in the program-specific data.

What we could do is "rollup" or group my_program_data by Registry ID before this join. The problem is that we need to know how to do this grouping for each program - which column(s) and do we sum or count them? Perhaps this is something we could start to figure out for each table and add to the make_data_sets.py class definition. We already accomplish this kind of data aggregation in show_chart(), but by date rather than program id.