ECHO_modules is a Python package for analyzing a copy of the US Environmental Protection Agency's (EPA) Enforcement and Compliance History Online (ECHO) database
GNU General Public License v3.0
Change approach to ECHO_EXPORTER and program data join? #3
I think the biggest challenge is getting the program data and joining it with select ECHO_EXPORTER columns. For something like Greenhouse Gases, where there are only a few thousand records, it doesn't take too long. But some people were trying Combined Air Emissions, with about 10,000 program records for MA, and it was slow. There are about 33,000 matching records for Water Quality Violations, and that join takes many minutes.
Currently, in both the Sunrise notebook and Cross-Program, the join works like this, going back and forth between ECHO_EXPORTER and the program-specific table:
1. Compile each relevant program ID and registry ID in ECHO_EXPORTER (e.g. each NPDES ID).
2. Query the relevant table with these IDs.
3. For each record that's returned:
4. Go back and look up this facility's Registry ID through its Program ID (from step 1).
5. Based on that, add relevant ECHO_EXPORTER columns to the program data record.
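The steps above can be sketched in pandas. This is a minimal stand-in, not the actual notebook code: the DataFrames, column names (`NPDES_IDS`, `FAC_NAME`, `FAC_LAT`, `FAC_LONG`), and values here are hypothetical, chosen just to show why the per-record lookup is slow.

```python
import pandas as pd

# Hypothetical stand-ins for ECHO_EXPORTER and a program-specific table.
echo = pd.DataFrame({
    "REGISTRY_ID": [1001, 1002, 1003],
    "NPDES_IDS": ["MA001", "MA002", "MA003"],
    "FAC_NAME": ["Plant A", "Plant B", "Plant C"],
    "FAC_LAT": [42.1, 42.2, 42.3],
    "FAC_LONG": [-71.1, -71.2, -71.3],
})
program = pd.DataFrame({
    "NPDES_ID": ["MA001", "MA001", "MA002", "MA002", "MA002"],
    "VIOLATION": ["V1", "V2", "V3", "V4", "V5"],
})

# Step 1: map each program ID to its Registry ID.
id_map = dict(zip(echo["NPDES_IDS"], echo["REGISTRY_ID"]))

# Steps 3-5: for each program record, look up the facility and copy
# ECHO_EXPORTER columns onto it -- one lookup per record, so the same
# facility's name/lat/long get re-attached over and over.
rows = []
for _, rec in program.iterrows():
    registry_id = id_map[rec["NPDES_ID"]]
    fac = echo.loc[echo["REGISTRY_ID"] == registry_id].iloc[0]
    merged = rec.to_dict()
    merged.update(fac[["FAC_NAME", "FAC_LAT", "FAC_LONG"]].to_dict())
    rows.append(merged)
result = pd.DataFrame(rows)
```

With 33,000 program records this per-record loop does 33,000 lookups and copies, which is where the minutes go.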
So, we're adding a lot of redundant information, which is why it takes so long. We add ECHO_EXPORTER info like name, lat, long, etc. to each record in the program-specific data, when really we only need to add it for each unique facility in the program-specific data.
What we could do is "rollup" or group my_program_data by Registry ID before this join. The problem is that we need to know how to do this grouping for each program - which column(s) do we aggregate, and do we sum or count them? Perhaps this is something we could start to figure out for each table and add to the make_data_sets.py class definition. We already accomplish this kind of data aggregation in show_chart(), but by date rather than program id.
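A rollup before the join might look like this. Again a sketch with made-up data: `NUME90Q` stands in for whatever per-program column(s) the spec in make_data_sets.py would say to aggregate, and the sum-vs-count choice is exactly the open question above.

```python
import pandas as pd

# Hypothetical program-specific table, already keyed by Registry ID.
program = pd.DataFrame({
    "REGISTRY_ID": [1001, 1001, 1002, 1002, 1002],
    "NUME90Q": [1, 0, 2, 1, 3],  # assumed numeric column to aggregate
})
echo = pd.DataFrame({
    "REGISTRY_ID": [1001, 1002],
    "FAC_NAME": ["Plant A", "Plant B"],
    "FAC_LAT": [42.1, 42.2],
    "FAC_LONG": [-71.1, -71.2],
})

# Roll up to one row per facility before the join. Which columns to
# aggregate, and whether to sum or count them, would come from a
# per-program spec (e.g. stored with each dataset in make_data_sets.py).
rollup = (program.groupby("REGISTRY_ID")
                 .agg(total=("NUME90Q", "sum"),
                      records=("NUME90Q", "count"))
                 .reset_index())

# Now the merge attaches ECHO_EXPORTER columns once per unique
# facility instead of once per program record.
joined = rollup.merge(echo, on="REGISTRY_ID", how="left")
```

The join now touches one row per facility, so its cost scales with the number of facilities rather than the number of program records.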