edgi-govdata-archiving / ECHO_modules

ECHO_modules is a Python package for analyzing a copy of the US Environmental Protection Agency's (EPA) Enforcement and Compliance History Online (ECHO) database
GNU General Public License v3.0

consider moving some utilities to the DataSetResults class #61

Open ericnost opened 10 months ago

ericnost commented 10 months ago

Currently, there are several functions in utilities.py that seem like they could be methods of the DataSetResults class because they are mostly used on data that has already been loaded in a DataSetResults instance.

For instance, instead of this:

from ECHO_modules.make_data_sets import make_data_sets
from ECHO_modules.utilities import aggregate_by_facility, point_mapper

# Create a DataSet for handling the data
ds = make_data_sets(["CWA Inspections"])
# Store results for this DataSet as a DataSetResults object
buffalo_cwa_inspections = ds["CWA Inspections"].store_results(
  region_type="Zip Code", region_value=["14201", "14202", "14303"]
)

# Aggregate each entry using this function
aggregated_results = aggregate_by_facility(
  records=buffalo_cwa_inspections, program=buffalo_cwa_inspections.dataset.name, other_records=True
)
# Map the aggregated facilities
point_mapper(
  aggregated_results["data"], aggregated_results["aggregator"], quartiles=True, other_fac=aggregated_results["diff"]
)

We could do something like:

ds = make_data_sets(["CWA Inspections"]) # Create a DataSet for handling the data
# Store results for this DataSet as a DataSetResults object.
# Note: it would be really neat to also retrieve *spatial data* in the store_results request.
# Instead of just getting CWA inspections for ZIPs 14201, 14202, and 14303, we could also
# get the outlines of those geographies. This exists in some form in the `reorganization` branch.
buffalo_cwa_inspections = ds["CWA Inspections"].store_results(
  region_type="Zip Code", region_value=["14201", "14202", "14303"]
)
# This utilities.py function would become a DataSetResults method that stores the
# aggregated data in a `self` variable for later use.
buffalo_cwa_inspections.aggregate_by_facility()
# A basic map of each facility (with inspections). This exists in some form in the
# `reorganization` branch and would rely on aggregate_by_facility() to work properly.
buffalo_cwa_inspections.show_facility_map()
# Basically what is currently called `point_mapper()`: it would symbolize facilities with
# inspections by circle size. If the spatial data (e.g. ZIP code boundaries) is already
# available, it could map those as well.
buffalo_cwa_inspections.show_data_map()
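
To make the "store the aggregated data on `self`" idea concrete, here is a minimal sketch. It is not the real DataSetResults class: `DataSetResultsSketch`, the `facility_id` key, and the list-of-dicts records are all assumptions made so the example runs on its own, and the map method only returns the facilities it would plot.

```python
from collections import Counter


class DataSetResultsSketch:
    """Hypothetical stand-in for DataSetResults; not the real ECHO_modules class."""

    def __init__(self, records, id_field="facility_id"):
        # `records` is a list of dicts, one per inspection; `id_field` is an assumed key.
        self.records = records
        self.id_field = id_field
        self.aggregated = None  # filled in by aggregate_by_facility()

    def aggregate_by_facility(self):
        """Count records per facility and keep the result on the instance."""
        self.aggregated = Counter(r[self.id_field] for r in self.records)
        return self.aggregated

    def show_facility_map(self):
        """Placeholder for a map method: returns the facility ids it would plot."""
        if self.aggregated is None:
            # Aggregate lazily so the map method never runs on missing data.
            self.aggregate_by_facility()
        return sorted(self.aggregated)


results = DataSetResultsSketch(
    [{"facility_id": "A"}, {"facility_id": "B"}, {"facility_id": "A"}]
)
facilities = results.show_facility_map()  # triggers the aggregation itself
counts = results.aggregated
```

Having the map method aggregate on demand is one possible design choice; it means callers would not need to remember to call `aggregate_by_facility()` first.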

Eventually, perhaps other utilities like get_active_facilities() and get_top_violators() could move too. I believe that would currently break the report card generation process. These functions also have less to do with the program-specific data that's usually stored in a DataSetResults instance. However, an area's facilities can be loaded using ds = make_data_sets(["Facilities"]), so get_active_facilities() and get_top_violators() could then become methods for those specific DataSetResults instances.
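
One way to limit such methods to Facilities results would be a guard on the dataset name. The sketch below assumes this; `FacilitiesResultsSketch` and the `name`, `active`, and `violations` fields are hypothetical and do not reflect the real ECHO column names or the current behavior of the utilities.

```python
class FacilitiesResultsSketch:
    """Hypothetical stand-in for a DataSetResults instance; not the real class."""

    def __init__(self, dataset_name, rows):
        self.dataset_name = dataset_name
        self.rows = rows  # list of dicts with assumed keys: name, active, violations

    def _require_facilities(self):
        # These methods only make sense for results loaded from the Facilities DataSet.
        if self.dataset_name != "Facilities":
            raise ValueError("only available for Facilities results")

    def get_active_facilities(self):
        self._require_facilities()
        return [r for r in self.rows if r["active"]]

    def get_top_violators(self, n=1):
        self._require_facilities()
        ranked = sorted(self.rows, key=lambda r: r["violations"], reverse=True)
        return ranked[:n]


rows = [
    {"name": "Plant A", "active": True, "violations": 3},
    {"name": "Plant B", "active": False, "violations": 7},
    {"name": "Plant C", "active": True, "violations": 5},
]
erie = FacilitiesResultsSketch("Facilities", rows)
active_names = [r["name"] for r in erie.get_active_facilities()]
top_name = erie.get_top_violators(n=1)[0]["name"]
```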

ericnost commented 10 months ago

Something like this is, I think, a more straightforward way of getting facilities. If we moved `get_active_facilities()`, or a copy of it, to DataSetResults, then we could also create an `active=True` flag.

from ECHO_modules.make_data_sets import make_data_sets
ds = make_data_sets(["Facilities"]) # Create a DataSet for handling the data
# Store results for this DataSet as a DataSetResults object; active=True is the proposed flag
erie_facs = ds["Facilities"].store_results(
  region_type="County", region_value=["ERIE"], state="NY", active=True
)
erie_facs.dataframe # Show the results as a dataframe
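
A rough sketch of how such a flag might behave, assuming it defaults to `False` so existing calls keep returning every facility. `store_results_sketch` and the `county`, `state`, and `active` keys are hypothetical; the function filters an in-memory list rather than querying the database.

```python
def store_results_sketch(rows, county, state, active=False):
    """Hypothetical stand-in for store_results(); filters a list instead of querying."""
    selected = [r for r in rows if r["county"] == county and r["state"] == state]
    if active:
        # Only narrow the results when the caller opts in.
        selected = [r for r in selected if r["active"]]
    return selected


rows = [
    {"name": "Plant A", "county": "ERIE", "state": "NY", "active": True},
    {"name": "Plant B", "county": "ERIE", "state": "NY", "active": False},
    {"name": "Plant C", "county": "ERIE", "state": "PA", "active": True},
]
all_erie = store_results_sketch(rows, county="ERIE", state="NY")
active_erie = store_results_sketch(rows, county="ERIE", state="NY", active=True)
```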