edgi-govdata-archiving / ECHO_modules

ECHO_modules is a Python package for analyzing a copy of the US Environmental Protection Agency's (EPA) Enforcement and Compliance History Online (ECHO) database
GNU General Public License v3.0
3 stars 6 forks source link

Add docstrings for `make_data_sets` module #11

Closed Mr0grog closed 4 years ago

Mr0grog commented 4 years ago

This adds a docstring for the module (really just turns the existing comment into a docstring) and one for the make_data_sets() function.

Will add other docstrings as a separate PR. I wanted to do this one by itself so we can hash out any necessary discussion over the style (these use the Numpy style, which is a little verbose, but reads nicely in both code form and generated/parsed docs, and is what we use in all other EDGI python projects). Also, some open questions for discussion about language below.

I didn’t want to change the function or argument names, but looking at the implementation, I realized I had misunderstood what this was supposed to do based on the name and on the video tutorial I’d watched through a bit too quickly. To clarify, I described this as working with “preset configurations,” which seemed clearer to me. Happy to adjust if that doesn’t seem good to you.

In Google Colab, this displays like:

colab-docstring-popup

Separate but related: I didn’t make any implementation changes since I wanted to keep the scope narrowly focused on documentation here, but I noticed there’s a lot of repeated boilerplate here. I think it might be useful to rewrite this using a pattern like the following:

# List of presets and their associated options.
#
# These could also be actual `DataSet` objects instead of dicts since
# DataSets don't seem to do anything when constructed and therefore
# aren't too expensive (probably only slightly moreso than these dicts).
# On the other hand, I don't know if there's an expectation that
# repeated calls to `make_data_sets()` should construct new DataSet
# instances. It doesn't look like repeated calling is really intended,
# but I'm not familiar enough with all the notebooks.
# (Also, the dicts are safer for protecting yourself in case a future
# change makes DataSet actually do something at construction time.)
PRESETS = {
    "RCRA Violations": dict(
        idx_field="ID_NUMBER", 
        base_table="RCRA_VIOLATIONS",
        table_name="RCRA_VIOLATIONS_MVIEW",
        echo_type="RCRA",
        date_field="DATE_VIOLATION_DETERMINED",
        date_format="%m/%d/%Y",
        agg_type="count", 
        agg_col="VIOL_DETERMINED_BY_AGENCY", 
        unit="violations"
    ),

    "RCRA Inspections": dict(
        idx_field="ID_NUMBER", 
        base_table="RCRA_EVALUATIONS",
        table_name="RCRA_EVALUATIONS_MVIEW",
        echo_type="RCRA",
        date_field="EVALUATION_START_DATE",
        date_format="%m/%d/%Y", 
        agg_type="count",
        agg_col="EVALUATION_AGENCY", 
        unit="inspections"
    ),
    # etc.
}

def make_data_sets(dataset_names=None):
    return {name: DataSet(name=name, **PRESETS[name])
            for name in dataset_names or PRESETS.keys()}
Mr0grog commented 4 years ago

I am not familiar with Python's DataSet object actually, hence dictionaries. They are also, I think?, easier to turn into Pandas dataframes, which is ultimately what happens in the notebooks.

Oh! DataSet isn’t a Python built-in, it’s a class defined in this repo, here: https://github.com/edgi-govdata-archiving/ECHO_modules/blob/383e75c23fa8dba656590b18d177ea089d2b3889/DataSet.py#L49-L55

Pandas definitely has built-ins for turning lists and dictionaries into dataframes, but it looks like DataSet has a few methods that create or do things with dataframes.

My concern is that a user could run CAA Violations for MA-4 then CAA Violations for MA-5. We would need an update - a new DataSet instance should be created.

Makes sense! Then I think the better approach is one that matches the example I wrote out, with a dictionary of dictionaries listing the constructor arguments for DataSet (i.e. the “presets”). I’ll make a PR for that.