edgi-govdata-archiving / ECHO-Cross-Program

Jupyter Notebooks for ECHO that use data from multiple EPA programs
https://colab.research.google.com/github/edgi-govdata-archiving/ECHO-Cross-Program/blob/master/ECHO-Cross-Programs.ipynb
GNU General Public License v3.0

Move code block 10 to separate file? #55

Closed. ericnost closed this issue 4 years ago.

ericnost commented 4 years ago

The code in block 10, for getting program-specific data, could be moved to a utilities file.

ericnost commented 4 years ago

Related to #48

ericnost commented 4 years ago

Related, the code in block 10 itself could be optimized:

echo_row = pd.DataFrame(echo_data.loc[reg_id].copy()).T.reset_index() # Find and filter to the corresponding row in ECHO_EXPORTER
echo_row = echo_row[['FAC_NAME', 'FAC_LAT', 'FAC_LONG']] # Keep only the columns we need
program_row = pd.DataFrame([list(fac)[1:]], columns=program_data.columns.values) # Turn the program_data tuple into a DataFrame
full_row = pd.concat([program_row, echo_row], axis=1) # Join the EE row df and the program row df
frames = [my_prog_data, full_row]
my_prog_data = pd.concat(frames, ignore_index=False) # Append to the running results (builds a new DataFrame every pass)

Though Steve said it was elegant :) it actually creates a lot of unnecessary DataFrames, and .concat is slow (see the quick illustration below).

I moved the Sunrise notebook's version of this to a utilities file. Could be worth implementing that here.
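To illustrate the concat issue on synthetic data (not the ECHO tables): appending a one-row DataFrame with pd.concat inside the loop re-copies everything accumulated so far on each iteration, while collecting plain dicts and building the DataFrame once avoids that.

import time
import pandas as pd

# Synthetic stand-in for the facility rows (illustration only)
source = [{'FAC_NAME': f'Facility {i}', 'FAC_LAT': 30.0, 'FAC_LONG': -90.0} for i in range(5000)]

# Pattern in block 10: concat a one-row DataFrame on every iteration
start = time.time()
result = pd.DataFrame()
for row in source:
    result = pd.concat([result, pd.DataFrame([row])], ignore_index=True)
print('per-row concat:', round(time.time() - start, 2), 's')

# Alternative: accumulate dicts and build the DataFrame once at the end
start = time.time()
result = pd.DataFrame(source)
print('single DataFrame call:', round(time.time() - start, 2), 's')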

ericnost commented 4 years ago

For instance, it took me 10 minutes to pull Water Quality Violation data for LA-2 using the existing code.

It took 1.5 minutes to do so using this:

my_prog_data = []
...
# Pull the ECHO_EXPORTER columns we need for this facility, as a {reg_id: {column: value}} dict
e = echo_data.loc[echo_data.index == reg_id].copy()[['FAC_NAME', 'FAC_LAT', 'FAC_LONG', 'DFR_URL']].to_dict('index')
e = e[reg_id] # remove indexer
p = fac._asdict() # program-specific record as a plain dict
e.update(p) # merge the two
my_prog_data.append(e)
...
pd.DataFrame(my_prog_data)

(Some of the difference may just be how responsive the database happened to be at the time.)

shansen5 commented 4 years ago

Definitely move this, probably into make_data_sets.py.
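If it does land in make_data_sets.py, here is a rough sketch of how the faster version above could be wrapped as a helper. The function name is a placeholder, and it assumes program_data is iterated with itertuples() and is indexed by registry ID, which may not match the final design:

import pandas as pd

def get_program_rows(echo_data, program_data):
    """Join each program-specific record to its ECHO_EXPORTER facility info.

    Collects plain dicts and builds one DataFrame at the end instead of
    calling pd.concat() on every iteration.
    """
    my_prog_data = []
    for fac in program_data.itertuples():
        reg_id = fac.Index  # assumes program_data is indexed by registry ID
        e = echo_data.loc[echo_data.index == reg_id].copy()[
            ['FAC_NAME', 'FAC_LAT', 'FAC_LONG', 'DFR_URL']].to_dict('index')
        if not e:
            continue  # no matching ECHO_EXPORTER row for this facility
        e = e[reg_id]  # unwrap the outer {reg_id: {...}} dict
        e.update(fac._asdict())  # merge in the program-specific fields
        my_prog_data.append(e)
    return pd.DataFrame(my_prog_data)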