Chicago / predicting-e-coli-concentrations

This repository is part of the working draft for an upcoming academic paper describing the methods and results of the City of Chicago Clear Water project.

Summarize project Methodology #1

Closed tomschenkjr closed 7 years ago

tomschenkjr commented 7 years ago

Summarize the project methodology for the "hybrid" predictive analytics model. Keep it matter-of-fact:

Don't need to add a lot of narrative at this point. Main goal is to describe how we generate predictions.

Place it under the "Methodology" section.

tomschenkjr commented 7 years ago

There is a brief remark on breakwaters and how they impact the model. This seems to be dying for either a brief statistical remark that supports the idea or a footnote. How do we know they influence the model? Are those beaches not correlated with the others? Do the breakwater beaches cluster among themselves?

tomschenkjr commented 7 years ago

There is this line:

A threshold was chosen to transform the model prediction to a binary outcome. To keep the model's false positive rate (FPR) near 1.5%, the threshold within each year that corresponded to a 1.5% FPR was noted, and the mean threshold was then used to generate predictions for the holdout validation set.

I think you have a note of the historical FPR from the model. What's the average FPR? That'll help clarify why we want it around 1.5 percent.

Similarly, what's the historical TPR average? These will really help highlight the improvements of the model (and something we should make into a graph).
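
The threshold rule quoted above (find the threshold that gives the target FPR within each year, then average those thresholds) could be sketched roughly as follows. This is a minimal illustration, not the project's code: the function names, the `(year, score, label)` record layout, and the rule "flag when score > threshold" are all assumptions made for the example.

```python
import math
from statistics import mean


def threshold_for_fpr(scores, labels, target_fpr):
    """Return a threshold such that flagging `score > threshold`
    keeps the false positive rate at or below `target_fpr`.

    Assumption: label 0 means no exceedance (a negative).
    """
    negatives = sorted(
        (s for s, y in zip(scores, labels) if y == 0), reverse=True
    )
    if not negatives:
        raise ValueError("need at least one negative example")
    # How many negatives may be flagged without exceeding the target FPR.
    allowed = math.floor(target_fpr * len(negatives) + 1e-9)
    allowed = min(allowed, len(negatives) - 1)
    # At most `allowed` negatives score strictly above this value.
    return negatives[allowed]


def mean_threshold_by_year(records, target_fpr=0.015):
    """records: iterable of (year, score, label) tuples (hypothetical layout).

    Returns (mean threshold across years, per-year thresholds).
    """
    by_year = {}
    for year, score, label in records:
        scores, labels = by_year.setdefault(year, ([], []))
        scores.append(score)
        labels.append(label)
    per_year = {
        year: threshold_for_fpr(scores, labels, target_fpr)
        for year, (scores, labels) in by_year.items()
    }
    return mean(per_year.values()), per_year
```

The mean threshold returned here would then be applied unchanged to the holdout year, so the realized FPR on that year can differ somewhat from the target.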

nicklucius commented 7 years ago

Regarding the breakwaters - we could build it up with a few things. There is talk in the prior literature about a theory associating breakwaters with higher E. coli concentrations:

"In some instances, most notably 63rd Street and Montrose, the breakwaters may effectively trap contamination that is moving along the coast with the current or they may help retain contamination at the swimming beaches that originates from terrestrial sources (e.g., beach sand, runoff)."

"E. coli concentrations at 63rd Street beach in particular are greatly influenced by the presence of breakwaters that enclose the swimming area and cause embayment conditions (10). Montrose has a similar substantial breakwater system. The prevailing long current carries suspended materials south, and the breakwaters likely trap these materials in the near-beach areas; this scenario may be amplified at 63rd Street."

So I measured all the breakwaters from the southernmost part of the beach to the northeasternmost edge of the breakwater, like this:

[image: example breakwater length measurement]

The length, measured this way, was positively correlated (R-squared = 0.54) with the total number of E. coli exceedances from 2006 to 2017.

[image: breakwater length vs. total number of E. coli exceedances]
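
For reference, an R-squared like the one reported above can be computed for a one-variable linear fit as the squared Pearson correlation. The sketch below is only an illustration of that calculation; the function name is hypothetical and the sample numbers are made up, not the measured breakwater lengths or exceedance counts.

```python
def r_squared(x, y):
    """R-squared of a simple linear regression of y on x,
    computed as the squared Pearson correlation."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    return (sxy * sxy) / (sxx * syy)


# Illustrative values only (NOT the project's data):
# breakwater length per beach and exceedance count per beach.
lengths = [0.0, 120.0, 300.0, 450.0, 800.0]
exceedances = [12, 20, 18, 35, 41]
print(round(r_squared(lengths, exceedances), 2))
```

With one observation per beach, a value near 0.54 would mean that roughly half of the between-beach variance in exceedance counts is associated with breakwater length under a straight-line fit.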

However, when I add the breakwater length to the model, performance does not improve. Since beach is already a variable, the decision trees may already capture most of the breakwater effect. Plus, the beaches with the longest breakwaters and the worst exceedance records are already removed by that point of the analysis.

nicklucius commented 7 years ago

Regarding the historical USGS all-time model performance:

                                  TPR    FPR
All Beaches                       9.0%   1.8%
10 Predicted Beaches (proposed)   3.3%   1.5%

tomschenkjr commented 7 years ago

Do you have the ROC graph for the training data?

nicklucius commented 7 years ago

Here is the ROC for the final holdout set (2016) after training on 2006 - 2015.

[image: ROC curve for the 2016 holdout set]

At an FPR of 1.5%, the TPR is 21.6%.
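
To make the TPR and FPR figures in this thread concrete, here is a minimal sketch of how the two rates follow from holdout scores once a threshold is fixed. The function name, the "flag when score > threshold" rule, and the label coding (1 = exceedance, 0 = no exceedance) are assumptions for illustration, not the project's implementation.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (TPR, FPR) when a day is flagged if score > threshold.

    Assumption: label 1 = observed exceedance, label 0 = none.
    """
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        flagged = score > threshold
        if label == 1 and flagged:
            tp += 1      # exceedance correctly flagged
        elif label == 1:
            fn += 1      # exceedance missed
        elif flagged:
            fp += 1      # false alarm
        else:
            tn += 1      # correctly left unflagged
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return tpr, fpr
```

Sweeping the threshold over all observed scores and plotting the resulting (FPR, TPR) pairs traces out a ROC curve, and the point at FPR = 1.5% is the operating point discussed above.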