Summarize project Methodology

tomschenkjr commented 7 years ago

Summarize the project methodology for the "hybrid" predictive analytics model. Keep it matter-of-fact:

How the k-means computations were completed
The process of omitting beaches and using the "intermediate" beaches and correlations to predict new beaches
The process of creating new predictions on a daily basis from the qPCR tests
The metrics that validate the model (not the 2017 pilot, but the metrics we used during development)

Don't need to add a lot of narrative at this point. Main goal is to describe how we generate predictions.

Place it under the "Methodology" section.

tomschenkjr commented 7 years ago

There is a brief remark on breakwaters and how they impact the model. This seems to be dying for either a brief statistical remark that supports the idea or a footnote. How do we know they influence the model? Are those beaches not correlated with the others? Do all the breakwater beaches cluster among themselves?

tomschenkjr commented 7 years ago

There is this line:

A threshold was chosen to transform the model prediction to a binary outcome. To keep the model's false positive rate (FPR) near 1.5%, the threshold within each year that corresponded to a 1.5% FPR was noted, and the mean threshold was then used to generate predictions for the holdout validation set.

I think you have a note of the historical FPR from the model. What's the average FPR? That'll help clarify why we want it around 1.5 percent.

Similarly, what's the historical TPR average? These will really help highlight the improvements of the model (and something we should make into a graph).
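For reference, the threshold procedure in the quoted line could be sketched roughly as follows. This is a minimal illustration, not the project's code: the function names, the input layout (one year, score, and 0/1 exceedance label per observation), and the use of a quantile of the negative-class scores to hit the target FPR are all assumptions.

```python
import numpy as np

def threshold_for_fpr(scores, labels, target_fpr=0.015):
    """Score cutoff that flags roughly target_fpr of the non-exceedance days.

    Hypothetical helper: assumes labels are 0 (no exceedance) or 1 (exceedance)
    and that a day is predicted positive when its score is above the cutoff.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    negatives = scores[labels == 0]
    # The (1 - target_fpr) quantile of the negatives leaves about target_fpr
    # of them above the cutoff, i.e. as false positives.
    return float(np.quantile(negatives, 1.0 - target_fpr))

def mean_yearly_threshold(years, scores, labels, target_fpr=0.015):
    """Average the per-year cutoffs, as described in the quoted line."""
    years = np.asarray(years)
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    per_year = [
        threshold_for_fpr(scores[years == y], labels[years == y], target_fpr)
        for y in np.unique(years)
    ]
    return float(np.mean(per_year))
```

The averaged cutoff would then be applied unchanged to the holdout year, so the holdout FPR is only expected to land near the target, not exactly on it.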

nicklucius commented 7 years ago

Regarding the breakwaters - we could build it up with a few things. There is talk in the prior literature about a theory associating breakwaters with higher E. coli concentrations:

"In some instances, most notably 63rd Street and Montrose, the breakwaters may effectively trap contamination that is moving along the coast with the current or they may help retain contamination at the swimming beaches that originates from terrestrial sources (e.g., beach sand, runoff)."

"E. coli concentrations at 63rd Street beach in particular are greatly influenced by the presence of breakwaters that enclose the swimming area and cause embayment conditions (10). Montrose has a similar substantial breakwater system. The prevailing long current carries suspended materials south, and the breakwaters likely trap these materials in the near-beach areas; this scenario may be amplified at 63rd Street."

So I measured all the breakwaters from the southernmost part of the beach to the northeasternmost edge of the breakwater, like this:

The length, measured this way, was positively correlated (R-squared = 0.54) with the total number of E. coli exceedances from 2006 to 2017.

However, adding the breakwater length to the model does not improve performance. It might be that, since beach is already a variable, the effect of the breakwaters is mostly captured within the decision trees. Plus, the longest breakwaters and worst beaches are already removed by that point of the analysis.
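A check like the R-squared above can be reproduced with a few lines. This is a sketch only: the function name is made up, and the numbers in the example are placeholders, not the measured breakwater lengths or the actual exceedance counts. For a single predictor, R-squared equals the squared Pearson correlation.

```python
import numpy as np

def r_squared(x, y):
    """R-squared of a one-variable linear fit (squared Pearson correlation)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    return float(r ** 2)

# Placeholder values for illustration only (one entry per beach).
breakwater_length = [0, 50, 120, 300, 450]
exceedance_count = [12, 20, 18, 40, 35]
print(r_squared(breakwater_length, exceedance_count))
```

The sign of the relationship comes from the correlation coefficient itself, since R-squared alone does not show direction.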

nicklucius commented 7 years ago

Regarding the all-time historical performance of the USGS model:

| | TPR | FPR |
|---|---|---|
| All Beaches | 9.0% | 1.8% |
| 10 Predicted Beaches (proposed) | 3.3% | 1.5% |

tomschenkjr commented 7 years ago

Do you have the ROC graph for the training data?

nicklucius commented 7 years ago

Here is the ROC curve for the final holdout set (2016) after training on 2006 to 2015.

At an FPR of 1.5%, the TPR is 21.6%.

Chicago / predicting-e-coli-concentrations