Chicago / predicting-e-coli-concentrations

This repository is part of the working draft for an upcoming academic paper describing the methods and results of the City of Chicago Clear Water project.

Phantom changes in commit 777f26c1fc2d532cfd0e0748e58720cdc72c174c #75

nicklucius closed this issue 6 years ago

nicklucius commented 6 years ago

The most recent commit to dev seems to have made changes to predicting-e-coli-concentrations.Rmd that are not visible when looking at the diff in GitHub. I can see the changes using this command:

git diff e826d 777f26

Here are the changes in full, copied from the shell (the text is clipped at the right of the screen). It appears to be a rollback of the last few commits:

diff --git a/bibliography/zotero-references.bib b/bibliography/zotero-references.bib
index 0dbe519..8cc3b00 100644
--- a/bibliography/zotero-references.bib
+++ b/bibliography/zotero-references.bib
@@ -963,4 +963,40 @@ increase in the indicator-bacteria count in recreational waters. Relative risk (
        year = {2012},
        pages = {69},
        file = {Recreational Water Quality Criteria.pdf:C\:\\Users\\362222\\Zotero\\storage\\4N8QL3DD\\Recreational Water Quality Criteria.pdf:application/pdf}
+}
+
+@article{dorevitch_monitoring_2017,
+       title = {Monitoring urban beaches with {qPCR} vs. culture measures of fecal indicator bacteria: {Implications} for public notification},
+       volume = {16},
+       issn = {1476-069X},
+       shorttitle = {Monitoring urban beaches with {qPCR} vs. culture measures of fecal indicator bacteria},
+       url = {https://doi.org/10.1186/s12940-017-0256-y},
+       doi = {10.1186/s12940-017-0256-y},
+       abstract = {The United States Environmental Protection Agency has established methods for testing beach water using the rapid quantitative polymerase chain reaction (qPCR) method, as well as “be
+       urldate = {2018-04-28},
+       journal = {Environmental Health},
+       author = {Dorevitch, Samuel and Shrestha, Abhilasha and DeFlorio-Barker, Stephanie and Breitenbach, Cathy and Heimler, Ira},
+       month = may,
+       year = {2017},
+       keywords = {Fecal indicator bacteria, Beach management, Quantitative polymerase chain reaction (qPCR), Surface water monitoring, Water pollution},
+       pages = {45},
+       file = {Full Text PDF:C\:\\Users\\362222\\Zotero\\storage\\T29PBC6U\\Dorevitch et al. - 2017 - Monitoring urban beaches with qPCR vs. culture mea.pdf:application/pdf;Snapshot:C\:\\Users\\362222\
+}
+
+@article{griffith_challenges_2011,
+       title = {Challenges in {Implementing} {New} {Technology} for {Beach} {Water} {Quality} {Monitoring}: {Lessons} {From} a {California} {Demonstration} {Project}},
+       volume = {45},
+       issn = {00253324},
+       shorttitle = {Challenges in {Implementing} {New} {Technology} for {Beach} {Water} {Quality} {Monitoring}},
+       url = {http://openurl.ingenta.com/content/xref?genre=article&issn=0025-3324&volume=45&issue=2&spage=65},
+       doi = {10.4031/MTSJ.45.2.13},
+       language = {en},
+       number = {2},
+       urldate = {2018-04-28},
+       journal = {Marine Technology Society Journal},
+       author = {Griffith, John F. and Weisberg, Stephen B.},
+       month = mar,
+       year = {2011},
+       pages = {65--73},
+       file = {Griffith and Weisberg - 2011 - Challenges in Implementing New Technology for Beac.pdf:C\:\\Users\\362222\\Zotero\\storage\\3LL8YLS3\\Griffith and Weisberg - 2011 - Challenges in Implemen
 }
\ No newline at end of file
diff --git a/predicting-e-coli-concentrations.Rmd b/predicting-e-coli-concentrations.Rmd
index faa550f..2399532 100644
--- a/predicting-e-coli-concentrations.Rmd
+++ b/predicting-e-coli-concentrations.Rmd
@@ -69,7 +69,6 @@ library(gridExtra)
 library(gtable)
 library(knitcitations)
 library(knitr)
-library(pROC)
 library(ROCR)
 library(RSocrata)
 webshot::install_phantomjs() #once installed on your computer, this can be commented out
@@ -139,9 +138,7 @@ Declarations of interest: none

 # Introduction
-Swimming in recreational waters contaminated with high levels of fecal indicator bacteria (FIB) is associated with higher incidents of gastrointestinal (GI) illnesses `r citep(biblio["pruss_review_1998
-
-To prevent this, managers of recreational beaches use culture-based methods to measure levels of FIB levels. Sampling is conducted early in the morning, but results take upward of 12 hours `r citep(bib
+Managers of recreational beaches use culture-based methods to measure levels of fecal indicator bacteria (FIB) levels at recreational beaches. Sampling is conducted early in the morning, but results ta

 These models often incorrectly predict that beaches will not have elevated FIB levels--known as Type II errors or "false negatives" `r citep(biblio[c("nevers_efficacy_2011", "rabinovici_economic_2004",

@@ -149,7 +146,7 @@ Meanwhile, scientists have developed new methods which measure bacteria levels i

 However, this approach has a drawback of cost and equipment availability. qPCR testing can cost between 2 to 5 times more than traditional culture-based methods `r citep(biblio["bienkowski_dna_nodate"]

-Both of these tests have generally-acceptable thresholds for acceptable FIB levels. The Environmental Protection Agency (EPA) publishes water quality criteria in accordance with the Clean Water Act. Th
+Both of these tests have generally-acceptable thresholds for acceptable FIB levels. Acceptable levels for culture-based methods should not exceed 235 CFU/100 ml while acceptable levels for qPCR testing

 We exploit the historical correlation between beaches to estimate FIB readings. We use limited qPCR results at specific beaches to predict elevated bacteria levels at other beaches using clustering alg

@@ -170,15 +167,25 @@ where $f(...)$ is a some function or algorithm that inputs raw data and outputs

 Various models are used to improve accuracy, such as log transformations `r citep(biblio["nevers_nowcast_2005"])`, polynomial coefficients `r citep(biblio["frick_nowcasting_2008"])`, logistic regressio

-Yet, the reliance on prior-day bacteria levels are likely the source of inaccuracy. The cause of elevated levels is unlikely to persist between days `r citep(biblio[c("morrison_receiver_2003","cheung_v
+Yet, the reliance on prior-day bacteria levels are likely the source of inaccuracy. The cause of elevated levels is unlikely to persist between days. Covariates can help determine when conditions are o

 ## Chicago Prior-day Nowcast Model

-Chicago Park District measures water quality of 20 beaches along 26 miles against Lake Michigan on Chicago's eastern shore. There is no single source of bacteria in Lake Michigan; rather it is likely i
+Chicago Park District measures water quality of 20 beaches along 26 miles against Lake Michigan on Chicago's eastern shore. FIB in Lake Michigan are introduced through multiple mechanisms and rarely fr
+
+Wave activity can transport FIB to beaches and encourage resuspesion (Ge, 2010). Moreover, wet sand can act as a repository for FIB, storing it between seasons and outside of cyclical patterns (Alm, 20
+
+The effect of waves can be accentuated when beaches have a U-shape where FIB can be resuspended or trapped in swash zones (Ge, 2012). Likewise, breakwaters contribute to the existence of FIB in some be
+
+Birds are also a significant source of contamination at beaches (Levesque, 2000; Haack, 2003; Alm, 2018). Gulls, in particular, are often associated with FIB concentrations in beaches. Contamination ca

-Beaches operate from Memorial Day weekend, which is just before the last Monday in May, through Labor Day, which is the first Monday in September -- approximately 122 days. As a result, there are appro
+Lack of solar radiation can resuscitate _E. coli_ levels overnight and reduce the rate of inactivation during cloudy days (Whitman, 2003).

-Between 2011 and 2016, Chicago Park District placed hydrometeorological sensors to automatically collect covariates on water and atmospheric conditions. Buoys were installed at five Chicago beaches--Fo
+The Chicago River is contiguous to Lake Michigan but a lock limits water flowing into the lake. The City of Chicago has a combined sewer system which, during excess rain events, must redirect excess ma
+
+Beaches operate between Memorial Day weekend through Labor Day -- approximately 122 days. As a result, there are approximately 2,440 "beach days," which each represent an observation in the model. An i
+
+Between 2011 and 2016, Chicago Park District placed hydrometeorological sensors to automatically collect covariates on water and atmospheric conditions. Buoys were installed at five Chicago beaches--Fo

 Water samples were collected for culture-based testing each morning and recorded, usually around noon. Sampling was done on weekdays; however, weekend and holiday sampling was conducted if the prior re

@@ -186,11 +193,11 @@ Water samples were collected for culture-based testing each morning and recorded

 Predictions were obtained from a random forest model and the predictions were published online for beach visitors. In 2015 and 2016, the overall accuracy was 90 and 93%, respectively, and specificity (

-Beginning in 2015, Chicago Park District began to use limited qPCR testing of enterococci at five beaches. Data was collected but not incorporated into the predictive model. During the summer of 2017,
+Beginning in 2015, Chicago Park District began to use limited qPCR testing at five beaches. Data was collected but not incorporated into the predictive model. During the summer of 2017, qPCR testing wa

 ## Hybrid Nowcast Model

-`r citet(biblio["whitman_summer_2008"])` observed that bacteria levels at Chicago beaches often fluctuate with each other on the same day where extreme highs and extreme lows were simultaneous for most
+`r citet(biblio["whitman_summer_2008"])` observed that bacteria levels at Chicago beaches often fluctuate with each other on the same day where extreme highs and extreme lows were simultaneous for most

 ```{r correlation_heatmap, echo=FALSE, fig.width=5, fig.height=4, fig.align='center', fig.cap='Pearson correlation coefficient heat map of daily E. coli levels at Chicago beaches between 2006 and 2017.

@@ -223,7 +230,7 @@ To generate the predictions, we can formulate a model that limits predictions to
 $$ x_{i \in k}^t = f \left( \hat{x}_{i \in k}^t \right) $$
 so the feature beach $\hat{x}_{i \in k}^t$, the $i^\textrm{th}$ beach in cluster $k$ uses data from time $t$ to predict the remaining beaches ($x_{i \in k}^t$) in the same cluster in the same time peri

-This model leverages observations from the same time $t$ to predict FIB levels at the other beaches on the same day. The rapid results from qPCR testing of enterococci allows recreational beach manager
+This model leverages observations from the same time $t$ to predict bacteria levels at the other beaches on the same day. The rapid results from qPCR testing allows recreational beach managers to obser

 ## Identifying Beach Clusters

@@ -256,7 +263,7 @@ knitr::kable(table_beach_clusters, caption = "Final results of K-means clusterin ```

-Within each cluster, the beach with the most historical culture-based _E. coli_ exceedances was selected to be the feature beach whose enterococci qPCR result would be input to the model. By rapid test
+Within each cluster, the beach with the most _E. coli_ exceedances was selected to be the feature beach whose qPCR result would be input to the model. By rapid testing the beaches with the most freques

 ## Building the Predictive Model

@@ -389,8 +396,6 @@ tpr2017Pilot <- tp2017Pilot / (tp2017Pilot + fn2017Pilot)
 fpr2017Pilot <- fp2017Pilot / (fp2017Pilot + tn2017Pilot)
 acc2017Pilot <- (tp2017Pilot + tn2017Pilot) / (tp2017Pilot + tn2017Pilot + fp2017Pilot + fn2017Pilot)
 prec2017Pilot <- tp2017Pilot / (tp2017Pilot + fp2017Pilot)
-roc2017Pilot <- roc(pilot2017$actualHigh, pilot2017$Predicted.Level)
-auc2017Pilot <- auc(roc2017Pilot)[1]

 # usgs 2016 results

@@ -434,8 +439,7 @@ tpr2016usgs <- tp2016usgs / (tp2016usgs + fn2016usgs)
 fpr2016usgs <- fp2016usgs / (fp2016usgs + tn2016usgs)
 acc2016usgs <- (tp2016usgs + tn2016usgs) / (tp2016usgs + tn2016usgs + fp2016usgs + fn2016usgs)
 prec2016usgs <- tp2016usgs / (tp2016usgs + fp2016usgs)
-roc2016usgs <- roc(usgs2016$actualHigh, usgs2016$Predicted.Level)
-auc2016usgs <- auc(roc2016usgs)[1]
+

 # usgs 2015 results

@@ -478,8 +482,6 @@ tpr2015usgs <- tp2015usgs / (tp2015usgs + fn2015usgs)
 fpr2015usgs <- fp2015usgs / (fp2015usgs + tn2015usgs)
 acc2015usgs <- (tp2015usgs + tn2015usgs) / (tp2015usgs + tn2015usgs + fp2015usgs + fn2015usgs)
 prec2015usgs <- tp2015usgs / (tp2015usgs + fp2015usgs)
-roc2015usgs <- roc(usgs2015$actualHigh, usgs2015$Predicted.Level)
-auc2015usgs <- auc(roc2015usgs)[1]

 plotDf <- data.frame("rate" = c("True Positive Rate","True Positive Rate","True Positive Rate",
                                 "False Positive Rate","False Positive Rate","False Positive Rate"),
@@ -515,12 +517,9 @@ modelCompData <- data.frame("Model" = c("2017 Hybrid", "2016 Prior-day", "2015 P
                                             round(prec2015usgs, 3)),
                             "Accuracy" = c(round(acc2017Pilot, 3),
                                            round(acc2016usgs, 3),
-                                           round(acc2015usgs, 3)),
-                            "AUC" = c(round(auc2017Pilot, 3),
-                                      round(auc2016usgs, 3),
-                                      round(auc2015usgs, 3)))
+                                           round(acc2015usgs, 3)))

-knitr::kable(modelCompData, format = "pandoc", caption = "Comparing specificity, sensitivity, accuracy, and AUC between Hybrid and Prior-day Nowcast models.")
+knitr::kable(modelCompData, format = "pandoc", caption = "Comparing specificity, sensitivity, and accuracy between Hybrid and Prior-day Nowcast models.")

@@ -535,7 +534,7 @@ The difference in performance is likely due to the use of bacteria levels on the

 Although Chicago was able to deploy qPCR testing at all 20 beaches, the cost and complexity of qPCR equipment limits widescale deployment. For instance, the lab conducting qPCR testing for Chicago has

-We removed four beaches from this model because of physical characteristics, such as long breakwaters, that prevented them from being clustered with other beaches; one beach was removed due to frequent
+We removed four beaches from this model because of physical characteristics, such as long breakwaters, that prevented them from being clustered with other beaches; one beach was removed due to frequent

 This model selected feature beaches that were prone to elevated bacteria levels; Foster, North Avenue, Leone, 31st, and South Shore. Moreover, we removed Calumet from the model since it frequently had

nicklucius commented 6 years ago

To fix this, I'll create a new branch for this issue based off https://github.com/Chicago/predicting-e-coli-concentrations/commit/e826d8b9feb57771fcc2367a479f3875d9ea90cb. Then I'll replace the Zotero file with the copy contained in https://github.com/Chicago/predicting-e-coli-concentrations/commit/777f26c1fc2d532cfd0e0748e58720cdc72c174c. Then I'll submit a pull request from the issue branch to dev.
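
The plan above could be sketched as a small shell helper. This is only an illustration, not the commands that were actually run: the function name `restore_file_on_branch` and the branch name in the usage note are hypothetical. It relies on `git checkout <commit> -- <path>`, which copies a single file from another commit into the working tree and index without touching any other file.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): start a branch at a
# known-good commit and take one file from a later commit, so every other
# file stays as it was at the base commit.
restore_file_on_branch() {
    base_commit=$1    # commit the new branch starts from
    source_commit=$2  # commit holding the wanted copy of the file
    file_path=$3      # path of the file to take from source_commit
    branch_name=$4    # name for the new branch (any unused name)

    # Create the branch at the base commit and switch to it.
    git checkout -b "$branch_name" "$base_commit" &&
    # Copy only the requested file from the later commit and stage it.
    git checkout "$source_commit" -- "$file_path" &&
    # Record the change; this fails harmlessly if the file is unchanged.
    git commit -m "Take $file_path from $source_commit"
}
```

For this issue the call would look something like `restore_file_on_branch e826d8b9feb57771fcc2367a479f3875d9ea90cb 777f26c1fc2d532cfd0e0748e58720cdc72c174c bibliography/zotero-references.bib issue-75`, where `issue-75` is an assumed branch name. Pushing that branch with `git push -u origin issue-75` would then allow a pull request against dev.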