Chicago / predicting-e-coli-concentrations

This repository is part of the working draft for an upcoming academic paper describing the methods and results of the City of Chicago Clear Water project.

Remarks on results and discussion sections #53

tomschenkjr closed 6 years ago

tomschenkjr commented 6 years ago

The results and discussion sections need to be significantly improved by adding references to the wealth of literature on beach monitoring worldwide. How does this model compare with other existing predictive models globally? If Chicago is using qPCR, why don't they also look for pathogenic organisms and not just indicators? This would be more novel and interesting, even if the readers were to comment on this. What other factors can impact persistence of pathogens in the aquatic environment (temperature, sunlight, wave intensity, etc.)? There is a wealth of literature now available through the Global Water Pathogen Project. Authors in the US such as Wade, Rose, and Boehm have done lots of beach research. Others in Australia, such as McCarthy and Deletic, have also contributed to the literature on beach monitoring. The authors need to better justify their results and findings and compare them to the work of others. The authors have some of these references in their reference list, but they need to use them in their discussion.

tomschenkjr commented 6 years ago

For this, I think the proper resolution is to reframe the discussion to clarify the context of the prior-day nowcast model versus hybrid nowcast modeling. The approach of our paper was to (1) point out that all predictive models use a combination of weather/hydrometeorological data and prior-day lab tests, (2) propose a new model form using qPCR and inter-beach correlations, and (3) compare the performance of the hybrid model versus the prior-day model at Chicago's beaches.

Here are some ideas for a reworked discussion section

Models attempting to forecast FIB levels at beaches have essentially used the same functional form. Lab data from the previous day is combined with various predictors to predict whether FIB levels will exceed the suggested thresholds. Innovations have come from finding novel ways to collect the predictors, such as hydrometeorological sensors, that improve accuracy and save time. Likewise, more sophisticated algorithms, such as machine learning and genetic algorithms, have been used to improve performance.

Yet the concept behind these models remains the same: they rely on prior-day laboratory results, an approach we've dubbed the "prior-day nowcast model". Evidence suggests that the factors driving FIB levels do not persist from day to day (citations). That can explain why the performance of many attempts to predict FIB levels at beaches is relatively low. Despite improvements to the analytical methods, those models are still dependent on day-old FIB data.

Previous research has found that FIB levels in Chicago's beaches are highly correlated (citation) and Chicago beaches rarely encounter consecutive days of elevated FIB levels. At the same time, qPCR testing has become more widely used, but is still expensive. Because qPCR testing provides immediate results, we proposed the hybrid nowcast model to use limited qPCR data to predict FIB levels in other beaches.

The hybrid nowcast model removes the dependency on day-old FIB information that is commonly used in other models. This approach more closely resembles a "missing data" problem, where we are attempting to "fill in" the missing values (beaches without qPCR testing). For beach networks that are highly correlated, like Chicago's, hybrid nowcasting was able to increase model sensitivity without increasing the rate of false positives.
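To make the two metrics in that last sentence concrete, here is a minimal sketch of how sensitivity and the false-positive rate could be computed from paired actual/predicted exceedance flags. The function name and the sample data are made up for illustration; this is not the evaluation code from the repository.

```python
def sensitivity_and_fpr(actual, predicted):
    """Return (sensitivity, false_positive_rate) for paired boolean lists.

    actual[i]    -- True if the lab result exceeded the FIB threshold
    predicted[i] -- True if the model issued an exceedance prediction
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # exceedances caught
    fn = sum(1 for a, p in pairs if a and not p)      # exceedances missed
    fp = sum(1 for a, p in pairs if not a and p)      # false alarms
    tn = sum(1 for a, p in pairs if not a and not p)  # correct all-clears
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return sensitivity, fpr


# Made-up beach-days, purely illustrative (not project data).
actual = [True, True, False, False, False, True, False, False]
predicted = [True, False, False, True, False, True, False, False]
sens, fpr = sensitivity_and_fpr(actual, predicted)
```

Comparing two models then comes down to checking whether one raises `sens` while `fpr` stays flat or falls.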

Hybrid modeling uses a different approach. By identifying "clusters" of beaches, we exploit the inter-beach correlation to formulate a prediction. While this model used a random forest, the analytical model could be adjusted to use other approaches, such as genetic algorithms. Likewise, we clustered beaches using a basic k-means algorithm, but other methods can also be used. In either case, it seems that a significant improvement comes from shifting away from prior-day lab results.
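As a rough illustration of exploiting inter-beach correlation, the sketch below pairs each beach without qPCR testing with the qPCR-tested beach whose historical readings it tracks most closely. This is a deliberately simplified stand-in, not the k-means clustering or random forest described above; the beach names, the readings, and the helper functions are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def assign_to_tested(history, tested):
    """Map each untested beach to its most correlated qPCR-tested beach."""
    groups = {}
    for beach, series in history.items():
        if beach in tested:
            continue  # tested beaches have same-day results already
        groups[beach] = max(tested, key=lambda t: pearson(series, history[t]))
    return groups


# Hypothetical beaches with made-up log-scale FIB readings on the same days.
history = {
    "tested_a": [1.0, 2.0, 3.0, 2.0, 1.0],
    "tested_b": [3.0, 1.0, 2.0, 3.0, 1.0],
    "beach_x": [1.1, 2.2, 2.9, 2.1, 0.9],
    "beach_y": [2.8, 1.2, 2.1, 3.1, 0.8],
}
groups = assign_to_tested(history, ["tested_a", "tested_b"])
```

In a fuller version, the same-day qPCR result from the paired (or clustered) tested beach would become a predictor in whatever model is fit for the untested beach.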

Second, the selected qPCR testing was tactically chosen for beaches with higher rates of exceeding acceptable FIB levels. This helps reduce the variance needed to be explained by the model...

nicklucius commented 6 years ago

This is great!

I'll add that qPCR was also chosen for beaches known in prior literature and/or found in our own study to have little predictive value, likely due to idiosyncratic geographical features. These beaches had higher rates of exceedances, which provided another tactical benefit. And by isolating beaches whose individual characteristics tend to contribute to outlier FIB levels, we were able to build a model excluding those beaches and only including beaches whose FIB levels tend toward the regional mean.