Chicago / predicting-e-coli-concentrations

This repository is part of the working draft for an upcoming academic paper describing the methods and results of the City of Chicago Clear Water project.

Were 63rd, Rainbow, Montrose, Ohio, and Calumet qPCR tested or predicted? #7

Closed kbrose closed 6 years ago

kbrose commented 6 years ago

We removed 63rd Street, Rainbow, Montrose, and Ohio, which have long breakwaters or a similar feature. Likewise, our earlier analytical modeling showed that beaches with a high frequency of high bacteria levels often confounded the model. Calumet, which had high exceedances as well as a medium-sized breakwater, was also removed from the analytical model. These 5 removed beaches comprised two of the clusters from the K-means analysis.

The remaining three clusters were reanalyzed using 5-cluster k-means with the same variables.

[Image: cluster-table]

Does this mean the five beaches listed were neither tested nor predicted in the hybrid model? It sounded like they were put into two clusters of their own, but the table does not list them anywhere in the five clusters.

nicklucius commented 6 years ago

@kbrose - that's right, those 5 beaches were removed from the analysis at the clustering stage, prior to modeling. So the model neither uses their test results as a feature nor predicts them. But in the hybrid method as a whole, these beaches are designated to be tested with qPCR. The idea is that they are not good for use in prediction, and they account for something like 50% of all exceedances.

After their removal, the remaining beaches were reclustered to get the 5 clusters above.
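The remove-then-recluster step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the feature vectors, the `beach_*` names, and the choice of `k=2` for the toy data are all hypothetical (the real analysis used 5 clusters and variables not listed in this thread), and the plain Lloyd's k-means here stands in for whatever implementation was actually used.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Beaches removed before reclustering; these names come from the thread.
removed = {"63rd Street", "Rainbow", "Montrose", "Ohio", "Calumet"}

# Hypothetical per-beach feature vectors, for illustration only.
features = {
    "Calumet": (9.0, 9.0),
    "beach_a": (1.0, 1.0),
    "beach_b": (1.2, 0.8),
    "beach_c": (5.0, 5.0),
    "beach_d": (5.1, 4.9),
}

# Drop the removed beaches first, then cluster only what remains.
kept = {name: vec for name, vec in features.items() if name not in removed}
names = list(kept)
labels = kmeans([kept[name] for name in names], k=2)
clusters = dict(zip(names, labels))
```

The point of the sketch is the ordering: the excluded beaches never enter the clustering input, so they cannot appear in any of the resulting clusters.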

kbrose commented 6 years ago

Ok, so that results in 10 beaches being qPCR tested in total?

nicklucius commented 6 years ago

Correct.
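For anyone following along, the count of 10 can be checked with a bit of set arithmetic. This is a sketch under two assumptions that the thread implies but does not spell out: exactly one beach per cluster is qPCR tested, and the 15 remaining beaches are split evenly across the 5 clusters (the even split only affects the toy data, not the totals). The `beach_NN` labels are placeholders, not the real cluster assignments.

```python
# Beaches removed before clustering; named in this thread.
removed = {"63rd Street", "Rainbow", "Montrose", "Ohio", "Calumet"}

# Placeholder labels for the 15 remaining beaches (hypothetical), split into
# 5 clusters of 3 purely for illustration.
remaining = [f"beach_{i:02d}" for i in range(1, 16)]
clusters = [remaining[i:i + 3] for i in range(0, 15, 3)]

# Assumption: one beach per cluster is qPCR tested and used as a feature.
feature_beaches = {cluster[0] for cluster in clusters}

# Tested = removed beaches plus one feature beach per cluster.
tested = removed | feature_beaches
# Predicted = every remaining beach that is not a feature beach.
predicted = set(remaining) - feature_beaches

print(len(tested), len(predicted))
```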

kbrose commented 6 years ago

Ok cool. So is this quote describing the output of that clustering process or is it talking about something else?

The feature beaches were Calumet, Rainbow, South Shore, 63rd Street, and Montrose, and the predicted beaches were the other 15 regularly tested beaches. The model was trained using qPCR test results for 2015 and 2016 and fit to predict the culture-based levels...

nicklucius commented 6 years ago

This is talking about something else, so I'll explain. To perform the clustering and validate the hybrid model concept, we used the 10 years of culture testing results for 20 beaches, since it is accepted that culture tests of E. coli and qPCR of enterococci are getting at the same thing, albeit with different units of measurement. If the model works with historical culture tests, a model trained on qPCR results should work the same way because it's the underlying relationship between beaches that makes these predictions viable.

This past summer, no new culture test results were collected and only qPCR tests were done (for all 20 beaches). So the pilot model had to take qPCR test results as inputs. We only had 2 years of qPCR results to train on, and they were for Calumet, Rainbow, South Shore, 63rd Street, and Montrose. So even though the pilot model has only 2 years of training data, gets none of the benefits of clustering, and predicts 15 beaches using the 5 idiosyncratic beaches, it still beat the prior-day USGS model by 3 times. The conclusion is that the pilot supports the viability of the modeling concept.

I have thought about building a middle-layer model that translates the 10 years of culture tests into predicted qPCR test results, which could then provide 10 years of training data to use as new qPCR data is collected. I've been wondering if that would work to help with the limited data issue.
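One very simple form such a middle layer could take is a log-log linear fit between paired readings. This is only a sketch of the idea, not something that exists in the repository: it assumes paired same-day culture and qPCR readings are available for fitting, the function names `fit_log_linear` and `culture_to_qpcr` are made up here, and all numbers are invented for illustration.

```python
import math


def fit_log_linear(culture, qpcr):
    """Least-squares fit of log10(qpcr) = a + b * log10(culture)."""
    xs = [math.log10(c) for c in culture]
    ys = [math.log10(q) for q in qpcr]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def culture_to_qpcr(culture_value, a, b):
    """Translate one culture reading into a pseudo-qPCR reading."""
    return 10 ** (a + b * math.log10(culture_value))


# Invented paired readings (culture, qPCR) used only to fit the mapping.
paired_culture = [10.0, 100.0, 1000.0]
paired_qpcr = [20.0, 200.0, 2000.0]
a, b = fit_log_linear(paired_culture, paired_qpcr)

# Invented historical culture readings, translated into pseudo-qPCR values
# that could then serve as extra training data.
historical_culture = [50.0, 235.0, 800.0]
pseudo_qpcr = [culture_to_qpcr(c, a, b) for c in historical_culture]
```

Whether a fit like this is accurate enough to be useful is an open question; the translated values would carry the fit's error into any model trained on them.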

kbrose commented 6 years ago

That all makes sense, thanks for explaining that Nick.

I think most of my confusion comes from the Identifying Beach Clusters and Building the Predictive Model sections.

Are these sections describing an ideal model that could not be implemented because of the lack of long-term historical qPCR data? And then the last paragraph in Building the Predictive Model describes the actual model whose performance is described in subsequent sections?

tomschenkjr commented 6 years ago

The analytical model consists of two parts. First, the Identifying Beach Clusters section describes the approach to identifying which beaches move in tandem. Then, once that is completed, the Building the Predictive Model section describes building the model that performs the prediction.

kbrose commented 6 years ago

Ok, I think this explanation along with the addition of the MSE figures helped quite a bit. Thanks!