Closed: kbrose closed this issue 6 years ago
@kbrose - that's right, those 5 beaches were removed from the analysis at the clustering stage, prior to modeling. So, the model does not use their test results as a feature or predict them. But, in the hybrid method as a whole, these beaches are designated to be tested with qPCR. The idea is that they are not good for use in prediction, and account for something like 50% of all exceedances.
After their removal, the remaining beaches were reclustered to get the 5 clusters above.
Ok, so that results in 10 beaches being qPCR tested in total?
Correct.
Ok cool. So is this quote describing the output of that clustering process or is it talking about something else?
The feature beaches were Calumet, Rainbow, South Shore, 63rd Street, and Montrose, and the predicted beaches were the other 15 regularly tested beaches. The model was trained using qPCR test results for 2015 and 2016 and fit to predict the culture-based levels...
This is talking about something else, so I'll explain. To perform the clustering and validate the hybrid model concept, we used the 10 years of culture testing results for 20 beaches, since it is accepted that culture tests of E. coli and qPCR of enterococci are getting at the same thing, albeit with different units of measurement. If the model works with historical culture tests, a model trained on qPCR results should work the same way because it's the underlying relationship between beaches that makes these predictions viable.
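To make the clustering idea concrete, here's a toy sketch of grouping beaches that "move in tandem" by correlation of their historical culture readings. The data is invented and the correlation-threshold grouping is a stand-in for illustration, not the actual clustering algorithm used in the analysis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented example: weekly culture results for 6 beaches;
# beaches 0-2 and 3-5 are constructed to co-move.
n_obs = 520
base_a = rng.normal(size=n_obs)
base_b = rng.normal(size=n_obs)
readings = np.column_stack(
    [base_a + 0.3 * rng.normal(size=n_obs) for _ in range(3)]
    + [base_b + 0.3 * rng.normal(size=n_obs) for _ in range(3)]
)

# Correlation between beaches captures which ones move in tandem.
corr = np.corrcoef(readings.T)

# Greedy grouping: beaches join a cluster when their correlation with
# its seed beach exceeds a threshold (illustrative only).
threshold = 0.7
clusters, assigned = [], set()
for i in range(corr.shape[0]):
    if i in assigned:
        continue
    group = [j for j in range(corr.shape[0])
             if j not in assigned and corr[i, j] > threshold]
    assigned.update(group)
    clusters.append(group)

print(clusters)  # beaches 0-2 and 3-5 end up in separate groups
```

The point being: it's the stable correlation structure between beaches, not the measurement units, that the clustering relies on, which is why validating on culture results should carry over to qPCR.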
This past summer, no new culture test results were collected and only qPCR tests were done (for all 20 beaches). So the pilot model had to take qPCR test results as inputs. We only had 2 years of qPCR results to train on, and they were for Calumet, Rainbow, South Shore, 63rd Street, and Montrose. So even though the pilot model has only 2 years of training data, none of the benefits of clustering, and is predicting 15 beaches using the 5 idiosyncratic beaches, it still outperformed the prior-day USGS model by a factor of 3. The conclusion is that the pilot supports the viability of the modeling concept.
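Structurally, the pilot setup amounts to "regress levels at the 15 predicted beaches on readings from the 5 feature beaches." A minimal sketch with made-up numbers (not the real model, features, or data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented qPCR readings at 5 "feature" beaches, used to predict
# levels at the remaining 15 beaches.
n_days, n_feature, n_pred = 200, 5, 15
X = rng.lognormal(mean=3.0, sigma=1.0, size=(n_days, n_feature))

# Fake ground truth: each predicted beach is a noisy mix of features.
W_true = rng.uniform(0.0, 1.0, size=(n_feature, n_pred))
Y = X @ W_true + rng.normal(scale=5.0, size=(n_days, n_pred))

# Fit one linear model per predicted beach via ordinary least squares.
X1 = np.column_stack([np.ones(n_days), X])  # add intercept column
W_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)

# In-sample MSE per predicted beach (the report quotes MSE similarly).
mse = ((X1 @ W_hat - Y) ** 2).mean(axis=0)
print(mse.round(1))
```

The actual model is of course richer than a plain OLS fit, but the input/output shape is the same: 5 tested beaches in, 15 predictions out.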
I have thought about building a middle-layer model that translates the 10 years of culture tests into predicted qPCR test results, which could then provide 10 years of data to train on as new qPCR data is collected. I've been wondering if that would work to help with the limited data issue.
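As a sketch of what that middle layer might look like, assuming a simple log-linear relationship between culture and qPCR readings (an assumption for illustration, not something validated here, and all numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(2)

# Pretend we have a paired overlap period where both tests were run:
# learn the culture -> qPCR mapping on it, then apply the mapping to
# the 10-year culture archive to synthesize qPCR-scale training data.
culture_overlap = rng.lognormal(3.0, 1.0, size=150)
qpcr_overlap = (0.8 * np.log(culture_overlap) + 1.5
                + rng.normal(0.0, 0.2, size=150))

# Fit qPCR ~ a * log(culture) + b by least squares.
A = np.column_stack([np.log(culture_overlap), np.ones(150)])
(a, b), *_ = np.linalg.lstsq(A, qpcr_overlap, rcond=None)

# Translate the historical culture archive into synthetic qPCR values.
culture_archive = rng.lognormal(3.0, 1.0, size=5200)  # ~10 years
qpcr_synthetic = a * np.log(culture_archive) + b

print(round(a, 2), round(b, 2))
```

The catch is that any error in the translation step propagates into the downstream model's training data, so the gain from 10 extra years would have to outweigh that added noise.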
That all makes sense, thanks for explaining that Nick.
I think most of my confusion comes from the Identifying Beach Clusters and Building the Predictive Model sections.
Are these sections describing an ideal model that could not be implemented because of the lack of long-term historical qPCR data? And then the last paragraph in Building the Predictive Model describes the actual model whose performance is described in subsequent sections?
The analytical model consists of two parts. First, Identifying Beach Clusters describes the approach to identifying which beaches move in tandem. Then, once that is completed, Building the Predictive Model covers building the model that performs the prediction.
Ok, I think this explanation along with the addition of the MSE figures helped quite a bit. Thanks!
Does this mean the five beaches listed were neither tested nor predicted in the hybrid model? It sounded like they made up two of the five clusters on their own, but then the table does not list them anywhere in the five clusters.