Chicago / clear-water

Forecasting elevated levels of E. coli at Chicago beaches to provide proper warning to beach-goers.
http://chicago.github.io/clear-water

Choose different variables and try PCA and LDA #105

Closed PriyaDoIT closed 7 years ago

nicklucius commented 7 years ago

@CallinOsborn here is the branch of @kbrose's repo that has the PCA filters code that produces the results below. Maybe this can help you. We can talk more tomorrow.

[ROC plot: 2016-10-16t13_02_37_072235--roc]

nicklucius commented 7 years ago

@CallinOsborn is attempting to evaluate the above code's performance on 2016 data.

It appears that PCA trains on all years of weather and E. coli data simultaneously, without leaving a year out. This is done as a preprocessing step before the model is created.

The PCA is then used to filter the training data given to the model during training and leave-one-year-out validation. The concern is that the PCA trains on the same data the model is later validated on: because the PCA runs as a preprocessing step and its output feeds into model training, the model could effectively be handed the answers to the validation data while training.

Ultimately, if we can validate the model against 2016 data while making sure none of the 2016 data is used during PCA training or model training, we can measure real-world predictive performance.
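
To make the concern concrete, here is a minimal sketch of the fix, with the PCA fit inside the validation loop (the column names and threshold match the code further down; the component count and random forest are stand-ins):

    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier

    def loyo_validate(df, feature_cols, start, stop, n_components=10):
        """Leave-one-year-out validation where PCA never sees the held-out year."""
        for yr in range(start, stop + 1):
            holdout = df['Full_date'].dt.year == yr
            train, test = df[~holdout], df[holdout]

            # Fit the PCA filter on the training years only, so nothing from
            # the held-out year leaks into the features the model trains on.
            pca = PCA(n_components=n_components)
            X_train = pca.fit_transform(train[feature_cols])
            X_test = pca.transform(test[feature_cols])

            y_train = train['Escherichia.coli'] > 235
            y_test = test['Escherichia.coli'] > 235

            model = RandomForestClassifier(n_estimators=500)
            model.fit(X_train, y_train)
            print(yr, model.score(X_test, y_test))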

kbrose commented 7 years ago

Honestly don't remember writing any PCA code at all, but if it was not done with a leave-one-year-out approach then it's going to be next to worthless.

nicklucius commented 7 years ago

I am going to try to leave a year out during PCA training to see how that affects performance on 2006-2015. Looks like line 68 is where I can pass the year as an argument to prepare_data():

67       for yr in range(start, stop+1):
68           predictors, meta_info = prepare_data(df)
69           timestamps = meta_info['Full_date']
70           classes = meta_info['Escherichia.coli'] > 235

. . .

115  def prepare_data(df=None, leaveout=None):
116      '''
117      Preps the data to be used in the model. Right now, the code itself must
118      be modified to tweak which columns are included in what way.
119      Parameters
120      ----------
121      df       : Dataframe to use. If not specified, the dataframe is
122               loaded automatically.
123      leaveout : If not None, then this is an integer specifying which
124               year should be left out. Used for preparations that
125               use information across all years.
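
For clarity, the change at line 68 would just thread the loop variable through, assuming prepare_data() wires leaveout into the cross-year preprocessing:

    67       for yr in range(start, stop+1):
    68           predictors, meta_info = prepare_data(df, leaveout=yr)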

nicklucius commented 7 years ago

@kbrose I think it was code you borrowed from someone else to see if you could replicate their results. Your repo is the only place we could find the PCA filters code in a runnable format. Unfortunately, the Slack channel history is gone, so we are trying to resurrect some ideas the team developed back in the spring.

nicklucius commented 7 years ago

I haven't been able to get this PCA code running with leave-one-year-out validation. Adding a leaveout argument to prepare_data() breaks the code, and after chasing down the errors I found that each fix caused another error downstream.

For the weather-only dataset, we have not found a viable or reproducible way to improve on the USGS/EPA model. I don't see a reason to spend a lot of time chasing down these errors because the result will likely be similar to what we've seen so far.

I suggest that we check 2016 results for the model that uses water-sensor data in addition to weather data. It beat the USGS/EPA model (albeit on a very limited sample of data from 2014-2015). It would be interesting to see how it does on 2016 when trained on 2014-2015 data.

The red is the USGS/EPA results on this sample, the blue is our model's results, and the black line marks the 5% false positive rate.

[ROC plot: our model (blue) vs. USGS/EPA (red), with the 5% false positive line in black]

CallinOsborn commented 7 years ago

Do you know what the precision was on the one that uses water-sensor data?

nicklucius commented 7 years ago

The TPR/recall on the ROC is about 50% at a 5% FPR. At that level of recall, the precision is around 60-70%. Keep in mind this is a small sample, so expect a large margin of error.

[ROC plot for the water-sensor model]
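
Those figures are consistent with a fairly high exceedance rate in this sample. A quick sketch of the arithmetic, with the prevalence assumed for illustration:

    # precision from TPR, FPR, and prevalence p:
    #   precision = TPR*p / (TPR*p + FPR*(1-p))
    tpr, fpr, p = 0.50, 0.05, 0.15   # p = assumed fraction of exceedance days
    precision = tpr * p / (tpr * p + fpr * (1 - p))
    print(round(precision, 2))       # 0.64 -- within the 60-70% range above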

CallinOsborn commented 7 years ago

I have uploaded my work for the week here.

I made a couple of observations while looking into possible factors.

nicklucius commented 7 years ago

Water Sensor Model 2016 Results

For 2016, the number of water sensors with data available on the portal decreased from 6 to 3, which limited the amount of data available for modeling.

| Year | Water Sensors with Available Data |
|------|-----------------------------------|
| 2016 | Calumet, Ohio, Montrose |
| 2015 | 63rd, Calumet, Ohio, Osterman, Montrose, Rainbow |
| 2014 | 63rd, Calumet, Ohio, Osterman, Montrose, Rainbow |

Due to this limitation, I tried two methods.

  1. Assign one of the 3 water sensors to each beach
  2. Model and predict only these three beaches

In both cases, I trained a random forest model on 2014-2015 data and validated it using 2016 data. For predictors, I tried many variations using both weather and water-sensor data. Many versions of the model used hand-picked combinations, but I also performed LDA to choose predictors, building models from the 5-10 variables with the largest positive and negative coefficients in my LDA results.
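
A minimal sketch of that selection step (hypothetical helper; assumes X is the predictor DataFrame and y is the binary exceedance label):

    import pandas as pd
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def top_lda_predictors(X: pd.DataFrame, y: pd.Series, k: int = 10) -> pd.Series:
        """Rank predictors by LDA coefficient magnitude and keep the top k."""
        lda = LinearDiscriminantAnalysis().fit(X, y)
        # a binary target yields a single coefficient vector in lda.coef_[0]
        coefs = pd.Series(lda.coef_[0], index=X.columns)
        return coefs.reindex(coefs.abs().sort_values(ascending=False).index)[:k]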

To evaluate results, I looked at PR curves and ROCs for both my model and the USGS/EPA model, comparing performance across all sensible thresholds.

Neither method produced a model that could match the results of the USGS/EPA model on the same 2016 validation data.
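
The threshold sweep itself looks roughly like this (a sketch; assumes each model exposes a predicted probability for every 2016 beach-day):

    import matplotlib.pyplot as plt
    from sklearn.metrics import precision_recall_curve, roc_curve

    def compare_models(y_true, scores_mine, scores_usgs):
        """Overlay ROC and PR curves for my model vs. the USGS/EPA model."""
        fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(10, 4))
        for name, scores, color in [('my model', scores_mine, 'blue'),
                                    ('USGS/EPA', scores_usgs, 'red')]:
            fpr, tpr, _ = roc_curve(y_true, scores)
            prec, rec, _ = precision_recall_curve(y_true, scores)
            ax_roc.plot(fpr, tpr, color=color, label=name)
            ax_pr.plot(rec, prec, color=color, label=name)
        ax_roc.axvline(0.05, color='black')  # the 5% FPR reference line
        ax_roc.set(xlabel='false positive rate', ylabel='true positive rate')
        ax_pr.set(xlabel='recall', ylabel='precision')
        ax_roc.legend()
        plt.show()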

Weather and Water Data Mining

Once I had the 2016 data loaded into the R code, I retried PCA and LDA on all the data we have. Using PCA results to choose predictors has not had a good track record for this project, but using LDA to find predictors has usually helped me create incrementally better models.

LDA results have been consistent across many of the datasets we have tried.

Interestingly, LDA consistently shows that 2 variables in our data have the most predictive power for E. coli levels:

- Year
- 1.daysPrior.precipIntensity

The coefficient difference is pronounced, as you can see in the plot below. The few points that are very far from 0 on the Y axis each represent Year or 1.daysPrior.precipIntensity.

[Plot: LDA coefficients by variable; Year and 1.daysPrior.precipIntensity sit far from 0 on the Y axis]

Year is interesting because there are clear differences in both E. coli levels and prediction results from year to year. Something that changes from year to year must affect E. coli levels, but it does not appear to be captured by our weather or water data.

1.daysPrior.precipIntensity is interesting because it has long been suspected that intense rains can wash live E. coli from beach sands and other land into the lake. In fact, one model I built using 1.daysPrior.precipIntensity and only two other predictors (2.daysPrior.precipIntensity and Beach_code*) produced the best results we have for predicting 2016 data. But still, the USGS/EPA model performs slightly better at the most important thresholds.

The BLUE is my model (with the 3 predictors mentioned above) and the RED is the USGS/EPA model:

[ROC plot: three-predictor model (blue) vs. USGS/EPA model (red)]
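
For the record, a sketch of that three-predictor model (train_14_15 and valid_16 are hypothetical names for the 2014-2015 training frame and the 2016 validation frame):

    from sklearn.ensemble import RandomForestClassifier

    features = ['1.daysPrior.precipIntensity',
                '2.daysPrior.precipIntensity',
                'Beach_code']  # integer 1-20, one per beach (see footnote below)

    model = RandomForestClassifier(n_estimators=500, random_state=0)
    model.fit(train_14_15[features], train_14_15['Escherichia.coli'] > 235)
    scores = model.predict_proba(valid_16[features])[:, 1]  # feeds the ROC above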

Next Steps

What we've learned is that most of the weather and water data we have collected is not terribly predictive. I believe the most important goals going forward should be the two described below.

I am interested in how water levels might affect E. coli levels. Water levels have changed so much that beaches have disappeared. The USGS publishes data online that gives Lake Michigan levels going back to the 1990s. Here is a graphed example of their water-height data:

[Plot: USGS Lake Michigan water-level data going back to the 1990s]

Of course, they are missing data for our outlier year, 2007!
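
If the series pans out, folding it in could be as simple as a yearly average merged onto the beach data (a sketch; the file and column names are hypothetical):

    import pandas as pd

    # hypothetical CSV export of the USGS Lake Michigan water-level data
    levels = pd.read_csv('lake_michigan_levels.csv', parse_dates=['date'])
    levels['Year'] = levels['date'].dt.year

    # one mean level per year, joined as a new predictor; years with no
    # data (like 2007) simply come through as NaN
    yearly = (levels.groupby('Year')['level_ft'].mean()
              .rename('lake_level').reset_index())
    df = df.merge(yearly, on='Year', how='left')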

I am also interested in exploring the beach characteristics that may be correlated with E. coli levels, as @CallinOsborn has observed.

*Beach_code is a number between 1 and 20 assigned to each beach. This gives the model a numeric representation for each beach.

nicklucius commented 7 years ago

Since the original issue is resolved, I am closing this and starting new issues: #107 and #108.