Chicago / clear-water

Forecasting elevated levels of E. coli at Chicago beaches to provide proper warning to beach-goers.
http://chicago.github.io/clear-water

Choose different variables and try PCA and LDA #105

Closed PriyaDoIT closed 7 years ago

nicklucius commented 7 years ago

@CallinOsborn here is the branch of @kbrose's repo that has the PCA filters code that produces the results below. Maybe this can help you. We can talk more tomorrow.

[ROC plot: 2016-10-16t13_02_37_072235--roc]

nicklucius commented 7 years ago

@CallinOsborn is attempting to evaluate the above code's performance on 2016 data.

It appears that PCA trains on all years of weather and E. coli data simultaneously, without leaving a year out. This is done as a preprocessing step before the model is created.

The PCA is then used to filter the training data given to the model during training and leave-one-year-out validation. The concern is that the PCA trains on the same data the model is later validated on: because the PCA runs as a preprocessing step and its output feeds into model training, the model could effectively be handed the answers to the validation data while training.

Ultimately, if we can validate the model against 2016 data while making sure none of the 2016 data is used during PCA training or model training, we can measure real-world predictive performance.
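
To make the concern concrete, here is a minimal sketch of the fix, with the PCA fit inside the validation loop (the column names and threshold match the code further down; the component count and random forest are stand-ins):

    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier

    def loyo_validate(df, feature_cols, start, stop, n_components=10):
        """Leave-one-year-out validation where PCA never sees the held-out year."""
        for yr in range(start, stop + 1):
            holdout = df['Full_date'].dt.year == yr
            train, test = df[~holdout], df[holdout]

            # Fit the PCA filter on the training years only, so nothing from
            # the held-out year leaks into the features the model trains on.
            pca = PCA(n_components=n_components)
            X_train = pca.fit_transform(train[feature_cols])
            X_test = pca.transform(test[feature_cols])

            y_train = train['Escherichia.coli'] > 235
            y_test = test['Escherichia.coli'] > 235

            model = RandomForestClassifier(n_estimators=500)
            model.fit(X_train, y_train)
            print(yr, model.score(X_test, y_test))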

kbrose commented 7 years ago

Honestly don't remember writing any PCA code at all, but if it was not done with a leave-one-year-out approach then it's going to be next to worthless.

nicklucius commented 7 years ago

I am going to try to leave a year out during PCA training to see how that affects performance on 2006-2015. Looks like line 68 is where I can pass the year as an argument to prepare_data():

67       for yr in range(start, stop+1):
68           predictors, meta_info = prepare_data(df)
69           timestamps = meta_info['Full_date']
70           classes = meta_info['Escherichia.coli'] > 235

. . .

115  def prepare_data(df=None, leaveout=None):
116      '''
117      Preps the data to be used in the model. Right now, the code itself must
118      be modified to tweak which columns are included in what way.
119      Parameters
120      ----------
121      df       : Dataframe to use. If not specified, the dataframe is
122               loaded automatically.
123      leaveout : If not None, then this is an integer specifying which
124               year should be left out. Used for preparations that
125               use information across all years.
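
For clarity, the change at line 68 would just thread the loop variable through, assuming prepare_data() wires leaveout into the cross-year preprocessing:

    67       for yr in range(start, stop+1):
    68           predictors, meta_info = prepare_data(df, leaveout=yr)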

nicklucius commented 7 years ago

@kbrose I think it was code you borrowed from someone else to see if you could replicate their results. Your repo is the only place we could find the PCA filters code in a runnable format. Unfortunately, the Slack channel history is gone, so we are trying to resurrect some ideas the team developed back in the spring.

nicklucius commented 7 years ago

I haven't been able to get this PCA code running with leave-one-year-out validation. Adding a leaveout argument to prepare_data() breaks the code, and after chasing down the errors I found that each fix caused another error downstream.

For the weather-only dataset, we have not found a viable or reproducible way to improve on the USGS/EPA model. I don't see a reason to spend a lot of time chasing down these errors because the result will likely be similar to what we've seen so far.

I suggest that we check 2016 results for the model that uses water-sensor data in addition to weather data. It beat the USGS/EPA model (albeit on a very limited sample of data from 2014-2015). It would be interesting to see how it does on 2016 when trained on 2014-2015 data.

The red is the USGS/EPA results on this sample, the blue is our model's results, and the black line marks the 5% false positive rate.

[ROC plot: our model (blue) vs. USGS/EPA (red), with the 5% false positive line in black]

CallinOsborn commented 7 years ago

Do you know what the precision was on the one that uses water-sensor data?

nicklucius commented 7 years ago

The TPR/recall on the ROC is about 50% at a 5% FPR. At that level of recall, the precision is around 60-70%. Keep in mind this is a small sample, so expect a large margin of error.

[ROC plot for the water-sensor model]
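
Those figures are consistent with a fairly high exceedance rate in this sample. A quick sketch of the arithmetic, with the prevalence assumed for illustration:

    # precision from TPR, FPR, and prevalence p:
    #   precision = TPR*p / (TPR*p + FPR*(1-p))
    tpr, fpr, p = 0.50, 0.05, 0.15   # p = assumed fraction of exceedance days
    precision = tpr * p / (tpr * p + fpr * (1 - p))
    print(round(precision, 2))       # 0.64 -- within the 60-70% range above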

CallinOsborn commented 7 years ago

I have uploaded my work for the week here.

I made a couple of observations while looking into possible factors.

nicklucius commented 7 years ago

Water Sensor Model 2016 Results

For 2016, the number of water sensors with data available on the portal decreased from 6 to 3, which limited the amount of data available for modeling.

| Year | Water Sensors with Available Data |
|------|-----------------------------------|
| 2016 | Calumet, Ohio, Montrose |
| 2015 | 63rd, Calumet, Ohio, Osterman, Montrose, Rainbow |
| 2014 | 63rd, Calumet, Ohio, Osterman, Montrose, Rainbow |

Due to this limitation, I tried two methods.

  1. Assign one of the 3 water sensors to each beach
  2. Model and predict only these three beaches

In both cases, I trained a random forest model on 2014-2015 data and validated it using 2016 data. For predictors, I tried many variations using both weather and water-sensor data. Many versions of the model used hand-picked combinations, but I also performed LDA to choose predictors, building models from the 5-10 variables with the largest positive and negative coefficients in my LDA results.
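
A minimal sketch of that selection step (hypothetical helper; assumes X is the predictor DataFrame and y is the binary exceedance label):

    import pandas as pd
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def top_lda_predictors(X: pd.DataFrame, y: pd.Series, k: int = 10) -> pd.Series:
        """Rank predictors by LDA coefficient magnitude and keep the top k."""
        lda = LinearDiscriminantAnalysis().fit(X, y)
        # a binary target yields a single coefficient vector in lda.coef_[0]
        coefs = pd.Series(lda.coef_[0], index=X.columns)
        return coefs.reindex(coefs.abs().sort_values(ascending=False).index)[:k]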

To evaluate results, I looked at PR curves and ROCs for both my model and the USGS/EPA model, comparing performance across all sensible thresholds.

Neither method produced a model that could match the results of the USGS/EPA model on the same 2016 validation data.
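
The threshold sweep itself looks roughly like this (a sketch; assumes each model exposes a predicted probability for every 2016 beach-day):

    import matplotlib.pyplot as plt
    from sklearn.metrics import precision_recall_curve, roc_curve

    def compare_models(y_true, scores_mine, scores_usgs):
        """Overlay ROC and PR curves for my model vs. the USGS/EPA model."""
        fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(10, 4))
        for name, scores, color in [('my model', scores_mine, 'blue'),
                                    ('USGS/EPA', scores_usgs, 'red')]:
            fpr, tpr, _ = roc_curve(y_true, scores)
            prec, rec, _ = precision_recall_curve(y_true, scores)
            ax_roc.plot(fpr, tpr, color=color, label=name)
            ax_pr.plot(rec, prec, color=color, label=name)
        ax_roc.axvline(0.05, color='black')  # the 5% FPR reference line
        ax_roc.set(xlabel='false positive rate', ylabel='true positive rate')
        ax_pr.set(xlabel='recall', ylabel='precision')
        ax_roc.legend()
        plt.show()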

Weather and Water Data Mining

Once I had the 2016 data loaded into the R code, I retried PCA and LDA on all the data we have. Using PCA results to choose predictors has not had a good track record for this project, but using LDA to find predictors has usually helped me create incrementally better models.

LDA results have been consistent across many of the datasets we have tried.

Interestingly, LDA consistently shows that 2 variables in our data have the most predictive power for E. coli levels:

- Year
- 1.daysPrior.precipIntensity

The coefficient difference is pronounced, as you can see in the plot below. The few points that are very far from 0 on the Y axis each represent Year or 1.daysPrior.precipIntensity.

[Plot: LDA coefficients by variable; Year and 1.daysPrior.precipIntensity sit far from 0 on the Y axis]

Year is interesting because there are clear differences in both E. coli levels and prediction results from year to year. Something that changes from year to year must affect E. coli levels, but it does not appear to be captured by our weather or water data.

1.daysPrior.precipIntensity is interesting because it has long been suspected that intense rains can wash live E. coli from beach sands and other land into the lake. In fact, one model I built using 1.daysPrior.precipIntensity and only two other predictors (2.daysPrior.precipIntensity and Beach_code*) produced the best results we have for predicting 2016 data. But still, the USGS/EPA model performs slightly better at the most important thresholds.

The BLUE is my model (with the 3 predictors mentioned above) and the RED is the USGS/EPA model:

[ROC plot: three-predictor model (blue) vs. USGS/EPA model (red)]
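
For the record, a sketch of that three-predictor model (train_14_15 and valid_16 are hypothetical names for the 2014-2015 training frame and the 2016 validation frame):

    from sklearn.ensemble import RandomForestClassifier

    features = ['1.daysPrior.precipIntensity',
                '2.daysPrior.precipIntensity',
                'Beach_code']  # integer 1-20, one per beach (see footnote below)

    model = RandomForestClassifier(n_estimators=500, random_state=0)
    model.fit(train_14_15[features], train_14_15['Escherichia.coli'] > 235)
    scores = model.predict_proba(valid_16[features])[:, 1]  # feeds the ROC above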

Next Steps

What we've learned is that most of the weather and water data we have collected is not terribly predictive. I believe the most important goals going forward should be the two described below.

I am interested in how water levels might affect E. coli levels. Water levels have changed so much that beaches have disappeared. The USGS publishes data online that gives Lake Michigan levels going back to the 1990s. Here is a graphed example of their water-height data:

[Plot: USGS Lake Michigan water-level data going back to the 1990s]

Of course, they are missing data for our outlier year, 2007!
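
If the series pans out, folding it in could be as simple as a yearly average merged onto the beach data (a sketch; the file and column names are hypothetical):

    import pandas as pd

    # hypothetical CSV export of the USGS Lake Michigan water-level data
    levels = pd.read_csv('lake_michigan_levels.csv', parse_dates=['date'])
    levels['Year'] = levels['date'].dt.year

    # one mean level per year, joined as a new predictor; years with no
    # data (like 2007) simply come through as NaN
    yearly = (levels.groupby('Year')['level_ft'].mean()
              .rename('lake_level').reset_index())
    df = df.merge(yearly, on='Year', how='left')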

I am also interested in exploring the beach characteristics that may be correlated with E. coli levels, as @CallinOsborn has observed.

*Beach_code is a number between 1 and 20 assigned to each beach. This gives the model a numeric representation for each beach.

nicklucius commented 7 years ago

Since the original issue is resolved, I am closing this and starting new issues: #107 and #108.