Problem with Derived Metrics

djhocking commented 8 years ago

I ran the model and the RMSE for the calibration data was 0.633. However, when I ran all the derived metrics for the northeast, I got MANY unrealistic and impossible values. For example, thousands of sites have mean July temperatures predicted above 50 C. Some are even above 500 C, and one is 5000 C.

There is no obvious landscape characteristic that causes the unrealistic predictions. Therefore it must be something about either missing covariates (I think these should result in NA, but I will look for a bug in the code) or weather covariates, most likely when they are in complex interactions with landscape covariates. I can't output all predictions, but I should be able to look through at least a subset based on featureid. It would be easiest if I turned the parallel prediction and derived metric code into one or more functions, now that it's "working".

This is the result of a potentially overfit model being used to extrapolate outside the observed multivariate space. That in combination with the model being linear allows values to rocket into unrealistic space. My guess is that precipitation would be the biggest problem because it is not normally distributed and there can be times of massive precipitation. If these weren't included in the model fitting then the linear coefficient would be overestimated/inappropriate. This is probably exacerbated in larger streams because there is an interaction of precip and drainage area and another of airTemp*prcp*AreaSqKM.
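To make the extrapolation argument concrete, here is a minimal sketch in Python (not the package's R code). The coefficient and the standardized covariate values are made up for illustration; only the structure, a product of three standardized covariates like airTemp*prcp*AreaSqKM, comes from the model description above.

```python
def interaction_term(coef, *z_scores):
    """Contribution of one interaction term in a linear model:
    the coefficient times the product of the standardized covariates."""
    term = coef
    for z in z_scores:
        term *= z
    return term

# Hypothetical coefficient and z-scores, chosen only to show the scaling.
# Inside the calibration range each covariate is roughly within a few SDs.
typical = interaction_term(0.5, 1.0, 1.0, 1.0)

# A warm day, an extreme rain event, and a very large drainage area,
# all far outside what the model was fitted on.
extreme = interaction_term(0.5, 2.0, 15.0, 8.0)

print(typical, extreme)
```

Because the term is a product, moving each covariate outside the calibration range multiplies the effect rather than adding to it, which is how a model with a reasonable calibration RMSE can still produce impossible predictions.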

Assuming precipitation is the problem, I'm not sure of the best way to deal with it. I could limit predictions to only use values within the observed range (or observed +/- 10%) and return NA otherwise. I could even do this with all the variables to prevent unrealistic extrapolation, maybe using a larger percentage to allow some extrapolation. In theory, we could incorporate some non-linearity or another way of handling precipitation effects, but I don't want to do that with this version of the model. That would be a longer-term solution.
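A rough sketch of the range-limiting idea, written in Python for illustration rather than in the package's R. The function names, the example numbers, and the reading of "+/- 10%" as 10% of the observed range are all assumptions; None stands in for NA.

```python
def observed_limits(values, pad=0.10):
    """Observed min and max, widened on each side by `pad` times the range.
    Interpreting +/- 10% as a fraction of the range is an assumption."""
    lo, hi = min(values), max(values)
    margin = pad * (hi - lo)
    return lo - margin, hi + margin

def mask_extrapolation(predictions, covariates, limits):
    """Keep a prediction only if every limited covariate is inside its
    allowed interval; otherwise return None (the stand-in for NA)."""
    masked = []
    for pred, row in zip(predictions, covariates):
        inside = all(lo <= row[name] <= hi for name, (lo, hi) in limits.items())
        masked.append(pred if inside else None)
    return masked

# Made-up calibration values and predictions, for illustration only.
calibration_prcp = [0.0, 2.0, 5.0, 10.0]
limits = {"prcp": observed_limits(calibration_prcp)}

rows = [{"prcp": 4.0}, {"prcp": 10.5}, {"prcp": 60.0}]
preds = [18.2, 19.0, 512.0]
masked = mask_extrapolation(preds, rows, limits)
print(masked)
```

Adding more entries to `limits` applies the same check to other variables, and a larger `pad` allows more extrapolation before a prediction is dropped.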

I could also reduce the complexity of the model, particularly the interactions with precipitation. That would reduce the mechanistic nature of the model. Precip should not directly affect water temperature (you can have warm rain or cold rain), but it seems like it would affect water temperature by modulating the effect of air temperature and other heat exchange, hence the interaction of precip and drainage area (combined to represent an index of flow) with air temperature. But this is a statistical model and not a mechanistic model, so maybe I shouldn't worry too much. However, if I don't have an interaction with air temperature, then precip shouldn't really affect temperature in any particular direction, and there doesn't seem to be any point in keeping it in the model.

djhocking commented 8 years ago

Thankfully it turns out that the major problem was in standardizing variables. The new variable impoundArea was added to replace allonnet because allonnet was a percentage and wasn't especially meaningful for stream temperature, since the same percentage could represent impoundments of 1 km^2 or 200 km^2. However, although the new variable was added to the vector of variables, var.names, before modeling, var.names was overwritten within the predictTemp function, and that copy had not been updated. It was not obvious that impoundArea was the problem because its interactions with other variables produced problems seemingly at random across all impoundment areas, depending on the sign of the coefficient and the interacting variable.
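One way to catch this kind of mismatch early is to check, at prediction time, that every covariate the model uses is in the list being standardized. The sketch below is in Python for illustration and is not the predictTemp code. The helper names and data values are hypothetical, and it reflects my reading of the bug: a stale variable list leaves impoundArea on its raw scale.

```python
def fit_scaler(rows, var_names):
    """Store the calibration mean and sample SD for each named variable."""
    stats = {}
    for name in var_names:
        vals = [r[name] for r in rows]
        mean = sum(vals) / len(vals)
        sd = (sum((v - mean) ** 2 for v in vals) / (len(vals) - 1)) ** 0.5
        stats[name] = (mean, sd)
    return stats

def standardize(row, stats, var_names, model_vars):
    """Standardize `var_names`, refusing to run if any model covariate
    would be left on its raw scale."""
    unscaled = sorted(set(model_vars) - set(var_names))
    if unscaled:
        raise ValueError(f"model covariates not standardized: {unscaled}")
    return {n: (row[n] - stats[n][0]) / stats[n][1] for n in var_names}

# Made-up calibration data; the variable names come from the thread.
calib = [{"airTemp": 10.0, "impoundArea": 0.0},
         {"airTemp": 20.0, "impoundArea": 1.0},
         {"airTemp": 30.0, "impoundArea": 2.0}]
model_vars = ["airTemp", "impoundArea"]
stats = fit_scaler(calib, model_vars)
new_row = {"airTemp": 30.0, "impoundArea": 200.0}

# A stale list like the one overwritten inside the prediction function.
stale_names = ["airTemp"]
try:
    standardize(new_row, stats, stale_names, model_vars)
    error = None
except ValueError as exc:
    error = str(exc)

scaled = standardize(new_row, stats, model_vars, model_vars)
print(error)
print(scaled)
```

Failing loudly here is a design choice: a raw impoundArea silently multiplied by coefficients fitted on the standardized scale is exactly what produces wildly wrong predictions without any error.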

djhocking commented 8 years ago

It's still worth considering the amount of extrapolation we're comfortable with. We might want to think about this in relation to the expected changes in temperature and precipitation over the next 50 years.

Conte-Ecology / conteStreamTemperature

Problem with Derived Metrics #34