Prediction function behaving badly

djhocking commented 9 years ago

When I run the code

prepPredictDF <- function(data, coef.list, cov.list, var.name) {
  B <- prepConditionalCoef(coef.list = coef.list, cov.list = cov.list, var.name = var.name)
  B[ , var.name] <- as.character(B[ , var.name])
  data[ , var.name] <- as.character(data[ , var.name])
  df <- left_join(data, B, by = var.name) # merge so can apply/mutate by rows without a slow for loop
  df[ , names(B[-1])][is.na(df[ , names(B[-1])])] <- colMeans(B[-1]) # replace NA with mean
  return(df)
}

It seems to work correctly when var.name = "site" (or "huc") but not when var.name = "year". More precisely, it works when the year is present in both data and B, but not when that year is missing from B (i.e. when trying to predict to unobserved years).

I believe that it's a problem with this line

df[ , names(B[-1])][is.na(df[ , names(B[-1])])] <- colMeans(B[-1])

Rather than the mean of each column in B replacing the NAs in the correspondingly named column of df, the whole vector of column means is recycled down each selected column of df.

> colMeans(B[-1])
intercept.year.B.year            dOY.B.year           dOY2.B.year           dOY3.B.year 
         -0.004604275           0.040347204          -1.662684249           0.025385473 

> foo
Source: local data frame [1,853,710 x 4]

   intercept.year.B.year   dOY.B.year  dOY2.B.year  dOY3.B.year
1           -0.004604275 -0.004604275 -0.004604275 -0.004604275
2            0.040347204  0.040347204  0.040347204  0.040347204
3           -1.662684249 -1.662684249 -1.662684249 -1.662684249
4            0.025385473  0.025385473  0.025385473  0.025385473
5           -0.004604275 -0.004604275 -0.004604275 -0.004604275
6            0.040347204  0.040347204  0.040347204  0.040347204
7           -1.662684249 -1.662684249 -1.662684249 -1.662684249
8            0.025385473  0.025385473  0.025385473  0.025385473
9           -0.004604275 -0.004604275 -0.004604275 -0.004604275
10           0.040347204  0.040347204  0.040347204  0.040347204
..                   ...          ...          ...          ...

All of the first 10 rows of intercept.year.B.year should be -0.004604275, all of those shown in dOY.B.year should be 0.040347204, and so on.
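A small reproduction may make the recycling easier to see. This is only a sketch on invented toy data: B_toy, df_toy, and all the values are assumptions for illustration, not objects from the package, and it uses a plain data.frame rather than the joined output of left_join().

```r
# Toy stand-ins for B and the joined data; all values are invented.
B_toy <- data.frame(year = c("2010", "2011"),
                    intercept = c(1, 3),
                    dOY = c(10, 30),
                    stringsAsFactors = FALSE)
df_toy <- data.frame(year = c("2010", rep("2012", 4)),
                     intercept = c(1, NA, NA, NA, NA),
                     dOY = c(10, NA, NA, NA, NA),
                     stringsAsFactors = FALSE)

mu <- colMeans(B_toy[-1])  # intercept = 2, dOY = 20

# Same pattern as the line in question: the logical matrix picks out the NA
# cells, and the two means are recycled across those cells without regard to
# which column each cell belongs to.
bad <- df_toy
bad[, names(B_toy[-1])][is.na(bad[, names(B_toy[-1])])] <- mu
print(bad)
```

Here the NA rows of both columns alternate between 2 and 20, instead of intercept being filled with 2 and dOY with 20. Replacing with a single 0 works only because a length-one value recycles to the same number everywhere.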

This code is left over from when I was replacing all the values with 0 (which worked), but then I realized that the random effects weren't all centered on 0, so I have to replace them with the mean instead.

Is there a good way to do this when I don't know what the names of the columns will be in advance and the number of columns will change depending on the covariates in the model? The names and lengths will always be available in cov.list and coef.list.
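One possible way to do this without knowing the column names in advance is to look up each NA cell's column position and index the vector of means by it. The following is a minimal sketch, not the package's method: fill_na_with_means() is a hypothetical helper, and it assumes the first column of B is the join key and every other column is numeric.

```r
# Hypothetical helper (not part of conteStreamTemperature).
# Assumes: first column of B is the join key, remaining columns are numeric
# and appear under the same names in df.
fill_na_with_means <- function(df, B) {
  coef_names <- names(B)[-1]
  mu <- colMeans(B[coef_names])
  m <- as.matrix(df[coef_names])
  na_idx <- which(is.na(m), arr.ind = TRUE)  # one (row, col) pair per NA cell
  if (nrow(na_idx) > 0) {
    m[na_idx] <- mu[na_idx[, 2]]             # pick the mean by column position
  }
  df[coef_names] <- as.data.frame(m)
  df
}

# Invented toy data for illustration only.
B_toy <- data.frame(year = c("2010", "2011"),
                    intercept = c(1, 3), dOY = c(10, 30),
                    stringsAsFactors = FALSE)
df_toy <- data.frame(year = c("2010", "2012", "2012"),
                     intercept = c(1, NA, NA), dOY = c(10, NA, NA),
                     stringsAsFactors = FALSE)
filled <- fill_na_with_means(df_toy, B_toy)
print(filled)
```

Because the column names come from names(B), the same call should work however many covariates the model has.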

Sorry if this is confusing. I was realizing what the underlying problem was as I was writing it.

djhocking commented 9 years ago

This seems to work

    for(i in 2:length(names(B))) {
      df[ , names(B[i])][is.na(df[ , names(B[i])])] <- colMeans(B[i])
    }

I have been trying to move away from using for() loops in R since they are so slow, but in this case the loop would never be longer than ~10 (the number of covariates in the model) so maybe it's not a problem.

bletcher commented 9 years ago

Did you try ifelse()? It's vectorised.
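As a rough sketch of how ifelse() might be applied here, each coefficient column can be paired with its own mean using Map(). The toy data and the names B_toy, df_toy, and coef_names are assumptions for illustration; ifelse() is vectorised within a column, but something still has to pair each column with its mean.

```r
# Invented toy data standing in for B and the joined data frame.
B_toy <- data.frame(year = c("2010", "2011"),
                    intercept = c(1, 3), dOY = c(10, 30),
                    stringsAsFactors = FALSE)
df_toy <- data.frame(year = c("2010", "2012", "2012"),
                     intercept = c(1, NA, NA), dOY = c(10, NA, NA),
                     stringsAsFactors = FALSE)

coef_names <- names(B_toy)[-1]        # every column except the join key
mu <- colMeans(B_toy[coef_names])

# Map() walks the columns and the means in parallel; ifelse() then fills the
# NA entries of each column with that column's mean.
df_toy[coef_names] <- Map(function(col, m) ifelse(is.na(col), m, col),
                          df_toy[coef_names], mu)
print(df_toy)
```

Map() is still a loop over columns under the hood, so with about 10 covariates it is unlikely to be noticeably faster than the explicit for() loop; the main difference is style.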


Conte-Ecology / conteStreamTemperature

Prediction function behaving badly #21