Conte-Ecology / conteStreamTemperature

Package for cleaning and analyzing stream daily stream temperature
MIT License
1 stars 1 forks source link

Make Predictions Function More General and Faster #15

Closed djhocking closed 9 years ago

djhocking commented 9 years ago

Currently the summary of the coefficients is slow and the prediction equations are EXTREMELY clunky and would have to be rewritten any time a variable changes. I need to make it more general using matrix multiplication, paralleling the inprod function in JAGS.

djhocking commented 9 years ago

My first attempt will look something like:

avgCoefs <- function(ggs.obj) {
  library(dplyr)
  means <- ggs.obj %>%
    group_by(Parameter = as.character(Parameter))%>%
    summarise(mean=mean(value), sd=sd(value))
}

B.0 <- avgCoefs(ggs.ar1.B.0)

Parameter <- NULL
for(k in 1:K.0){
  Parameter[k] <- paste0('B.0[',k,']')
}
foo <- data.frame(fixed.names = names(data.fixed), Parameter, Order = seq(1, ncol(data.fixed)))
foo$Parameter <- as(foo$Parameter, "character")

str(B.0)
str(foo)

B.fixed <- arrange(left_join(B.0, foo, by = "Parameter"), Order)

# Then Predict:
b1 %*% t(X1) + b2 %*% t(W2)

Pred <- as.numeric(B.fixed$mean %*% t(data.fixed) + B.site$mean %*% t(data.random.sites))

The challenge will be to loop through the specific sites and years and the deployments to appropriately include site, huc, , year and autoregressive terms.

The specific sites/hucs/years could be handled by appending (using left_join function in dplyr) zeros for all sites/hucs/years that don't have a corresponding one in the calibration data set.

bletcher commented 9 years ago

This looks like a good outline. Will be some issues in the details, of course. I just pushed to -westbrook a dynamic way to do the predictions for the cubic terms that uses regular expressions to pull out the indexing. This may be useful here. Look in the chunk on 'predictions of cubic by year'

djhocking commented 9 years ago

I spent far too long trying to get predictions that parallel the form in JAGS and it's still not working. This is currently being tested in the northeast repo in the 5-Summary.Rmd file.

library(rjags)
ar1.mat <- as.matrix(M.ar1)
ar1.mat2 <- apply(ar1.mat, 2, mean)

extractCoef <- function (means.mat, data, family.name, groupIDs = NULL, rand.spec = FALSE) {
  require(dplyr)
  name.vect <- as.character(names(means.mat))
  df <- data.frame(Parameter = name.vect, Mean = means.mat)
  B <- dplyr::filter(df, grepl(family.name, Parameter))
  B$Parameter <- as.character(B$Parameter)
  K <- dim(data)[2]
  if(rand.spec) {
    M <- length(groupIDs)
    Parameter <- NULL
    #Parameter <- matrix(NA, M, K)
    foo <- list()
    for(m in 1:M) {
      for(k in 1:K) {
        Parameter[k] <- paste0(family.name,'[',m,',',k,']')
        }
      temp <- data.frame(param.names = names(data), Parameter, Order = seq(1, ncol(data)))
      temp$Parameter <- as.character(temp$Parameter)
      temp <- left_join(temp, B, by = "Parameter")
      temp.name <- paste0(groupIDs[m])
      foo[[temp.name]] <- temp
      }
    } else {
      Parameter <- NULL
      foo <- NULL
      for(k in 1:K) {
        Parameter[k] <- paste0(family.name,'[',k,']')
        }
      foo <- data.frame(param.names = names(data), Parameter, Order = seq(1, ncol(data)))
      foo$Parameter <- as(foo$Parameter, "character")
      #coef.avg <- avgCoefs(ggs.obj = ggs.obj)
      foo <- arrange(left_join(foo, B, by = "Parameter"), Order)
      }

  return(foo)
  }

B.fixed <- extractCoef(ar1.mat2, data.fixed, "B.0")

hucs <- unique(tempDataSyncS$HUC8)
B.huc <- extractCoef(means.mat = ar1.mat2, data = data.random.sites, family.name = "B.huc", groupIDs = hucs, rand.spec = TRUE)

sites <- unique(tempDataSyncS$site)
B.site <- extractCoef(ar1.mat2, data.random.sites, "B.site", groupIDs = sites, rand.spec = TRUE)

years <- as.character(unique(tempDataSyncS$year))
B.year <- extractCoef(ar1.mat2, data.random.years, "B.year", groupIDs = years, rand.spec = TRUE)

# Then Predict:
system.time(for(i in 1:dim(data.fixed)[1]) {
  tempDataSync$Pred[i] <- as.numeric(B.fixed$Mean %*% t(data.fixed[i, ]) + B.huc[[tempDataSyncS$HUC8[i]]]$Mean %*% t(data.random.sites[i, ]) + B.site[[tempDataSyncS$site[i]]]$Mean %*% t(data.random.sites[i, ]) + B.year[[as.character(tempDataSyncS$year[i])]]$Mean %*% t(data.random.years[i, ]))
  ############ Check order of years!!!!! ##############
  }) # SLOW (40s for just obseved MA sites and days) - but not sure how to do this without for loop and maintaining matrix multiplication like in JAGS - hard to vectorize with nested random effects

ggplot(tempDataSync, aes(temp, Pred)) + geom_point() # not working, the order could be messed up somewhere
djhocking commented 9 years ago

The problem may have to do with filtering and left_joins that shuffle the order of things so that rows no longer line up correctly.

djhocking commented 9 years ago

I still can't get the predict function working. Any help would be appreciated!!!!

https://github.com/Conte-Ecology/conteStreamTemperature_northeast/blob/master/5-model-summary.Rmd

djhocking commented 9 years ago

The predictions are always roughly right (along 1:1 line with observed) but more variable than predictions from JAGS.

Solution/Debug: Deconstruct the function and build up slowly by adding just 1 set of covariates at a time.

djhocking commented 9 years ago

I took out all the random effects and just kept the fixed effects (B.0, and X.0) and was able to recover the predictions perfectly (JAGS and R predictions matched exactly). Next add in the year random effects.

djhocking commented 9 years ago

The problem was that I was using the residuals that included yesterday's AR rather than the residuals from the trend without the AR1. It works perfectly now.