Conte-Ecology / conteStreamTemperature

Package for cleaning and analyzing daily stream temperature
MIT License

predict function #18

Closed (djhocking closed 9 years ago)

djhocking commented 9 years ago

I am trying to convert a prediction function (for predictions conditional on the specific random effects) from a slow for() loop to a vectorized version. In the future, prediction will have to be done piecemeal (in chunks) or only by larger drainage areas. Right now I want to predict for each day of the Daymet record (1980-2013) for the sites in MA with some observed data. That dataframe has 1.8 million rows. The for loop works and uses about 10 GB of RAM. However, it takes about a day to run.

The general idea is as follows:

  pred.test <- NA
  for(i in 1:100){
    pred.test[i] <-  as.matrix(select(df[i, ], one_of(cov.list$site.ef))) %*% as.matrix(t(select(df[i, ], one_of(names(B.site.wide[-1])))))
  }

That works with the dataframe df indexed by row in the for loop. However, if I try to apply the function to every row without a for loop, using either

 pred.test <- apply(df, MARGIN = 1, FUN = as.matrix(select(df, one_of(cov.list$site.ef))) %*% as.matrix(t(select(df, one_of(names(B.site.wide[-1]))))))

or

  Pred <- mutate(df, pred.test = as.matrix(select(df, one_of(cov.list$site.ef))) %*% as.matrix(t(select(df, one_of(names(B.site.wide[-1]))))))

I get the error:

Error: cannot allocate vector of size 25602.0 Gb
In addition: Warning messages:
1: In mutate_impl(.data, dots) :
  Reached total allocation of 32725Mb: see help(memory.size)

Clearly it should not take 25,600 GB to do this. I think it is trying to do every combination of rows or something. Any suggestions on how to vectorize this or apply matrix multiplication based on 2 sets of columns in each row without a for loop?
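
A rough size check suggests where a number of that magnitude could come from. With n rows and k covariates, the first matrix is n x k and the transposed second matrix is k x n, so %*% returns an n x n matrix with one entry for every pair of rows. The sketch below uses small made-up matrices (the names n, k, n.big and size.gb are illustrative, not from the package) and assumes 8-byte doubles and a row count of exactly 1.8 million.

```r
# Toy stand-ins for the covariate and coefficient blocks (values are made up)
n <- 4; k <- 3
m1 <- matrix(rnorm(n * k), nrow = n)   # n x k
m2 <- matrix(rnorm(n * k), nrow = n)   # n x k

# Every row of m1 against every row of m2: an n x n result
full <- m1 %*% t(m2)
dim(full)

# Storage for an n x n matrix of doubles, in units of 1024^3 bytes,
# assuming n is exactly 1.8 million
n.big <- 1.8e6
size.gb <- n.big^2 * 8 / 1024^3
size.gb
```

Under those assumptions the estimate comes out at roughly 24,000 GB, the same order of magnitude as the allocation in the error message; the exact figure depends on the exact row count.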

bletcher commented 9 years ago

Does this by itself work?

pred <- as.matrix(select(df, one_of(cov.list$site.ef))) %*% as.matrix(t(select(df, one_of(names(B.site.wide[-1])))))

If it does, then you could just cbind or merge the result back into df.

If that doesn't work, then I would try doing the selects outside of the %*% step.

djhocking commented 9 years ago

Same error with both of your suggestions:

  m1 <- as.matrix(select(df, one_of(cov.list$site.ef)))
  m2 <- as.matrix(t(select(df, one_of(names(B.site.wide[-1])))))
  Pred <- m1 %*% m2

m1 and m2 are correct. I'll search for vectorized matrix multiplication in R.

djhocking commented 9 years ago

This "works" but replicates the results 10 times (instead of producing a vector it produces a matrix with each row being the same).

  m1 <- as.matrix(select(df[1:10, ], one_of(cov.list$site.ef)))
  m2 <- as.matrix(t(select(df[1:10, ], one_of(names(B.site.wide[-1])))))
  (Pred2 <- apply(m1, 1, "%*%", m2))

I guess this is really what I'm trying to do:

mat1 <- matrix(1:10, nrow=5, ncol=2)
mat2 <- matrix(1:5, nrow=5, ncol=2)
vect <- NA
for(i in 1:nrow(mat1)){
  vect[i] <-  sum(mat1[i, ] * t(mat2[i, ]))
}

but without the for() loop

walkerjeffd commented 9 years ago

what about rowSums(mat1 * mat2)?
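
As a quick check on the toy matrices from the loop example above, the sketch below compares the loop result with rowSums(mat1 * mat2). The names vect.loop and vect.fast are introduced here only for illustration.

```r
mat1 <- matrix(1:10, nrow = 5, ncol = 2)
mat2 <- matrix(1:5, nrow = 5, ncol = 2)

# Loop version: one dot product per row
vect.loop <- numeric(nrow(mat1))
for (i in seq_len(nrow(mat1))) {
  vect.loop[i] <- sum(mat1[i, ] * mat2[i, ])
}

# Vectorized version: elementwise product, then sum across each row
vect.fast <- rowSums(mat1 * mat2)

# TRUE if both approaches give the same vector
all.equal(vect.loop, vect.fast)
```

The elementwise product only ever holds an n x k matrix, so memory grows with the number of rows rather than with its square.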

walkerjeffd commented 9 years ago

this also works: diag(mat1 %*% t(mat2)) but is probably inefficient since it computes so many elements that you don't need
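
A small sketch of why the diag() version matches but scales badly, again on the toy matrices (the name full is illustrative): the full product is n x n, and only its n diagonal entries pair row i with row i.

```r
mat1 <- matrix(1:10, nrow = 5, ncol = 2)
mat2 <- matrix(1:5, nrow = 5, ncol = 2)

# Full product: 5 x 5, all row-by-row combinations
full <- mat1 %*% t(mat2)
dim(full)

# Only the diagonal is needed; it equals the row-wise dot products
all.equal(as.numeric(diag(full)), rowSums(mat1 * mat2))
```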

djhocking commented 9 years ago

Wow!!! rowSums amazing. It did it in a fraction of a second. Thanks Jeff, you're a life saver.
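
For reference, a minimal sketch of how the same idea might be applied to a data frame like the one in the original question. The data and the names cov.cols and coef.cols are hypothetical; they stand in for cov.list$site.ef and names(B.site.wide[-1]), and the sketch assumes the covariate and coefficient columns are listed in matching order.

```r
# Hypothetical data: two covariates and their site-specific coefficients per row
df <- data.frame(
  x1 = c(1, 2, 3), x2 = c(0.5, 1.5, 2.5),
  b.x1 = c(2, 2, 4), b.x2 = c(1, 3, 1)
)
cov.cols  <- c("x1", "x2")       # stands in for cov.list$site.ef
coef.cols <- c("b.x1", "b.x2")   # stands in for names(B.site.wide[-1])

# Both matrices are n x k; no transpose is needed for the row-wise product
m1 <- as.matrix(df[, cov.cols])
m2 <- as.matrix(df[, coef.cols])

# One prediction per row
df$pred.test <- rowSums(m1 * m2)
df$pred.test
```

Note that, unlike the earlier attempts, the second matrix is not transposed here, since the multiplication is elementwise rather than a matrix product.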