RoelVerbelen opened 2 years ago
Apologies, I found my mistake. I didn't account for the fact that, compared to GBMs, you get standard errors with H2O GLM predictions. Selecting only the first predict column solves it:
pred.fun <- function(object, newdata) {
  mean(as.numeric(as.vector(h2o.predict(object, as.h2o(newdata))[[1]])))
}
pdp is still a lot slower than h2o.partialPlot(), though, due to multiple predict calls instead of one. The in.memory argument you proposed here still sounds promising.
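To make the cost concrete, here is roughly what pdp::partial() ends up doing with this pred.fun (a sketch only; prostate_glm, df, and pred.grid refer to the objects used in the benchmark further down):

# Roughly one pred.fun() call per grid value, and hence one as.h2o()
# conversion per call
yhat <- vapply(pred.grid$AGE, FUN.VALUE = numeric(1), FUN = function(age) {
  tmp <- as.data.frame(df)
  tmp$AGE <- age                 # fix the plotting variable at this grid value
  pred.fun(prostate_glm, tmp)    # as.h2o() + h2o.predict() on every call
})
# h2o.partialPlot(), by contrast, keeps the data in H2O and scores in one pass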
Thanks for the note @RoelVerbelen, glad you found the issue! The bigger bottleneck here is probably the multiple calls needed to coerce the data with as.h2o(), so in cases like this I can see the original in.memory argument providing an advantage. The same idea can be used to compute PD plots in Spark (i.e., with SparkR or sparklyr). In fact, the in.memory argument effectively generalizes the following approach to computing PD plots in Spark with sparklyr, which was suggested in this closed issue:
# Load required packages
library(dplyr)
library(pdp)
library(sparklyr)

data(boston, package = "pdp")

sc <- spark_connect(master = "local")
boston_sc <- copy_to(sc, boston, overwrite = TRUE)
rfo <- boston_sc %>% ml_random_forest(cmedv ~ ., type = "auto")

# Define plotting grid
df1 <- data.frame(lstat = quantile(boston$lstat, probs = 1:19/20)) %>%
  copy_to(sc, df = .)

# Remove plotting variable from training data
df2 <- boston %>%
  select(-lstat) %>%
  copy_to(sc, df = .)

# Perform a cross join, compute predictions, then aggregate
par_dep <- df1 %>%
  full_join(df2, by = character()) %>%    # Cartesian product
  ml_predict(rfo, dataset = .) %>%
  group_by(lstat) %>%
  summarize(yhat = mean(prediction)) %>%  # average for partial dependence
  select(lstat, yhat) %>%                 # select plotting variables
  arrange(lstat) %>%                      # for plotting purposes
  collect()

# Plot results
plot(par_dep, type = "l")
You can try this with your h2o example. If you see a significant improvement, then I'd consider revisiting the idea! (Although I think the original implementation used data.table to do the joins and aggregations.)
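For reference, a rough data.table version of that join-and-aggregate pattern (a sketch only, not the original implementation; fit stands in for any model with a predict() method):

library(data.table)

data(boston, package = "pdp")

# Plotting grid and training data without the plotting variable
grid <- data.table(lstat = quantile(boston$lstat, probs = 1:19/20))
rest <- as.data.table(boston)[, !"lstat"]

# Cartesian product via a dummy key, score once, then aggregate
cj <- merge(grid[, k := 1], rest[, k := 1], by = "k",
            allow.cartesian = TRUE)[, k := NULL]
cj[, yhat := predict(fit, newdata = cj)]  # `fit` is a placeholder model
par_dep <- cj[, .(yhat = mean(yhat)), by = lstat][order(lstat)]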
Hi @bgreenwell, thanks for the response!
Absolutely, it's the conversion from R to H2O (using as.h2o() each time) that's making it so slow. The actual scoring in H2O is really fast. This is a bottleneck of pdp::partial() for H2O (and Spark) at the moment. The benefit of doing that conversion only once is massive, even on this silly toy example:
# Built-in h2o.partialPlot() does not need to convert df to H2O
system.time({
  h2o.partialPlot(prostate_glm, df, "AGE")
})
   user  system elapsed 
   0.07    0.00    1.13 
# pdp does as many H2O conversions as nrow(pred.grid)
system.time({
  pd <- pdp::partial(prostate_glm,
                     pred.var = "AGE",
                     pred.grid = pred.grid,
                     pred.fun = pred.fun,
                     train = as.data.frame(df))
})
   user  system elapsed 
  11.89    2.50  100.55 
# Not relying on pdp and only converting once
system.time({
  pd_fast <- pred.grid %>%
    full_join(
      df %>%
        as.data.frame() %>%
        select(-AGE),
      by = character()
    ) %>%
    mutate(yhat = as.vector(h2o.predict(prostate_glm, as.h2o(.))[[1]])) %>%
    group_by(AGE) %>%
    summarise(yhat = mean(yhat)) %>%
    select(AGE, yhat)
})
   user  system elapsed 
   0.56    0.00    4.80 
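For what it's worth, the one-shot pattern generalizes to a small helper along these lines (a hedged sketch; pd_h2o() is hypothetical and not part of pdp):

# Hypothetical helper: a single as.h2o() conversion for any plotting variable
pd_h2o <- function(object, train, pred.var, grid.resolution = 20) {
  grid <- data.frame(unique(quantile(
    train[[pred.var]], probs = seq(0, 1, length.out = grid.resolution)
  )))
  names(grid) <- pred.var
  big <- merge(grid, train[setdiff(names(train), pred.var)])    # cross join
  big$yhat <- as.vector(h2o.predict(object, as.h2o(big))[[1]])  # one conversion
  aggregate(big["yhat"], by = big[pred.var], FUN = mean)
}

# Usage (names as in the benchmark above):
# pd_h2o(prostate_glm, as.data.frame(df), pred.var = "AGE")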
I love how general pdp is, so having this in.memory option available would be a really powerful improvement.
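Concretely, the kind of call this could enable (purely hypothetical; the flag's final name and semantics would be up to the package):

# Hypothetical usage: data stay in H2O (or Spark) and are scored in one pass
pd <- pdp::partial(prostate_glm, pred.var = "AGE", train = df, in.memory = TRUE)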
Sorry for the delay @RoelVerbelen, that is a convincing example! I'll reopen the issue, but not sure when I'll get to it. Should be easy to resurrect the old branch and grab the code.
Oddly, I'm not getting sensible results for GLMs using H2O. Effects for continuous predictors incorrectly look quadratic.
Here's a simple reprex for a Poisson GLM:
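A minimal sketch of the kind of setup being described (the data, model settings, and plotting code here are assumptions, not the original reprex):

library(h2o)
library(pdp)

h2o.init()

# Toy data with a monotone effect on the log scale (assumed, not the original)
set.seed(101)
d <- data.frame(x = runif(500, 0, 10))
d$y <- rpois(500, lambda = exp(0.2 * d$x))

fit <- h2o.glm(x = "x", y = "y", training_frame = as.h2o(d), family = "poisson")

pred.fun <- function(object, newdata) {
  mean(as.numeric(as.vector(h2o.predict(object, as.h2o(newdata))[[1]])))
}

# Expected: a monotone increasing profile; the reported bug is that the
# profile instead looks quadratic
pd <- pdp::partial(fit, pred.var = "x", pred.fun = pred.fun, train = d)
plotPartial(pd)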
Created on 2022-11-03 by the reprex package (v2.0.1)
I'm getting similar issues with Gaussian and binomial GLMs. Using version 0.8.1 from CRAN.