bgreenwell / pdp

A general framework for constructing partial dependence (i.e., marginal effect) plots from various types of machine learning models in R.
http://bgreenwell.github.io/pdp

Add support to Sparklyr #97

Closed abrahamdu closed 5 years ago

abrahamdu commented 5 years ago

Hi,

I have a large data set for training, so I put the data in Spark and use sparklyr for model training. How can I use your package with sparklyr to plot PDPs?

Thanks.

bgreenwell commented 5 years ago

Hi @abrahamdu, computing PDPs for Spark-based ML models is currently out of scope for this package, though I wouldn't rule it out in a future release. Nonetheless, it is quite easy to compute PDPs with sparklyr using a simple join operation combined with a single call to a Spark scoring function (see the sketch after this paragraph). Do you have a reproducible example? If not, let me know what kind of model you are fitting (e.g., regression or classification with gradient boosting) and I can throw together a simple example for you to use.
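
In outline, the recipe is: build a grid of values for the plotting variable, cross join it with the training data minus that variable, score the joined table in Spark, and average the predictions by grid value. A rough sketch, using hypothetical names (grid holds the values of the plotting variable x, train_minus_x is the training data without x, and model is a fitted Spark ML model):

# Rough sketch: compute a PDP for variable x entirely in Spark
grid %>%                                          # one row per value of x
  full_join(train_minus_x, by = character()) %>%  # cartesian product
  ml_predict(model, dataset = .) %>%              # score the joined table in Spark
  group_by(x) %>%
  summarize(yhat = mean(prediction))              # average = partial dependence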

abrahamdu commented 5 years ago

I used the code example shown in your paper and then tried to do the same thing with sparklyr.

library(pdp)
library(randomForest)
library(sparklyr)

data(boston, package = "pdp")

# Fit and plot with an in-memory random forest (this works)
set.seed(101)
boston.rf <- randomForest(cmedv ~ ., data = boston, importance = TRUE)
varImpPlot(boston.rf)
partial(boston.rf, pred.var = "lstat", plot = TRUE)

# Now fit the same kind of model in Spark
sc <- spark_connect(master = "local")
boston_sc <- copy_to(sc, boston, overwrite = TRUE)
boston_rf <- boston_sc %>% as.data.frame()
boston_model <- boston_sc %>% ml_random_forest(cmedv ~ ., type = "auto")
training_result_boston_rf <- ml_predict(boston_model, boston_sc)

# This is the part I can't get to work: partial() doesn't know how to
# score a Spark ML model
partial(boston_model, pred.var = "lstat", train = boston_rf, plot = TRUE, type = "auto")

I'm not sure, though, how to use pdp to draw the plot from the Spark model.

Thanks in advance for your help.

bgreenwell commented 5 years ago

You can definitely use pdp with Spark-based ML models by creating a custom prediction wrapper via the pred.fun argument, though this is not optimal: each call to the wrapper has to shuttle data between R and Spark.
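
For reference, such a wrapper might look something like the following. This is a minimal, untested sketch: the wrapper name spark_pred and the temporary table name pdp_newdata are made up, and it assumes the rfo model and sc connection created in the example further below.

# Untested sketch of a pred.fun wrapper (hypothetical names; assumes `sc`
# and the Spark ML model `rfo` defined below). partial() calls the wrapper
# once per grid point, and each call copies the full training set into
# Spark, which is why this approach is slow.
spark_pred <- function(object, newdata) {
  newdata_sc <- copy_to(sc, newdata, name = "pdp_newdata", overwrite = TRUE)
  preds <- ml_predict(object, newdata_sc) %>% dplyr::pull(prediction)
  mean(preds)  # return the averaged prediction for this grid point
}
partial(rfo, pred.var = "lstat", train = boston, type = "regression",
        pred.fun = spark_pred, plot = TRUE)

If you're doing your work in Spark, though, you should do all of the PDP computations in Spark as well. This is extremely simple using sparklyr and dplyr: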

# Load required packages
library(dplyr)
library(pdp)
library(sparklyr)

data(boston, package = "pdp")

sc <- spark_connect(master = 'local')
boston_sc <- copy_to(sc, boston, overwrite = TRUE)
rfo <- boston_sc %>% ml_random_forest(cmedv ~ ., type = "auto")

# Define plotting grid (19 quantiles of lstat); name the Spark tables
# explicitly so the two copy_to() calls don't collide
df1 <- data.frame(lstat = quantile(boston$lstat, probs = 1:19/20)) %>%
  copy_to(sc, df = ., name = "df1", overwrite = TRUE)

# Remove plotting variable from training data
df2 <- boston %>%
  select(-lstat) %>%
  copy_to(sc, df = ., name = "df2", overwrite = TRUE)

# Perform a cross join, compute predictions, then aggregate
par_dep <- df1 %>%
  full_join(df2, by = character()) %>%  # cartesian product
  ml_predict(rfo, dataset = .) %>%
  group_by(lstat) %>%  
  summarize(yhat = mean(prediction)) %>%  # average for partial dependence
  select(lstat, yhat) %>%  # select plotting variables
  arrange(lstat) %>%  # for plotting purposes
  collect()

# Plot results
plot(par_dep, type = "l")
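
Note that everything up to collect() happens in Spark: the cross join materializes one row per grid value per training row (19 × 506 here), but only the 19 aggregated (lstat, yhat) pairs are pulled back into R for plotting.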

[image: resulting partial dependence plot of yhat versus lstat]

abrahamdu commented 5 years ago

Thanks. This is similar to what I did manually.