bgreenwell / pdp

A general framework for constructing partial dependence (i.e., marginal effect) plots from various types of machine learning models in R.
http://bgreenwell.github.io/pdp

Issues with - in Variable Names #113

Open allicamm opened 3 years ago

allicamm commented 3 years ago

Hi Brandon,

I'm having an issue using pdp for a dataset with dashes in variable names. When I run this line of code: partial(model, train = training_final, pred.var = 'marital_status_Married-civ-spouse', plot = TRUE)

It looks like some code in pdp is dropping the quotes around the name, so the variable name gets cut off at the dash:

Error in eval(expr, envir, enclos) : object 'marital_status_Married' not found

Obviously this could be fixed on my end by changing the variable names before creating my model, but I figured this might be an issue others run into as well.
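For anyone who wants the renaming workaround in the meantime, here is a minimal sketch in base R. The data frame below is a hypothetical stand-in for training_final (the real data isn't shown in this thread), and the replacement character is just one reasonable choice:

```r
# Hypothetical stand-in for the one-hot encoded training data
training_final <- data.frame(
  age = c(39, 50, 38),
  `marital_status_Married-civ-spouse` = c(0, 1, 0),
  check.names = FALSE  # keep the hyphenated name exactly as written
)

# Replace hyphens with underscores so every column name is syntactic
names(training_final) <- gsub("-", "_", names(training_final), fixed = TRUE)

names(training_final)
```

The model would then need to be refit on the renamed data, and pred.var would refer to the new name (here 'marital_status_Married_civ_spouse').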

Thanks!

bgreenwell commented 3 years ago

Thanks @allicamm, I'll try to fix this in the next release!

bgreenwell commented 3 years ago

@allicamm Looks like the issue is in plotPartial() (which relies on lattice graphics and is the default plotting engine whenever plot = TRUE). However, partial() and autoplot() work fine:

library(ggplot2)
library(pdp)
library(xgboost)

trn <- vip::gen_friedman(seed = 101)
X <- data.matrix(subset(trn, select = -y))
y <- trn$y

# Add hyphens to feature names
colnames(X) <- paste0(colnames(X), "-", "test")

# Fit a quick model
fit <- xgboost(X, label = y, nrounds = 50)

# Works
pd <- partial(fit, pred.var = "x1-test", train = X, type = "regression")

# Works
autoplot(pd)
partial(fit, pred.var = "x1-test", train = X, type = "regression", plot = TRUE, plot.engine = "ggplot2")

# Fails
plotPartial(pd)
partial(fit, pred.var = "x1-test", train = X, type = "regression", plot = TRUE)  # plot.engine = "lattice" (this is the default)

Might be tough to fix, but I'll work on it soon. Thanks again for pointing out the issue!
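A likely explanation for the truncated name in the error message, offered as an assumption rather than something confirmed in this thread: if the variable name is pasted unquoted into a formula, R parses the hyphen as subtraction. The small base R sketch below shows that parsing behavior on its own, without pdp or lattice, using the yhat and x1-test names from the example above:

```r
# Unquoted, the hyphen is parsed as the subtraction operator,
# so the formula refers to two separate variables: x1 and test
f_bad <- as.formula("yhat ~ x1-test")
all.vars(f_bad)

# Backtick-quoting keeps the full name as a single symbol
f_ok <- as.formula("yhat ~ `x1-test`")
all.vars(f_ok)
```

If that is indeed the cause, evaluating the first formula would look for an object named x1 and fail in the same way as the 'marital_status_Married' not found error reported above.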