Closed islander22 closed 5 years ago
I have updated some arguments so that columns can be kept or skipped in the modeling process; see the code below. You can update to the latest version of the package from GitHub via devtools::install_github('shichenxie/scorecard').
library(scorecard)
library(data.table)
# data ------
data("germancredit")
dat = setDT(germancredit)[,rowid := .I] # add a rowid column
dt_f = var_filter(dat, y="creditability", var_kp = 'rowid')
dt_list = split_df(dt_f, y="creditability", ratio = 0.6, seed = 30)
label_list = lapply(dt_list, function(x) x$creditability)
# woe binning ------
bins = woebin(dt_f, y="creditability", var_skip = 'rowid')
dt_woe_list = lapply(dt_list, function(x) woebin_ply(x, bins))
# glm ------
m1 = glm( creditability ~ ., family = binomial(), data = dt_woe_list$train)
m_step = step(m1, direction="both", trace = FALSE)
m2 = eval(m_step$call)
# score ------
card = scorecard(bins, m2)
score_list = lapply(dt_list, function(x) scorecard_ply(x, card, var_kp = 'rowid'))
perf_psi(score = score_list, label = label_list, var_skip = 'rowid')
I think it would be easier to flag our own ID column (e.g. person_id in my data) rather than creating a new ROW_ID column. Otherwise we have to join it back to our original IDs at the end anyway. In the intermediate steps, to keep the ROW_IDs, we also have to export the unsplit data (train and test together) and then join it with the final table.
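A minimal sketch of the join-back step described above, in base R with toy data. The person_id values, the scores, and the shape of score_list are made up for illustration. The only thing assumed from the thread is that each scored table still carries a rowid column, as scorecard_ply(..., var_kp = 'rowid') is meant to provide.

```r
# Toy stand-in for the original data; person_id is a hypothetical ID column.
original <- data.frame(
  person_id = c("A17", "B42", "C03", "D88"),
  stringsAsFactors = FALSE
)
original$rowid <- seq_len(nrow(original))  # same idea as rowid := .I

# Toy stand-in for the scored train/test tables, each keyed by rowid.
# The score values are invented, not output from the package.
score_list <- list(
  train = data.frame(rowid = c(1L, 3L), score = c(512, 478)),
  test  = data.frame(rowid = c(2L, 4L), score = c(530, 455))
)

# Stack the splits and remember which split each row came from.
scores <- do.call(rbind, lapply(names(score_list), function(nm) {
  df <- score_list[[nm]]
  df$set <- nm
  df
}))

# Join the scores back to the original IDs through rowid.
scored <- merge(original[, c("person_id", "rowid")], scores, by = "rowid")
print(scored)
```

If the package keeps the real ID column through every step, this join is unnecessary; the sketch only covers the case where a surrogate rowid is the column that survives.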
bins = woebin(dt_f, y="creditability", var_skip = 'rowid')
There is also a problem with the step above:
Error in checkForRemoteErrors(val) :
  75 nodes produced errors; first error: Error in data.table(y = dt[[y]], variable = x_i, value = dt[[x_i]]) : "data.table" not found
In addition: Warning message:
In e$fun(obj, substitute(ex), parent.frame(), e$data) :
  already exporting variable(s): dt, xs, y, breaks_list, special_values, init_count_distr, count_distr_limit, stop_limit, bin_num_limit, method
This issue should now be fixed. Try restarting your R session and running the code again.
The following line does not work; it gives the error shown below it:
m1 = glm( creditability ~ ., family = binomial(), data = dt_woe_list$train)
Error in terms.formula(formula, data = data) : duplicated name 'NA' in data frame using '.'
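One way to narrow this down is to inspect the column names of the data frame passed to glm(), since a formula with `.` expands over names(data) and cannot handle repeated names. The following is a rough base R sketch on a toy data frame; find_bad_names is a hypothetical helper, not part of the scorecard package, and the toy column names are invented to mimic the error message.

```r
# Hypothetical helper: return column names that are missing, empty, or repeated.
find_bad_names <- function(df) {
  nm <- names(df)
  unique(nm[is.na(nm) | nm == "" | duplicated(nm)])
}

# Toy data frame with two columns both literally named "NA".
toy <- data.frame(a = 1:3, b = 4:6, c = 7:9)
names(toy) <- c("creditability", "NA", "NA")

bad <- find_bad_names(toy)
print(bad)

# Making the names unique removes the duplication (a stopgap, not a fix
# for whatever produced the "NA" names in the first place).
names(toy) <- make.unique(names(toy))
print(names(toy))
```

Running find_bad_names(dt_woe_list$train) on the real data would show whether the WOE step produced such columns; if it did, the cause is more likely upstream than in glm() itself.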
It works well in my local environment. Make sure you have installed version 0.2.3 of the package, which was uploaded to CRAN today and can be installed via install.packages('scorecard').
In our model data, we have an ID column (e.g. PERSON_ID). It is the main distinct ID for which we need the scores.
The package drops the ID column. How can we identify it in the code? How does the code know which column is the ID? (It rejects that column at the beginning, assuming that it is a feature.)
In the code I could not see anywhere to specify the ID column. (The ID disappears after this code: dt_sel = var_filter(germancredit, "creditability").)
This causes a problem: we do not know which score belongs to which PERSON_ID (it just gives rows and scores).
I hope my question is clear :)
Maybe the final scorecard code should include the ID column, or the var_filter code could accept an ID column, like: var_filter(germancredit, "creditability", "person_id")
# credit score, only_total_score = FALSE
score_list2 = lapply(dt_list, function(x) scorecard_ply(x, card, only_total_score=FALSE))
Thanks for that great work!