Column ID Disappears - Githubissues

islander22 commented 5 years ago

Our model data has an ID column (e.g. PERSON_ID), the main unique identifier that we need the scores for.

The package drops the ID column. How can we identify it in the code? How does the code know which column is the ID? (It rejects that column at the beginning, assuming that it is a feature.)

I could not find anywhere in the code to specify the ID column. The ID disappears after this call: dt_sel = var_filter(germancredit, "creditability")

This causes a problem: we do not know which score belongs to which PERSON_ID (it just gives rows and scores).

I hope my question is clear :)

Maybe the final scorecard code should include the ID column, or var_filter could accept an ID column, like: var_filter(germancredit, "creditability", "person_id")

# credit score, only_total_score = FALSE
score_list2 = lapply(dt_list, function(x) scorecard_ply(x, card, only_total_score = FALSE))

Thanks for that great work!

ShichenXie commented 5 years ago

I have updated some arguments to keep or skip columns in the modeling process; see the code below. You can update to the latest version of the package from GitHub via devtools::install_github('shichenxie/scorecard').

library(scorecard)
library(data.table)

# data ------
data("germancredit")
dat = setDT(germancredit)[,rowid := .I] # add a rowid column
dt_f = var_filter(dat, y="creditability", var_kp = 'rowid') 
dt_list = split_df(dt_f, y="creditability", ratio = 0.6, seed = 30)
label_list = lapply(dt_list, function(x) x$creditability)

# woe binning ------
bins = woebin(dt_f, y="creditability", var_skip = 'rowid')
dt_woe_list = lapply(dt_list, function(x) woebin_ply(x, bins))

# glm ------
m1 = glm( creditability ~ ., family = binomial(), data = dt_woe_list$train)
m_step = step(m1, direction="both", trace = FALSE)
m2 = eval(m_step$call)

# score ------
card = scorecard(bins, m2)
score_list = lapply(dt_list, function(x) scorecard_ply(x, card, var_kp = 'rowid'))

perf_psi(score = score_list, label = label_list, var_skip = 'rowid')

islander22 commented 5 years ago

I think it would be easier to flag our own ID column (e.g. person_id in my data) rather than creating a new rowid column, because at the end we will have to join it with our original IDs anyway. Also, in the intermediate steps, to keep the rowids we have to export the unseparated (train and test) version and then join it with the final table.
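
To illustrate the point about avoiding a later join, here is a minimal base R sketch in which an existing ID column stays attached to each row through a train/test split and ends up next to the scores. The data, the `person_id` values and the `score_fun` helper are hypothetical stand-ins, not part of the scorecard package. Since `var_kp` in the reply above takes a column name, passing 'person_id' in place of 'rowid' is assumed to work the same way, but this thread does not confirm it.

```r
# Hypothetical data: person_id is an existing ID column, x1 is a feature.
set.seed(30)
dat <- data.frame(
  person_id = sprintf("P%03d", 1:10),
  x1 = rnorm(10),
  stringsAsFactors = FALSE
)

# Split into train/test by row index; the ID column travels with each part.
train_idx <- sample(nrow(dat), size = 6)
dt_list <- list(train = dat[train_idx, ], test = dat[-train_idx, ])

# Stand-in scoring function (NOT scorecard_ply): one score per input row,
# returned in the same row order as the input.
score_fun <- function(x) round(500 + 50 * x$x1)

# Put each part's ID next to its scores; row order is unchanged,
# so no join back to the original table is needed.
score_list <- lapply(dt_list, function(x) {
  data.frame(person_id = x$person_id, score = score_fun(x),
             stringsAsFactors = FALSE)
})

print(score_list$test)
```

Because every part of the split keeps its own ID values, the scored tables can be stacked or exported directly.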

islander22 commented 5 years ago

bins = woebin(dt_f, y="creditability", var_skip = 'rowid')

There is also a problem with the step above:

Error in checkForRemoteErrors(val) :
  75 nodes produced errors; first error: Error in data.table(y = dt[[y]], variable = x_i, value = dt[[x_i]]) : "data.table" not found
In addition: Warning message:
In e$fun(obj, substitute(ex), parent.frame(), e$data) :
  already exporting variable(s): dt, xs, y, breaks_list, special_values, init_count_distr, count_distr_limit, stop_limit, bin_num_limit, method

ShichenXie commented 5 years ago

The issue should be solved now. Try restarting your R session and running the code again.

islander22 commented 5 years ago

The part of the code below does not work and gives the following error:

m1 = glm( creditability ~ ., family = binomial(), data = dt_woe_list$train)

Error in terms.formula(formula, data = data) : duplicated name 'NA' in data frame using '.'
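
The thread does not identify the cause of this error. As a generic check, the message suggests that the data frame passed to glm has more than one column whose name is NA, which the `.` in the formula cannot expand. The sketch below uses a small hypothetical data frame `train` as a stand-in for dt_woe_list$train and shows one way to locate such columns with base R.

```r
# Hypothetical stand-in for dt_woe_list$train with two broken column names.
train <- data.frame(
  creditability = c(0, 1, 0),
  a_woe = c(0.1, -0.2, 0.3),
  b = 1:3,
  c = 4:6
)
names(train)[3:4] <- NA  # simulate columns whose names are NA

# Positions of columns with missing names, and of repeated names.
bad_na  <- which(is.na(names(train)))
dup_pos <- which(duplicated(names(train)))
print(bad_na)
print(dup_pos)

# One possible workaround (an assumption, not the maintainer's fix):
# keep only the properly named columns before fitting the model.
clean <- train[, !is.na(names(train)), drop = FALSE]
print(names(clean))
```

If such columns show up, checking where they were introduced (for example in the WOE conversion step) is likely more useful than simply dropping them.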

ShichenXie commented 5 years ago

It works well in my local environment. Make sure you have installed version 0.2.3 of the package, which was uploaded to CRAN today and can be installed via install.packages('scorecard').

ShichenXie / scorecard

Column ID Disappears #16