ShichenXie / scorecardpy

Scorecard Development in Python, 评分卡 (scorecard)
http://shichen.name/scorecard
MIT License
725 stars 301 forks

sc.woebin_ply for a Single Test Record? #25

Closed: hanzigs closed this issue 5 years ago

hanzigs commented 5 years ago

Is it possible to apply the sc.woebin_ply function to a SINGLE test record using the training bins? It works in R, where the record is converted based on the training bins, but in Python the data frame becomes empty, with a message about unique values.

train_bins = sc.woebin(training_data, y="Target", breaks_list=breaks_adj)
test_data = sc.woebin_ply(test_data, train_bins)

Thanks
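
The "unique values" message suggests a likely mechanism: in a one-row data frame every column trivially holds a single distinct value, so any filter that drops single-valued columns drops all of them. The following is a minimal pandas-only sketch of that effect. The helper name `single_value_columns` and the sample columns are hypothetical, and this is not the package's own code.

```python
import pandas as pd


def single_value_columns(df: pd.DataFrame) -> list:
    """Return the columns that hold at most one distinct value (NaN counted)."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]


# Hypothetical data; the column names are illustrative only.
two_records = pd.DataFrame({"age": [35, 52], "income": [4200.0, 3100.0]})
one_record = two_records.iloc[[0]]  # a one-row data frame

print(single_value_columns(two_records))  # []
print(single_value_columns(one_record))   # ['age', 'income']

# A filter of this kind leaves a one-row frame with no columns at all.
filtered = one_record.drop(columns=single_value_columns(one_record))
print(filtered.shape)  # (1, 0)
```

If the package applies such a filter before looking up the binned variables, a single record would end up with no columns to transform, which would match the behaviour described here.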

ShichenXie commented 5 years ago

It should work.

Toon6115 commented 5 years ago

I also face the same problem in Python. When I run a single record, it shows an error like this: "None of [Index(['XXX'], dtype='object')] are in the [columns]". It seems it cannot find the index.

But when I run two records, it works.
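
Since two records work where one does not, one possible stopgap is to stack the single record under a few reference rows, run the transformation, and keep only the output row for the record. The sketch below is generic: `apply_with_padding` and the stand-in transform `drop_constant_then_double` are hypothetical names, and it assumes the transformation returns one output row per input row in the same order, which is not verified here for sc.woebin_ply.

```python
import pandas as pd


def apply_with_padding(record, reference, transform):
    """Run `transform` on `record` stacked under `reference`; return only the record's rows.

    Assumes `transform` returns one output row per input row, in input order.
    """
    stacked = pd.concat([reference, record], ignore_index=True)
    out = transform(stacked)
    return out.iloc[-len(record):].reset_index(drop=True)


def drop_constant_then_double(df):
    """Stand-in transform: drops single-valued columns, then doubles the rest."""
    keep = [c for c in df.columns if df[c].nunique(dropna=False) > 1]
    return df[keep] * 2


# Hypothetical data; the column names are illustrative only.
reference = pd.DataFrame({"age": [35, 52], "income": [4200.0, 3100.0]})
record = pd.DataFrame({"age": [41], "income": [3900.0]})

direct = drop_constant_then_double(record)  # every column is dropped
padded = apply_with_padding(record, reference, drop_constant_then_double)

print(direct.shape)  # (1, 0)
print(padded)        # one row: age 82, income 7800.0
```

With scorecardpy, the transform would presumably be a call such as `lambda d: sc.woebin_ply(d, train_bins)`, with a sample of the training rows as the reference. The padding only helps if the reference rows vary in every column of interest.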

hanzigs commented 5 years ago

Actually, it's not working, and I don't understand what "should work" means. There is also another package, 'creditR', that does the same job.

Toon6115 commented 5 years ago

I have already fixed it. Use the version below.

# Relies on pandas (pd), numpy (np), and the package's internal helpers
# rep_blank_na, check_print_step and woepoints_ply1.
def scorecard_ply(dt, card, only_total_score=True, print_step=0):
    dt = dt.copy(deep=True)
    # remove date/time col
    # dt = rmcol_datetime_unique1(dt)  # It doesn't work properly. Removed by TSP
    # replace "" by NA
    dt = rep_blank_na(dt)
    # print_step
    print_step = check_print_step(print_step)
    # card # if (is.list(card)) rbindlist(card)
    if isinstance(card, dict):
        card_df = pd.concat(card, ignore_index=True)
    elif isinstance(card, pd.DataFrame):
        # added so that card_df is also defined when card is already a DataFrame
        card_df = card.copy(deep=True)
    # x variables
    xs = card_df.loc[card_df.variable != 'basepoints', 'variable'].unique()
    # length of x variables
    xs_len = len(xs)
    # initial datasets
    dat = dt.loc[:, list(set(dt.columns) - set(xs))]

    # loop on x variables
    for i in np.arange(xs_len):
        x_i = xs[i]
        if print_step > 0 and bool((i+1) % print_step):
            print('step', print_step)
            print(('{:'+str(len(str(xs_len)))+'.0f}/{} {}').format(i, xs_len, x_i))

        cardx = card_df.loc[card_df['variable'] == x_i]
        # score transformation
        dtx_points = woepoints_ply1(dt, cardx, x_i, woe_points="points")
        dat = pd.concat([dat, dtx_points], axis=1)

    # set basepoints
    card_basepoints = list(card_df.loc[card_df['variable'] == 'basepoints', 'points'])[0] if 'basepoints' in card_df['variable'].unique() else 0
    # total score
    dat_score = dat[xs+'_points']
    dat_score.loc[:, 'score'] = card_basepoints + dat_score.sum(axis=1)
    # dat_score = dat_score.assign(score = lambda x: card_basepoints + dat_score.sum(axis=1))
    # return
    if only_total_score:
        dat_score = dat_score[['score']]
    return dat_score

hanzigs commented 5 years ago

Awesome, thank you very much for your efforts.

hanzigs commented 5 years ago

Hi, thanks for the new version of the function, def scorecard_ply(dt, card, only_total_score=True, print_step=0).

The package has not been updated, am I correct? I tried both of the following, and neither installs an updated version, so it is still not working:

pip install scorecardpy
pip install git+git://github.com/shichenxie/scorecardpy.git

hanzigs commented 5 years ago

Hi @ShichenXie @Toon6115, is it possible to raise a PR for the new code and merge it? Thanks

ShichenXie commented 5 years ago

I have updated the package on GitHub.

hanzigs commented 5 years ago

Hi @ShichenXie, I installed from GitHub. Below is the summary:

Name: scorecardpy
Version: 0.1.7.4
Summary: Credit Risk Scorecard
Home-page: http://github.com/shichenxie/scorecardpy
Author: Shichen Xie
Author-email: xie@shichen.name
License: UNKNOWN
Location: c:\programdata\anaconda3\lib\site-packages
Requires: scikit-learn, numpy, pandas, matplotlib
Required-by: 
Note: you may need to restart the kernel to use updated packages.
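
To double-check which version the running interpreter can see, rather than what pip reports from a shell, a small check with the standard library's `importlib.metadata` can help. This is a minimal sketch; the helper name `installed_version` is hypothetical. It reads the installed package metadata, so a module that was already imported before the upgrade still needs a kernel restart to pick up new code.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if scorecardpy is installed in this environment, else None.
print(installed_version("scorecardpy"))
```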

I am still getting the following error:

data_woe = scorecardpy.woebin_ply(test_data, Training_Bins)

[INFO] converting into woe values ...
C:\ProgramData\Anaconda3\lib\site-packages\scorecardpy\condition_fun.py:34: UserWarning: There are 57 columns have only one unique values, which are removed from input dataset. 
 (ColumnNames: .............................................)
  warnings.warn("There are {} columns have only one unique values, which are removed from input dataset. \n (ColumnNames: {})".format(len(unique1_cols), ', '.join(unique1_cols)))

ShichenXie commented 5 years ago

Sorry for the late reply.

Please update the package from GitHub again. The bug should be fixed. If you still have this issue, please provide a reproducible example.