ShichenXie / scorecardpy

Scorecard Development in Python
http://shichen.name/scorecard
MIT License
725 stars, 301 forks

Example has problem of data leakage #110

Closed: nguyenbim closed this issue 1 year ago

nguyenbim commented 1 year ago

In your example, you apply the IV filter and WoE binning to the whole dataset before splitting it into train and test sets. If I understand correctly, this causes data leakage, because the training set created by the split then carries WoE values computed from the whole dataset, including the test rows.

ShichenXie commented 1 year ago

You are right. With your real dataset, you can split at the very beginning of model training, before any filtering or binning. In this example, though, the binning result might be unstable if it were based on such a small training set alone.
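
For readers who want to see the split-first order concretely, here is a minimal sketch in plain Python. It is not scorecardpy's implementation: the helper names `fit_woe` and `apply_woe`, the toy data, the 0.5 smoothing constant, and the sign convention (log of bad share over good share) are all assumptions made for illustration. With scorecardpy itself, the same idea would presumably mean calling `split_df` first, running `var_filter` and `woebin` on the train part only, and then passing those bins to `woebin_ply` for both train and test.

```python
import math
import random
from collections import Counter


def fit_woe(rows, feature, target):
    """Estimate a WoE value per category using only the rows passed in."""
    good = Counter(r[feature] for r in rows if r[target] == 0)
    bad = Counter(r[feature] for r in rows if r[target] == 1)
    n_good, n_bad = sum(good.values()), sum(bad.values())
    cats = set(good) | set(bad)
    woe = {}
    for c in cats:
        # 0.5 smoothing is an illustrative assumption; it avoids log(0).
        p_bad = (bad[c] + 0.5) / (n_bad + 0.5 * len(cats))
        p_good = (good[c] + 0.5) / (n_good + 0.5 * len(cats))
        woe[c] = math.log(p_bad / p_good)
    return woe


def apply_woe(rows, feature, woe, default=0.0):
    """Map each row's category to a fitted WoE; unseen categories get `default`."""
    return [woe.get(r[feature], default) for r in rows]


# Synthetic toy data (not the German credit data used in the example).
rng = random.Random(0)
bad_rate = {"car": 0.2, "education": 0.5, "furniture": 0.3}
data = []
for _ in range(200):
    purpose = rng.choice(sorted(bad_rate))
    data.append({"purpose": purpose, "bad": int(rng.random() < bad_rate[purpose])})

# 1. Split first, before any statistic is computed.
rng.shuffle(data)
cut = int(0.7 * len(data))
train, test = data[:cut], data[cut:]

# 2. Fit the WoE mapping on the training rows only.
woe_map = fit_woe(train, "purpose", "bad")

# 3. Apply the train-fitted mapping to both splits.
train_woe = apply_woe(train, "purpose", woe_map)
test_woe = apply_woe(test, "purpose", woe_map)
```

Because `fit_woe` never sees `test`, changing the test labels cannot change `woe_map`, which is the property the whole-dataset binning in the example lacks. The trade-off mentioned above still applies: bins fitted on a small training split may be noisier.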