ShifuML / shifu

An end-to-end machine learning and data mining framework on Hadoop
https://github.com/ShifuML/shifu/wiki
Apache License 2.0

rebalance data when shuffling data #662

Open huzza opened 5 years ago

huzza commented 5 years ago

A problem that data scientists often struggle with is class imbalance when training models. Sometimes the positive instances are only 2-5% of the whole population. If the data could be rebalanced, the model results may be better, and the score distribution may improve as well.

Usually, there are three ways to rebalance the data:

1. duplicate records of the low-population class;
2. increase the weight of each record in the low-population class;
3. down-sample the high-population class.
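To make the three options concrete, here is a minimal Python sketch. It is not Shifu code: the function names (`oversample`, `reweight`, `downsample`), the record layout with a `label` field, and the factor of 10 are all assumptions chosen for illustration, using a toy dataset with 3% positives.

```python
import random


def oversample(records, is_minority, factor, rng):
    """Option 1: repeat each minority record so it appears `factor` times."""
    out = []
    for r in records:
        out.extend([r] * (factor if is_minority(r) else 1))
    rng.shuffle(out)  # mix the duplicates in with the other records
    return out


def reweight(records, is_minority, factor):
    """Option 2: keep every record once, but give minority records a larger weight."""
    return [(r, float(factor) if is_minority(r) else 1.0) for r in records]


def downsample(records, is_minority, keep_rate, rng):
    """Option 3: keep all minority records, keep majority records with probability keep_rate."""
    return [r for r in records if is_minority(r) or rng.random() < keep_rate]


def is_positive(record):
    # Hypothetical record layout: a dict with a binary "label" field.
    return record["label"] == 1


# Toy data: 30 positives out of 1000 records (3%).
rng = random.Random(0)
records = [{"id": i, "label": 1 if i < 30 else 0} for i in range(1000)]

oversampled = oversample(records, is_positive, 10, rng)
weighted = reweight(records, is_positive, 10)
downsampled = downsample(records, is_positive, 0.1, rng)
```

Options 1 and 2 leave the negatives untouched and give the positives roughly the same total influence (300 duplicated rows versus 30 rows of weight 10), while option 3 shrinks the dataset instead. Which one fits best likely depends on training cost and on whether the trainer supports per-record weights.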

In Shifu, the rebalancing could be done when we shuffle the normalized dataset.