rebalance data when shuffling data

A problem that data scientists often run into when training models is class imbalance. Sometimes positive instances make up only 2-5% of the whole population. If the data can be rebalanced, the model results may improve, and the score distribution may be better as well.

Usually, there are three ways to rebalance the data: 1. duplicate records of the low-population class; 2. increase the weight of each record in the low-population class; 3. down-sample the high-population class.
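The three approaches listed above can be sketched in a few lines of Python. This is only an illustration on in-memory `(label, features)` tuples, not Shifu code; the function names and the `factor` / `keep_rate` parameters are hypothetical.

```python
import random
from collections import Counter


def oversample(records, minority_label, factor):
    """Approach 1: repeat every minority record so it appears `factor` times."""
    out = []
    for label, features in records:
        copies = factor if label == minority_label else 1
        out.extend([(label, features)] * copies)
    return out


def reweight(records, minority_label, factor):
    """Approach 2: keep every record once, but give minority records weight `factor`."""
    return [
        (label, features, float(factor) if label == minority_label else 1.0)
        for label, features in records
    ]


def downsample(records, majority_label, keep_rate, rng):
    """Approach 3: keep each majority record with probability `keep_rate`."""
    return [
        (label, features)
        for label, features in records
        if label != majority_label or rng.random() < keep_rate
    ]


# Toy data set: 5 positives (label 1) and 95 negatives (label 0), i.e. 5% positive.
records = [(1, [i]) for i in range(5)] + [(0, [i]) for i in range(95)]

oversampled = oversample(records, minority_label=1, factor=19)
weighted = reweight(records, minority_label=1, factor=19)
downsampled = downsample(records, majority_label=0, keep_rate=0.1, rng=random.Random(0))

print(Counter(label for label, _ in oversampled))
print(sum(w for label, _, w in weighted if label == 1))
print(Counter(label for label, _ in downsampled))
```

Duplication and reweighting keep all of the information in the majority class but differ in cost: duplication enlarges the data set, while reweighting requires a trainer that honors per-record weights. Down-sampling shrinks the data set at the price of discarding majority records.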

In Shifu, the rebalance function could be added to the step that shuffles the normalized dataset.
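As a rough sketch of why the shuffle step is a convenient place for this: the shuffle already touches every record once, so it can emit each record zero, one, or several times before assigning it a random position. The code below is an assumption-laden illustration in plain Python, not Shifu's implementation; `shuffle_with_rebalance` and the `ratios` mapping are hypothetical names, not Shifu options.

```python
import random


def shuffle_with_rebalance(records, ratios, seed=None):
    """Shuffle `records` while resampling each class by a per-label ratio.

    `ratios` maps label -> expected number of copies per record
    (hypothetical parameter): 3.0 triples a class, 0.5 keeps about half
    of it, and labels missing from the mapping are kept once.
    """
    rng = random.Random(seed)
    keyed = []
    for label, features in records:
        ratio = ratios.get(label, 1.0)
        copies = int(ratio)
        # Handle the fractional part probabilistically, e.g. 2.5 -> 2 or 3 copies.
        if rng.random() < ratio - copies:
            copies += 1
        for _ in range(copies):
            # A random sort key stands in for the random partition/position
            # a distributed shuffle would assign to the record.
            keyed.append((rng.random(), (label, features)))
    keyed.sort(key=lambda pair: pair[0])
    return [record for _, record in keyed]


# Toy data set: 5 positives (label 1) and 95 negatives (label 0).
records = [(1, [i]) for i in range(5)] + [(0, [i]) for i in range(95)]

# Triple the positives, leave the negatives untouched, and shuffle in one pass.
shuffled = shuffle_with_rebalance(records, ratios={1: 3.0}, seed=42)
print(len(shuffled))
```

A ratio above 1 corresponds to duplicating the low-population class and a ratio below 1 to down-sampling the high-population class, so both can be expressed through the same setting.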

ShifuML / shifu, issue #662