VinaTsai / xgboost_notebook


imbalanced data #1

Open VinaTsai opened 4 years ago

VinaTsai commented 4 years ago

max_delta_step

https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html https://xgboost.readthedocs.io/en/latest/parameter.html
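A minimal sketch of how the linked tuning notes are commonly applied to imbalanced binary data. The label counts and the value 1 for `max_delta_step` are illustrative assumptions, not tuned settings.

```python
# Hypothetical label counts for an imbalanced binary problem (assumed values).
n_negative = 9900
n_positive = 100

# Option 1: only the ranking metric (e.g. AUC) matters, so rebalance the
# positive and negative weights through scale_pos_weight.
params_ranking = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "scale_pos_weight": n_negative / n_positive,
}

# Option 2: well-calibrated probabilities matter, so leave the class weights
# alone and cap each leaf update with a finite max_delta_step instead.
params_probability = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "max_delta_step": 1,  # the parameter docs suggest trying values from 1 to 10
}
```

Either dictionary could be passed as the `params` argument of `xgboost.train`; which one fits depends on whether ranking quality or probability quality is the goal.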

VinaTsai commented 4 years ago

base_score

https://xgboost.readthedocs.io/en/latest/parameter.html https://github.com/dmlc/xgboost/issues/799
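A hedged sketch of one way `base_score` is sometimes used on imbalanced data: start boosting from the observed positive rate. The labels are made up. With `binary:logistic`, `base_score` is given as a probability, so it must lie strictly between 0 and 1. Whether this helps depends on the problem, and recent XGBoost releases estimate the value automatically for some objectives.

```python
# Made-up labels: 5 positives out of 100 rows (assumed for illustration).
labels = [1] * 5 + [0] * 95

# Observed positive rate, used as the initial prediction for every row.
positive_rate = sum(labels) / len(labels)

params = {
    "objective": "binary:logistic",
    "base_score": positive_rate,  # a probability, strictly between 0 and 1
}
```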

VinaTsai commented 3 years ago

Rebalancing data sets is not always optimal

Resampling

  1. Undersampling methods
  2. Oversampling methods
  3. Synthetic data generation

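When we use a resampling method (for example, drawing more data from y=0 than from y=1), we show the classifier the wrong proportions of the two classes during training. A classifier learned this way can reach lower accuracy on future real test data than a classifier trained on the unmodified data set. The true class proportions matter for classifying new points, and that information is lost when the data set is resampled.

So, even without rejecting these methods outright, we should use them with care: choosing new proportions on purpose can lead to some relevant approaches (covered in the next section), but rebalancing the classes without further thought about the nature of the problem may be meaningless. In short, when we modify the data set by resampling we are changing reality, so we need to be careful and keep in mind what this means for the classifier's outputs.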
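To make the loss of the true class proportions concrete, here is a small sketch of random undersampling and random oversampling using only the standard library. The function names and toy data are assumptions for illustration; synthetic data generation is not shown.

```python
import random


def _split_by_class(labels):
    """Return (minority_indices, majority_indices) for binary 0/1 labels."""
    pos = [i for i, label in enumerate(labels) if label == 1]
    neg = [i for i, label in enumerate(labels) if label == 0]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)


def undersample(rows, labels, seed=0):
    """Randomly drop majority rows until both classes have the same size."""
    rng = random.Random(seed)
    minority, majority = _split_by_class(labels)
    kept = minority + rng.sample(majority, len(minority))
    return [rows[i] for i in kept], [labels[i] for i in kept]


def oversample(rows, labels, seed=0):
    """Randomly duplicate minority rows until both classes have the same size."""
    rng = random.Random(seed)
    minority, majority = _split_by_class(labels)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    kept = minority + majority + extra
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data (assumed): 10 positives among 1000 rows, i.e. a true rate of 1%.
labels = [1] * 10 + [0] * 990
rows = list(range(len(labels)))

_, y_under = undersample(rows, labels)
_, y_over = oversample(rows, labels)

true_rate = sum(labels) / len(labels)
under_rate = sum(y_under) / len(y_under)
over_rate = sum(y_over) / len(y_over)
```

After either call the classifier would see a 50% positive rate instead of the true 1%, which is exactly the information the paragraph above says is lost.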

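Adding extra features

As the name suggests: add new features so that the classes become easier to separate.

A different approach: reworking the problem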

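Cost-sensitive learning

Original objective function: assumes that the two error types, False Positive and False Negative, have the same cost, and considers only accuracy. But sometimes the cost of a False Negative >> the cost of a False Positive, so in real cases the two error types do not cost the same.

  1. Theoretical minimum cost (new objective function): min(expected prediction cost)
  2. Probability threshold: the classifier's objective function is unchanged; after training, y_hat * weight, adjusting the classifier according to the error costs
  3. Class reweighting: take the asymmetry of the error costs into account during training (for example by resampling), so that the output probabilities already embed the cost information. A threshold of 0.5 is then used as the classification rule.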
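The first two items are linked: minimising the expected cost fixes the probability threshold. As a sketch under assumed costs, let a false negative cost `cost_fn` and a false positive cost `cost_fp` (hypothetical names; these notes call the same quantities P01 and P10). Predicting 0 has expected cost `cost_fn * p` and predicting 1 has expected cost `cost_fp * (1 - p)`, so predicting 1 is cheaper whenever `p > cost_fp / (cost_fp + cost_fn)`.

```python
def min_cost_threshold(cost_fn, cost_fp):
    """Probability above which predicting 1 has the lower expected cost."""
    return cost_fp / (cost_fp + cost_fn)


def predict_min_cost(p, cost_fn, cost_fp):
    """Pick the label with the lower expected cost, given p = P(y = 1 | x)."""
    expected_cost_if_0 = cost_fn * p          # wrong only when y is really 1
    expected_cost_if_1 = cost_fp * (1.0 - p)  # wrong only when y is really 0
    return 1 if expected_cost_if_1 < expected_cost_if_0 else 0


# Assumed costs: missing a positive is 9 times worse than a false alarm.
threshold = min_cost_threshold(cost_fn=9.0, cost_fp=1.0)
```

With equal costs the threshold falls back to 0.5, which is the accuracy-only setting described at the start of this section.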

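e.g.

  1. Neural network classifier: theoretical minimum cost
  2. Bayes classifier: class reweighting
    1. Assume the cost is P01 when the true label is 1 and the prediction is 0
    2. Assume the cost is P10 when the true label is 0 and the prediction is 1
    3. Assume P01 and P10 satisfy 0 < P10 << P01
    4. Either oversample the minority class by a factor of P01/P10 (multiply the cardinality of the minority class by P01/P10)
    5. Or undersample the majority class by a factor of P10/P01 (multiply the cardinality of the majority class by P10/P01)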
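Steps 4 and 5 are two ways of reaching the same class ratio, so applying one of them is enough. The sketch below computes the resampled class sizes for each option. The function name, the counts and the costs are assumptions, and the minority class is assumed to be class 1.

```python
def resampled_sizes(n_minority, n_majority, p01, p10, mode):
    """Class sizes after cost-based resampling (requires 0 < P10 < P01)."""
    if not 0 < p10 < p01:
        raise ValueError("expected 0 < p10 < p01")
    if mode == "oversample":
        # Multiply the minority cardinality by P01 / P10.
        return round(n_minority * p01 / p10), n_majority
    if mode == "undersample":
        # Multiply the majority cardinality by P10 / P01.
        return n_minority, round(n_majority * p10 / p01)
    raise ValueError("mode must be 'oversample' or 'undersample'")


# Assumed example: 100 minority rows, 9900 majority rows, P01 = 9, P10 = 1.
over = resampled_sizes(100, 9900, p01=9, p10=1, mode="oversample")
under = resampled_sizes(100, 9900, p01=9, p10=1, mode="undersample")
```

Both results have the same minority-to-majority ratio (1:11 here instead of the original 1:99). Training on either resampled set and cutting at 0.5 is intended to act like cutting at P10 / (P10 + P01) on the original data.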

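Conclusion

The core ideas of this article are:

  1. When using machine learning algorithms, we must choose the model's evaluation metrics carefully: we have to use metrics that give a better picture of how well the model achieves our goal;
  2. When dealing with an imbalanced data set, if the classes are not well separated by the given variables and our goal is the best possible accuracy, the resulting classifier may be just a naive classifier that always predicts the majority class;
  3. Resampling methods can be used but must be thought through carefully: they should not be used as standalone solutions and must be combined with a rework of the problem to serve a specific goal;
  4. Reworking the problem itself is often the best way to tackle an imbalanced-class problem: the classifier and the decision rule must be set according to the goal.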
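Points 1 and 2 can be illustrated with a toy calculation: a classifier that always predicts the majority class scores very well on accuracy and fails completely on recall. The data and helper names are assumptions for illustration.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def recall(y_true, y_pred):
    """Share of true positives that were found (0.0 if there are none)."""
    positives = sum(1 for t in y_true if t == 1)
    found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return found / positives if positives else 0.0


# Assumed data: 10 positives among 1000 rows, and a naive classifier
# that predicts the majority class (0) for every row.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000

naive_accuracy = accuracy(y_true, y_pred)
naive_recall = recall(y_true, y_pred)
```

Here `naive_accuracy` is 0.99 while `naive_recall` is 0.0, so accuracy alone says nothing about whether the minority class is ever detected.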

https://zhuanlan.zhihu.com/p/56960799

VinaTsai commented 3 years ago

https://www.zhihu.com/question/323518703/answer/678887717

VinaTsai commented 3 years ago

https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets

Processes

  1. Scale the features
  2. Create sub-samples: randomly draw as many non-fraud cases as there are fraud cases
  3. Split the original data into training and testing sets: train on the sub-sample and test on the original sample
  4. Correlation matrices:
    1. pick out the variables that correlate positively and negatively with the target
    2. delete (extreme) outliers with the Interquartile Range Method, trying different thresholds to see how they affect accuracy
  5. Dimensionality Reduction and Clustering: t-SNE, PCA...
  6. Classifiers (undersampling): accuracy, learning curve, ROC
  7. SMOTE (over-sampling): apply it during CV, not before. Applied before CV, synthetic points end up in both the training and the validation folds, so the validation set is influenced by generated data ("data leakage") and the scores hide overfitting. Applied during CV, only the training folds are resampled and the validation fold stays untouched.
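Step 7 can be sketched as follows. This is not the Kaggle notebook's code: `smote_like` is a simplified stand-in for SMOTE written with the standard library, the toy data is made up, and the classifier is left out. What it shows is that oversampling happens inside each fold, on the training rows only.

```python
import random


def smote_like(minority, n_new, k=3, rng=None):
    """Simplified SMOTE: interpolate between a minority point and one of
    its k nearest minority neighbours to create n_new synthetic points."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` (squared Euclidean distance).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position on the segment between the two points
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, other)))
    return synthetic


def folds_with_oversampling(X, y, n_folds=5, seed=0):
    """Yield (X_train, y_train, X_val, y_val) per stratified fold, with the
    positive class (assumed to be the minority) oversampled in training only."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    for f in range(n_folds):
        val_idx = pos[f::n_folds] + neg[f::n_folds]
        held_out = set(val_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        # Synthetic points are built from training rows only.
        minority = [x for x, label in zip(X_train, y_train) if label == 1]
        n_new = max(0, (len(y_train) - len(minority)) - len(minority))
        new_points = smote_like(minority, n_new, rng=rng)
        yield (X_train + new_points, y_train + [1] * n_new,
               [X[i] for i in val_idx], [y[i] for i in val_idx])


# Toy data (assumed): 90 negatives near (0, 0) and 10 positives near (3, 3).
data_rng = random.Random(42)
X = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(90)]
X += [(data_rng.gauss(3, 1), data_rng.gauss(3, 1)) for _ in range(10)]
y = [0] * 90 + [1] * 10

folds = list(folds_with_oversampling(X, y))
```

Each validation fold contains only original rows at the original class ratio, so the score measured on it is not inflated by synthetic points.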

References

Interquartile Range Method:

  1. Visualize Distributions: visualize the distributions of the features that correlate strongly with the target variable, then use these features to eliminate some of the outliers.
  2. Determining the threshold: compute iqr = q75 - q25 and choose a multiplier for it (the lower the multiplier, the more outliers are removed); the lower extreme threshold is q25 - multiplier × iqr and the upper extreme threshold is q75 + multiplier × iqr.
  3. Conditional Dropping: an instance is removed if it falls beyond the threshold at either extreme.
  4. Boxplot Representation: use a boxplot to check that the number of "extreme outliers" has been reduced considerably.
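A minimal sketch of the threshold and dropping steps, using the standard library. The function name, the default multiplier of 1.5 and the sample values are assumptions for illustration.

```python
import statistics


def remove_outliers_iqr(values, multiplier=1.5):
    """Drop values outside [q25 - multiplier*iqr, q75 + multiplier*iqr]."""
    # method="inclusive" interpolates linearly between the sorted data points.
    q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q75 - q25
    cut_off = multiplier * iqr
    lower, upper = q25 - cut_off, q75 + cut_off
    kept = [v for v in values if lower <= v <= upper]
    return kept, lower, upper


# Made-up feature values with one obvious extreme value.
values = [10, 11, 12, 13, 14, 15, 16, 17, 100]
kept, lower, upper = remove_outliers_iqr(values)
```

For these values q25 is 12 and q75 is 16, so the thresholds are 6 and 22 and only the value 100 is dropped. Lowering `multiplier` narrows the band and removes more points.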

t-SNE

https://zhuanlan.zhihu.com/p/28967965

Learning Curve

https://blog.csdn.net/m0_37870649/article/details/79810542