Closed: slsongge closed this issue 4 years ago.
Try a more complex model, such as xgboost, and increase the sample size; try both. Could you share code to reproduce the problem?
From: Xiaosong.Jin, 2019-11-19 (Tue), subject: [JiaxiangBU/tutoring2] Binary classification with very low predictor discriminability (#14)
Problem summary
In a class-balanced binary classification problem (predicting whether a disease is benign or malignant), the predictors have very low discriminative power, so the model performs poorly. Is there any way to address this?
Data
The data are very similar to those at https://www.kaggle.com/mirichoi0218/classification-breast-cancer-or-not-with-15-ml/data, except that every predictor has very low discriminative power. Kernel density plots look like the figure below; almost all predictors resemble the case the arrow points to:
I then tried a decision tree (after dropping roughly 1,000 rows with high missingness; those rows could also be imputed later), with 1,000 training samples and 517 test samples. The confusion matrix is:
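One way to put a number on "low discriminative power" is to score each variable on its own: the AUC of a single feature against the label is 0.5 when the feature carries no signal and 1.0 when it separates the classes perfectly. A minimal pure-Python sketch with made-up numbers (not the actual data):

```python
def single_feature_auc(values, labels):
    """AUC of one feature against a binary label, computed as the
    Mann-Whitney pairwise win rate (no libraries needed)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Synthetic illustration: one strongly and one weakly separating feature.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
strong = [1, 2, 3, 4, 5, 6, 7, 8]   # classes perfectly ordered
weak   = [1, 3, 2, 4, 2, 4, 3, 5]   # heavy class overlap

print(single_feature_auc(strong, labels))  # 1.0 (clean separation)
print(single_feature_auc(weak, labels))    # ~0.72 (heavy overlap)
```

Features whose single-variable AUC sits near 0.5 match the overlapping density curves described above; if nearly all of them do, no model class can recover much more than their combined weak signal.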
Looking at the iteration log, this is not overfitting (test AUC tracks train AUC throughout):
iter | train_auc | test_auc |
---|---|---|
1 | 0.559419 | 0.588944 |
2 | 0.573420 | 0.581689 |
3 | 0.609417 | 0.606254 |
4 | 0.598823 | 0.596964 |
5 | 0.607483 | 0.601888 |
6 | 0.612072 | 0.626301 |
7 | 0.611898 | 0.623196 |
8 | 0.613925 | 0.623500 |
9 | 0.614823 | 0.628464 |
10 | 0.617525 | 0.632278 |
11 | 0.615231 | 0.631768 |
12 | 0.615144 | 0.631504 |
13 | 0.612816 | 0.627841 |
14 | 0.619409 | 0.638008 |
15 | 0.623900 | 0.646101 |
16 | 0.626621 | 0.647370 |
17 | 0.627725 | 0.647912 |
18 | 0.627954 | 0.643938 |
19 | 0.637000 | 0.649764 |
20 | 0.641395 | 0.648064 |
21 | 0.643176 | 0.643914 |
22 | 0.643985 | 0.643906 |
23 | 0.646757 | 0.646093 |
24 | 0.647195 | 0.648247 |
25 | 0.648148 | 0.651863 |
26 | 0.649506 | 0.649796 |
27 | 0.651981 | 0.653914 |
28 | 0.651485 | 0.651775 |
29 | 0.651646 | 0.649413 |
30 | 0.651575 | 0.647785 |
31 | 0.653940 | 0.650642 |
32 | 0.661011 | 0.653642 |
33 | 0.665402 | 0.653259 |
34 | 0.663553 | 0.652908 |
35 | 0.663839 | 0.653211 |
36 | 0.664053 | 0.652517 |
37 | 0.663491 | 0.656340 |
38 | 0.664640 | 0.660953 |
39 | 0.666295 | 0.659660 |
40 | 0.665239 | 0.659564 |
41 | 0.664558 | 0.661926 |
42 | 0.666610 | 0.662277 |
43 | 0.666789 | 0.664719 |
44 | 0.669662 | 0.667497 |
45 | 0.670305 | 0.661415 |
46 | 0.669979 | 0.659931 |
47 | 0.669090 | 0.660123 |
48 | 0.671331 | 0.662900 |
49 | 0.673101 | 0.666188 |
50 | 0.675054 | 0.669252 |
51 | 0.675366 | 0.673291 |
52 | 0.674079 | 0.674017 |
53 | 0.674336 | 0.674129 |
54 | 0.673587 | 0.671886 |
55 | 0.675190 | 0.674440 |
56 | 0.676073 | 0.674679 |
57 | 0.677647 | 0.673769 |
58 | 0.678047 | 0.671359 |
59 | 0.677472 | 0.673961 |
60 | 0.678168 | 0.673594 |
61 | 0.677509 | 0.673179 |
62 | 0.676798 | 0.669891 |
63 | 0.676913 | 0.668007 |
64 | 0.679253 | 0.669667 |
65 | 0.679503 | 0.671120 |
66 | 0.680311 | 0.671407 |
67 | 0.680283 | 0.673307 |
68 | 0.680486 | 0.674057 |
69 | 0.681803 | 0.672923 |
70 | 0.680345 | 0.670705 |
71 | 0.680934 | 0.672445 |
72 | 0.680644 | 0.673163 |
73 | 0.680076 | 0.675573 |
74 | 0.680373 | 0.679420 |
75 | 0.680970 | 0.680266 |
76 | 0.681469 | 0.681638 |
77 | 0.680935 | 0.683713 |
78 | 0.681524 | 0.686283 |
79 | 0.681483 | 0.685150 |
80 | 0.680947 | 0.686778 |
81 | 0.682491 | 0.687767 |
82 | 0.682459 | 0.683522 |
83 | 0.682509 | 0.681120 |
84 | 0.682321 | 0.680872 |
85 | 0.682686 | 0.681223 |
86 | 0.683779 | 0.680904 |
87 | 0.683840 | 0.683410 |
88 | 0.684222 | 0.683841 |
89 | 0.683249 | 0.686203 |
90 | 0.683629 | 0.686123 |
91 | 0.683561 | 0.687464 |
92 | 0.683820 | 0.687033 |
93 | 0.684481 | 0.686858 |
94 | 0.683709 | 0.689555 |
95 | 0.684023 | 0.691614 |
96 | 0.684343 | 0.692364 |
97 | 0.684136 | 0.689044 |
98 | 0.684651 | 0.689124 |
99 | 0.685240 | 0.690305 |
100 | 0.684996 | 0.688853 |
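A log like the one above can be read mechanically with the same rule `early_stopping_rounds` applies: track the best test AUC and stop once it has not improved for a fixed number of rounds. Note also that test AUC stays at or above train AUC here, which points to underfitting (weak features), not overfitting. A small sketch on the last ten test-AUC values from the table:

```python
def best_round(test_auc, patience=5):
    """Return (best_iter, best_score) under simple early stopping:
    stop once the test metric has not improved for `patience` rounds."""
    best_i, best = 0, float("-inf")
    for i, score in enumerate(test_auc, start=1):
        if score > best:
            best_i, best = i, score
        elif i - best_i >= patience:
            break
    return best_i, best

# Test-AUC values from iterations 91-100 of the log above.
tail = [0.687464, 0.687033, 0.686858, 0.689555, 0.691614,
        0.692364, 0.689044, 0.689124, 0.690305, 0.688853]
print(best_round(tail, patience=3))  # (6, 0.692364): iteration 96 above
```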
library(xgboost)

set.seed(45)                    # seed is not an xgb.train() argument in R
params <- list(
  objective = "binary:logistic",
  eval_metric = "auc",          # eval.metric is the legacy spelling
  eta = 0.1,
  max_depth = 3,
  min_child_weight = 17,
  # gamma = 0.72,
  subsample = 0.1,
  colsample_bytree = 0.1,
  nthread = 8
)
model <- xgb.train(
  params = params,
  data = xgboost_dtrain,
  nrounds = 100,                # nrounds is the documented argument name
  watchlist = watchlist,
  early_stopping_rounds = 50
  # nfold = 10 belongs to xgb.cv(), not xgb.train()
)
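For anyone working from Python instead of R, the same settings map onto a parameter dict along these lines (a sketch: the `dtrain`/`dtest` DMatrix objects are assumed and not shown, the seed moves into the dict, and `nfold` would belong to `xgboost.cv`):

```python
# Parameter dict mirroring the R call above. The regularization comes
# from shallow trees, a high min_child_weight, and aggressive row and
# column subsampling, rather than from gamma (left commented out).
params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "eta": 0.1,
    "max_depth": 3,
    "min_child_weight": 17,
    # "gamma": 0.72,
    "subsample": 0.1,
    "colsample_bytree": 0.1,
    "seed": 45,
    "nthread": 8,
}
# With real DMatrix objects, training would look like (not run here):
# bst = xgboost.train(params, dtrain, num_boost_round=100,
#                     evals=[(dtrain, "train"), (dtest, "test")],
#                     early_stopping_rounds=50)
print(sorted(params))
```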
These are the hyperparameters, @slsongge; you can use them as a reference. They take care of the overfitting.
@slsongge The iteration curves climb gradually, so this looks fine.