Decision Trees in R

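A decision tree is a tree-shaped decision diagram annotated with probabilistic outcomes; it is an intuitive graphical method for applying statistical probability analysis. In machine learning, a decision tree is a predictive model that represents a mapping from object attributes to object values. Each internal node tests a condition on an attribute, each branch holds the objects that satisfy that condition, and each leaf node gives the predicted result for the objects that reach it.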

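This section uses the function ctree() from the party package to build a decision tree for the iris dataset. The attributes Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width (the length and width of the sepals and petals) are used to predict the Species of an iris flower. In this package, ctree() builds the decision tree and predict() makes predictions for another dataset.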

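Before building a model, the iris dataset is split into two subsets: a training set (70%) and a test set (30%). Setting a random seed fixes the random numbers, so the randomly selected rows are reproducible. Note that the ctree() call shown later in this section is fitted on the full iris dataset (150 observations), not on the training subset.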
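The split described above can be sketched in base R as follows. The seed value 1234 and the names ind, train_data, and test_data are illustrative assumptions, not taken from the original text. Because each row is assigned at random with probabilities 0.7 and 0.3, the resulting sizes are only approximately 70% and 30%.

```r
# Fix the seed so the random split is reproducible (1234 is an arbitrary choice).
set.seed(1234)

# Label each row 1 (training) or 2 (test) with probabilities 0.7 and 0.3.
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))

train_data <- iris[ind == 1, ]  # roughly 70% of the rows
test_data  <- iris[ind == 2, ]  # the remaining rows

# The two subsets together contain all 150 original observations.
nrow(train_data) + nrow(test_data)
```

Re-running the block with the same seed selects the same rows; changing the seed gives a different split.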

Load the library needed to build the decision tree

> library("party")  # load the party package

Inspect the data used to build the decision tree

> str(iris)  # compactly display the structure of the dataset
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Build the decision tree with the function ctree(). The first argument is a formula that defines the target variable (left of ~) and the list of independent variables (right of ~).

> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)
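As a small illustration of the formula argument, R formulas also accept "." as shorthand for every column of the data other than the response. The sketch below assumes the party package is installed; the names tree_explicit and tree_dot are illustrative only.

```r
library("party")

# Explicit formula, as in the text.
tree_explicit <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                       data = iris)

# Shorthand: "." stands for all columns of `data` except the response.
tree_dot <- ctree(Species ~ ., data = iris)

# Compare the terminal node assigned to each observation by the two trees.
same_nodes <- identical(predict(tree_explicit, type = "node"),
                        predict(tree_dot, type = "node"))
same_nodes
```

Since iris has exactly these four predictors besides Species, the two formulas describe the same model.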

Inspect the details of the decision tree

>print(iris_ctree)

Printed output of the fitted decision tree

Conditional inference tree with 4 terminal nodes

Response:  Species 
Inputs:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width 
Number of observations:  150 

1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
  2)*  weights = 50 
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
    4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
      5)*  weights = 46 
    4) Petal.Length > 4.8
      6)*  weights = 8 
  3) Petal.Width > 1.7
    7)*  weights = 46
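In the printout, lines marked with * are terminal nodes, and weights is the number of observations that end up in each of them. A hedged sketch of how these counts can be reproduced, assuming the party package is installed (the name node_id is illustrative):

```r
library("party")

iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                    data = iris)

# Terminal node id for each of the 150 observations used to fit the tree.
node_id <- predict(iris_ctree, type = "node")

# Observations per terminal node; these counts are the "weights" in the printout.
node_counts <- table(node_id)
node_counts

# Species composition of each terminal node.
table(node = node_id, species = iris$Species)
```

The second table shows how pure each leaf is, which is the same information the plots of the tree display per leaf.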

Plot the fitted decision tree

>plot(iris_ctree)

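Figure 1: The unpruned decision tree

# Draw the same tree in a simplified form. type = "simple" only changes how the tree is displayed; it does not prune it.
# ctree() limits overfitting by splitting only where the association test is statistically significant, so that the tree reflects the underlying structure of the data and stays meaningful in practice.
> plot(iris_ctree, type = "simple")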

Figure 2: The decision tree drawn with type = "simple"

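In Figure 1, the bar plot in each leaf node shows the probability of an observation in that leaf belonging to each of the three species. In Figure 2, these probabilities are shown as the y values in each leaf node. For example, node 2 is labeled "n=50 y=(1,0,0)", which means that the node contains 50 observations and all of them belong to the first class, setosa.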
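The introduction mentions using predict() on another dataset. A minimal sketch of that step, assuming the party package is installed, is to fit the tree on a training subset and predict the held-out rows. The seed 1234 and the names train_data, test_data, train_ctree, test_pred, and accuracy are assumptions made for illustration.

```r
library("party")

# Reproducible, approximately 70/30 split (the seed is an arbitrary choice).
set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
train_data <- iris[ind == 1, ]
test_data  <- iris[ind == 2, ]

# Fit the tree on the training rows only.
train_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                     data = train_data)

# Predict the species of the held-out rows.
test_pred <- predict(train_ctree, newdata = test_data)

# Cross-tabulate predicted against actual species.
table(predicted = test_pred, actual = test_data$Species)

# Share of test rows classified correctly.
accuracy <- mean(as.character(test_pred) == as.character(test_data$Species))
accuracy
```

Evaluating on rows the tree has not seen gives a fairer picture of its performance than checking it against the data it was fitted on.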
