bethatkinson / rpart

Recursive Partitioning and Regression Trees
43 stars 23 forks source link

How to split for categorical variable #41

Closed JiaqiHu2021 closed 2 years ago

JiaqiHu2021 commented 2 years ago

In Binary Decision Tree there 2^(k-1) -1 possible split for categorical variable with k values, how is the optimal split selected?

bgreenwell commented 2 years ago

@JiaqiHu2021 there’s a computational shortcut that can be exploited for rpart’s built in splitting rules (e.g., Anova, Poissin, and gini) that avoids the need to search through all possible 2^(k-1)-1 splits. These methods are described in Section 5 of the User Written Splitting Functions vignette.

JiaqiHu2021 commented 2 years ago

Thank you very much! Is there any formal proofs about these splitting rules(gini, poisson)? I know for categorical variable x and continuous y, the proof is in Breiman(1984), but I would not find other material.

bgreenwell commented 2 years ago

I think the original (and more general) proof is due to Fisher. You can find another proof in the tree chapter of Ripley’s Pattern Recognition book. I think he provides additional references as well.

JiaqiHu2021 commented 2 years ago

Thank you for your help!