Rambatino / CHAID

A python implementation of the common CHAID algorithm
Apache License 2.0

missing value in ordinal feature and bonferroni adjustment #124

Closed binkmust closed 2 years ago

binkmust commented 3 years ago

Sorry to trouble you. First, thank you for your CHAID project.

Have you read this PDF (http://www.gad-allah.com/MBA%202010%20Ain%20Shames%20Univesity/Statistics/spss13/Algorithms/TREE-CHAID.pdf)?

As the PDF describes:

  1. The adjusted p-value is calculated as the p-value times a Bonferroni multiplier.
  2. For ordinal predictors, the algorithm first generates the best set of categories using all non-missing information from the data. Next, the algorithm identifies the category that is most similar to the missing category. Finally, the algorithm decides whether to merge the missing category with its most similar category or keep the missing category as a separate category. Two p-values are calculated, one for the set of categories formed by merging the missing category with its most similar category, and the other for the set of categories formed by adding the missing category as a separate category. Take the action that gives the smallest p-value.

After reading the PDF, I am confused about where in your project structure to add the handling of ordinal features with missing values (as described in point 2).

Could you give some advice on how to implement it? Would you consider addressing these two problems in a later version?

Thank you again for reading this issue.

Rambatino commented 3 years ago

I have read that. We used that to construct parts of this algorithm.

So here is where the p_value is calculated: https://github.com/Rambatino/CHAID/blob/master/CHAID/stats.py#L150

I don't really have time to implement it, but ideally there'd be a configuration parameter, --bonferroni or whatever, and when that is present, it would apply the Bonferroni adjustment to that p_value. The rest of the algorithm should just carry on as normal.
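As a rough illustration of the suggestion above, the adjustment could be a small helper that is only active when the option is switched on. This is a minimal sketch, not code from the CHAID package: the function name `adjust_p_value`, the `multiplier` argument, and the `bonferroni` keyword are all hypothetical.

```python
def adjust_p_value(p_value, multiplier, bonferroni=False):
    """Return the p-value, optionally scaled by a Bonferroni multiplier.

    Hypothetical helper: the name and arguments are illustrative only and
    are not part of the CHAID package.
    """
    if not bonferroni:
        # Option not set: the algorithm carries on with the raw p-value.
        return p_value
    # A probability cannot exceed 1, so cap the adjusted value.
    return min(1.0, p_value * multiplier)
```

With the flag off the value passes through unchanged, so the rest of the algorithm would not need to know whether the adjustment is in use.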

binkmust commented 3 years ago

Thank you for the pointer to https://github.com/Rambatino/CHAID/blob/master/CHAID/stats.py#L150.

What about the following process:

  1. For ordinal predictors, the algorithm first generates the best set of categories using all non-missing information from the data. Next, the algorithm identifies the category that is most similar to the missing category. Finally, the algorithm decides whether to merge the missing category with its most similar category or keep the missing category as a separate category. Two p-values are calculated, one for the set of categories formed by merging the missing category with its most similar category, and the other for the set of categories formed by adding the missing category as a separate category. Take the action that gives the smallest p-value.

Can you give me some advice on implementing this in your project structure?
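The process quoted above could be sketched roughly as follows. This is only an illustration under several assumptions: the non-missing categories have already been merged into `groups`, each group and the missing category are given as rows of counts per dependent-variable class, "most similar" is taken to mean the group whose pairwise chi-square test against the missing row has the largest p-value, and all row and column totals are non-zero. The names `chi2_sf`, `chi2_p_value`, and `place_missing` are hypothetical and do not exist in the CHAID package, and the p-values compared here are unadjusted.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail probability of the chi-square distribution (stdlib only)."""
    x = stat / 2.0
    if df % 2:  # odd degrees of freedom start from Q(1/2, x) = erfc(sqrt(x))
        a, q = 0.5, math.erfc(math.sqrt(x))
    else:       # even degrees of freedom start from Q(1, x) = exp(-x)
        a, q = 1.0, math.exp(-x)
    while a < df / 2.0:
        # Recurrence Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1)
        q += x ** a * math.exp(-x) / math.gamma(a + 1)
        a += 1
    return q


def chi2_p_value(rows):
    """Pearson chi-square p-value for a contingency table (list of count rows)."""
    col_totals = [sum(col) for col in zip(*rows)]
    total = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            stat += (observed - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return chi2_sf(stat, df)


def place_missing(groups, missing):
    """Decide whether the missing category is merged or kept separate."""
    # Most similar group: the one least distinguishable from the missing row.
    similar = max(range(len(groups)),
                  key=lambda i: chi2_p_value([groups[i], missing]))
    merged = [
        [a + b for a, b in zip(g, missing)] if i == similar else list(g)
        for i, g in enumerate(groups)
    ]
    separate = [list(g) for g in groups] + [list(missing)]
    p_merged = chi2_p_value(merged)
    p_separate = chi2_p_value(separate)
    # Take the action that gives the smallest p-value.
    if p_merged <= p_separate:
        return "merged", merged, p_merged
    return "separate", separate, p_separate


# Example: the missing row resembles the first group far more than the second.
action, table, p = place_missing([[30, 10], [10, 30]], [28, 12])
```

In the package itself the p-values would presumably come from the existing code in stats.py rather than a hand-rolled chi-square, so the test statistic here is a stand-in.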

Rambatino commented 3 years ago

Hmm I dunno if it needs to be that complex.

Given this:

Suppose that a predictor variable originally has I categories, and it is reduced to r categories
after the merging step. The Bonferroni multiplier B is the number of possible ways that I
categories can be merged into r categories. For r = I, B = 1. For 2 ≤ r < I, use the following
equation.

[image: equation for the Bonferroni multiplier B]

As long as you can calculate B, you can multiply the p-value by it to adjust it, right?
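The equation image is not reproduced in this thread. As a hedged sketch, the multipliers usually given for CHAID (following Kass, 1980) are the number of ways to cut an ordered list into contiguous groups for an ordinal predictor, a Stirling number of the second kind for a nominal predictor, and a two-term count for an ordinal predictor with a floating missing category. These formulas should be checked against the PDF before relying on them; the function name `bonferroni_multiplier` and the `kind` labels are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def bonferroni_multiplier(I, r, kind="ordinal"):
    """Number of ways to merge I original categories into r categories.

    Hypothetical helper; the formulas follow the commonly cited CHAID
    definitions and should be verified against the referenced PDF.
    """
    if r == I:
        return 1  # nothing was merged, so no adjustment is needed
    if not 2 <= r < I:
        raise ValueError("expected 2 <= r <= I")
    if kind == "ordinal":
        # Choose r - 1 cut points among the I - 1 gaps between categories.
        return comb(I - 1, r - 1)
    if kind == "nominal":
        # Stirling number of the second kind, summed exactly with fractions.
        total = sum(
            Fraction((-1) ** v * (r - v) ** I, factorial(v) * factorial(r - v))
            for v in range(r)
        )
        return int(total)
    if kind == "ordinal_missing":
        # Missing kept separate, plus missing merged into any of r groups.
        return comb(I - 2, r - 2) + r * comb(I - 2, r - 1)
    raise ValueError("unknown predictor kind: " + repr(kind))


# Adjusted p-value: multiply the raw p-value by B and cap it at 1.
raw_p = 0.01
adjusted_p = min(1.0, raw_p * bonferroni_multiplier(4, 2, kind="nominal"))
```

For example, four nominal categories merged into two give B = 7, so a raw p-value of 0.01 would be adjusted to about 0.07.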