cheetahbright / tsa-decision-trees

Decision tree implementation on a data set from the Transportation Security Administration.

Reduce Categorical Variables #5

Open malctaylor15 opened 6 years ago

malctaylor15 commented 6 years ago

Create a function to reduce the number of categorical variables created.

We can do something like this: if there are fewer than 10 observations from a category, we regroup those observations.

A potential workflow could be:

  1. Group by each level of the variable
  2. Get the count for each level in that category
  3. Create a new level for levels with fewer than x counts. We can set x to be 10 or the 10th decile, whichever is lower
  4. Fill the new level appropriately
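
The workflow above could be sketched roughly as follows, using only the Python standard library. The names `reduce_levels`, `min_count`, and `other_label` are hypothetical, and the airport codes are made-up sample data. The threshold x is left as a plain parameter (default 10); the decile-based alternative is not implemented here.

```python
from collections import Counter


def reduce_levels(values, min_count=10, other_label="Other"):
    """Regroup rare levels of one categorical variable into a single level.

    Hypothetical helper: the name and parameters are assumptions,
    not part of the existing code base.
    """
    values = list(values)  # allow any iterable, since we pass over it twice

    # Count the observations in each level of the variable.
    counts = Counter(values)

    # Collect the levels with fewer than `min_count` observations.
    rare = {level for level, n in counts.items() if n < min_count}

    # Fill the new catch-all level in place of every rare level.
    return [other_label if v in rare else v for v in values]


# Made-up sample data: two common levels and two rare ones.
airports = ["JFK"] * 12 + ["LAX"] * 11 + ["BOS"] * 3 + ["SEA"] * 1
reduced = reduce_levels(airports)
print(Counter(reduced))  # Counter({'JFK': 12, 'LAX': 11, 'Other': 4})
```

Keeping the threshold as a parameter means the same function can be reused per column with a different cutoff, and the number of distinct levels can only shrink or stay the same.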

This is just an idea for a general data cleaning step.