ShifuML / shifu

An end-to-end machine learning and data mining framework on Hadoop
https://github.com/ShifuML/shifu/wiki
Apache License 2.0
251 stars 109 forks

High-Cardinality Categorical Variable (>10000) Processing Issue #708

Open zhangpengshan opened 4 years ago

zhangpengshan commented 4 years ago

Currently, if a categorical variable has more than 10000 categories, the stats step skips it, so no stats result is produced for that variable.

This behavior should be changed to something like:

  1. Keep at most 10000 randomly chosen categories and map all others to the empty category;
  2. Keep the top 10000 categories, using a rough (approximate) count per category.
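For illustration only, option 2 could be sketched as below. This is not Shifu code: it assumes a Space-Saving style counter as one possible way to get "rough" counts in a single pass, and the names `top_k_categories`, `cap_category`, and the `empty` marker are hypothetical. Values outside the kept set fall into the empty category, as in option 1.

```python
def top_k_categories(values, k=10000):
    """Track at most k categories with approximate counts (Space-Saving style).

    Counts may overestimate the true frequency, which is acceptable when
    only a rough count per category is needed.
    """
    counts = {}
    for value in values:
        if value in counts:
            counts[value] += 1
        elif len(counts) < k:
            counts[value] = 1
        else:
            # Evict the category with the smallest count; the newcomer
            # inherits that count plus one (hence the overestimate).
            # The linear scan keeps the sketch short; a heap would be
            # faster for large k.
            victim = min(counts, key=counts.get)
            counts[value] = counts.pop(victim) + 1
    return counts


def cap_category(value, kept, empty=""):
    """Map any category that was not kept to the empty category."""
    return value if value in kept else empty


# Tiny usage example with k=2 instead of 10000.
stream = ["a", "b", "a", "c", "a", "b", "a"]
kept = top_k_categories(stream, k=2)
capped = [cap_category(v, kept) for v in stream]
```

With this tiny stream, the rare value `c` is evicted and ends up in the empty category, while `a` and `b` keep their (approximate) counts.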

zhangpengshan commented 4 years ago

How could we leverage hashing to support high-cardinality categories? That may be an idea.
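
A minimal sketch of the hashing idea, assuming each category string is hashed into a fixed number of buckets so the stats step never sees more than that many distinct values. The function name `hash_bucket`, the choice of MD5, the 10000-bucket default, and the `merchant_` sample values are assumptions for illustration, not anything Shifu provides.

```python
import hashlib
from collections import Counter


def hash_bucket(category, num_buckets=10000):
    """Map a category string to a stable bucket index in [0, num_buckets)."""
    # A cryptographic digest is used only because it is stable across
    # processes and machines, unlike Python's salted built-in hash().
    digest = hashlib.md5(category.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


# 50000 distinct raw categories collapse into at most 10000 buckets.
categories = [f"merchant_{i}" for i in range(50000)]
bucket_counts = Counter(hash_bucket(c) for c in categories)
```

The trade-off is that unrelated categories can collide in the same bucket, so per-bucket stats mix their behavior; in exchange, no category is dropped and no category dictionary has to be stored.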