ShifuML / shifu

An end-to-end machine learning and data mining framework on Hadoop
https://github.com/ShifuML/shifu/wiki
Apache License 2.0

Fix split issue for CategoricalBinning #768

Closed Liu-Delin closed 3 years ago

Liu-Delin commented 3 years ago

This PR fixes how categorical binning strings are split on the field separator.

The issue:

  1. The binning string currently contains 7 fields: c1|c2|...|c7 (the actual separator is 0x0001, not |; I use | here for readability).
  2. The value of c6 is categoricalVals, which for some inputs contains the separator character itself: c6=c6_1|c6_2.
  3. In that case the string becomes c1|c2|...|c6_1|c6_2|c7. Splitting it yields 8 fields, with fields[5] = c6_1 and fields[6] = c6_2, which is wrong: fields[5] should hold all of c6 and fields[6] should hold c7.

Fix:

  1. Change the field order to c1|c2|c3|c4|c5|c7|c6, because only c6 may contain the separator character.
  2. The example above then becomes c1|c2|c3|c4|c5|c7|c6_1|c6_2.
  3. Split the binning string into at most 7 fields. This gives fields[5] = c7 and fields[6] = c6_1|c6_2.
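
The issue and the fix can be reproduced with a minimal Java sketch. This is not the code changed in this PR: the class `BinningSplitSketch` and the helpers `splitAll` and `splitUpTo` are hypothetical names used only for illustration, and the field values are the placeholders from the description above. It relies on the standard `String.split(regex, limit)`, where a positive limit caps the number of fields and leaves any remaining separators in the last field.

```java
import java.util.regex.Pattern;

public class BinningSplitSketch {
    // Field separator described in this PR: the 0x0001 control character.
    public static final String SEP = "\u0001";

    // Unbounded split: every separator starts a new field (the old behavior
    // as described in the issue; hypothetical helper).
    public static String[] splitAll(String line) {
        return line.split(Pattern.quote(SEP), -1);
    }

    // Bounded split: returns at most maxFields fields, so the last field
    // keeps any separators it contains (hypothetical helper).
    public static String[] splitUpTo(String line, int maxFields) {
        return line.split(Pattern.quote(SEP), maxFields);
    }

    public static void main(String[] args) {
        // c6 (categoricalVals) contains the separator itself.
        String c6 = "c6_1" + SEP + "c6_2";

        // Old order: c6 sits in the middle, so its separator shifts c7.
        String oldOrder = String.join(SEP, "c1", "c2", "c3", "c4", "c5", c6, "c7");
        String[] broken = splitAll(oldOrder);
        System.out.println(broken.length + " fields, fields[5]=" + broken[5]
                + ", fields[6]=" + broken[6]);

        // New order: c6 is last and the split is capped at 7 fields.
        String newOrder = String.join(SEP, "c1", "c2", "c3", "c4", "c5", "c7", c6);
        String[] fixed = splitUpTo(newOrder, 7);
        System.out.println(fixed.length + " fields, fields[5]=" + fixed[5]
                + ", fields[6]=" + fixed[6].replace(SEP, "|"));
    }
}
```

Putting the only field that can contain the separator last means the bounded split needs no escaping or quoting of the categorical values.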