dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Does GetFeatureWeights support categorical splits? #3766

Open rauhs opened 5 years ago

rauhs commented 5 years ago

Version: 1.0

I'm training my multiclass LightGBM model with mostly categorical features (which works great). But when I call GetFeatureWeights on the binary predictors, I get huge values for the feature at index -1. Yet when I inspect the models in the debugger, the trees almost exclusively use categorical features for the splits. It seems that the GainMap doesn't actually consider any categorical splits and just assigns all those gains to index -1, which makes the feature weights vector completely useless in my case.

Is this something that will be supported? Or am I wrong here?

wschin commented 5 years ago

@codemzs, do you have any idea, since you were working on categorical splits?

antoniovs1029 commented 4 years ago

While working on issue #3272 (and PR #5018), it looks to me that that issue is similar in nature to this one.

The problem here is that the gains are computed by the following code, which takes the feature indices from SplitFeatures[node]:

https://github.com/dotnet/machinelearning/blob/8660ecc7742a02ac896dc77c93c67c366ff4647e/src/Microsoft.ML.FastTree/TreeEnsemble/InternalRegressionTree.cs#L1334-L1335

As explained here, the SplitFeatures[] array has "-1" for categorical splits, and it shouldn't be used for such splits (CategoricalSplitFeatures[][] should be used instead). So to solve this issue we should also add code to support categorical features when calculating the GainMap. The problem is that I don't know what the "mathematically correct" way to do it is. I will ask around to find out, and open a PR with that code as well.
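To make the idea concrete, here is a minimal, self-contained sketch of a gain map that handles categorical splits. It does not use ML.NET's internal tree classes: the three input arrays are hypothetical stand-ins, and only the names SplitFeatures and CategoricalSplitFeatures come from the discussion above. Dividing a categorical node's gain evenly among its features is purely an assumption made for illustration, not the confirmed "mathematically correct" rule and not necessarily what ML.NET does.

```csharp
using System;
using System.Collections.Generic;

public static class GainMapSketch
{
    // Hypothetical per-node inputs:
    //   splitFeatures[node]            feature index, or -1 for a categorical split
    //   categoricalSplitFeatures[node] feature indices of a categorical split, else null
    //   splitGains[node]               gain of the split at that node
    public static Dictionary<int, double> ComputeGainMap(
        int[] splitFeatures,
        int[][] categoricalSplitFeatures,
        double[] splitGains)
    {
        var gainMap = new Dictionary<int, double>();
        for (int node = 0; node < splitFeatures.Length; node++)
        {
            int[] categorical = categoricalSplitFeatures[node];
            if (categorical != null && categorical.Length > 0)
            {
                // ASSUMPTION: share the node's gain evenly among the
                // features that take part in the categorical split.
                double share = splitGains[node] / categorical.Length;
                foreach (int feature in categorical)
                    Add(gainMap, feature, share);
            }
            else
            {
                // Ordinary split: the whole gain goes to one feature.
                Add(gainMap, splitFeatures[node], splitGains[node]);
            }
        }
        return gainMap;
    }

    private static void Add(Dictionary<int, double> map, int feature, double gain)
    {
        // TryGetValue leaves current at 0 when the key is missing.
        map.TryGetValue(feature, out double current);
        map[feature] = current + gain;
    }
}
```

With this shape the index -1 never becomes a key, because categorical nodes are read from the categorical array instead of SplitFeatures. Crediting the full gain to every participating feature would be an equally easy variant; which attribution is right is exactly the open question.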

antoniovs1029 commented 4 years ago

Hi, @rauhs. So it seems the solution to this should be very straightforward. Do you happen to have a repro that I could use to test the solution I'm working on, to see if the results are what you would expect? Thanks!

rauhs commented 4 years ago

Thanks for working on this. I don't have a repro right now as I'm on vacation. If absolutely necessary I can provide one next week.

antoniovs1029 commented 4 years ago

Thanks for answering, @rauhs. It's not completely necessary, as creating a model that uses categorical splits is easy. Still, if possible, I would really like to have a repro from your side, since it always helps to get to know how users are using ML.NET 😄 So I would still like to wait until you can provide a repro to test the solution. Thanks!

rauhs commented 4 years ago

With this code I get an index-out-of-bounds exception.

I can't reproduce the "-1" right now using the synthetic data.

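    // Namespaces needed by the snippet below.
    using System;
    using System.Linq;
    using Microsoft.ML;
    using Microsoft.ML.Data;
    using Microsoft.ML.Trainers.LightGbm;

    public class GenericBinaryInstance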
    {
      public string A { get; set; }
      public float Num { get; set; }
      public bool Label { get; set; }
    }

    public static string[] FeatureVector(int card, int total)
    {
      var inner  = Enumerable.Range(1, card).Select(x => x.ToString());
      var repeatCount = (int)Math.Ceiling((double)total / card) + 1;
      return Enumerable.Repeat(inner, repeatCount).SelectMany(x => x).ToArray();
    }

    public static void ReproduceLightGbmGetFeatureWeights()
    {
      var numInstances = 10_000;
      var axs = FeatureVector(4, numInstances);
      var labels = FeatureVector(2, numInstances);
      var rnd = new Random(1);
      var data = Enumerable.Range(1, numInstances).Select(x => new GenericBinaryInstance { A = axs[x], Num = (float)rnd.NextDouble(), Label = labels[x] == "1" });
      var ctx = new MLContext(1);
      var options = new LightGbmBinaryTrainer.Options
      {
        UseCategoricalSplit = true,
        MinimumExampleCountPerLeaf = 1,
        MinimumExampleCountPerGroup = 1,
      };
      options.Booster = new GradientBooster.Options();

      var pipe = ctx.Transforms.Conversion.MapValueToKey("A")
        .Append(ctx.Transforms.Conversion.MapKeyToVector("A"))
        .Append(ctx.Transforms.Concatenate("Features", "Num", "A"));
      var dataView = ctx.Data.LoadFromEnumerable(data);
      var trainer = ctx.BinaryClassification.Trainers.LightGbm(options);
      var encoder = pipe.Fit(dataView);
      var trainEncoded = encoder.Transform(dataView);
      var model = trainer.Fit(trainEncoded);

      var weightsBuf = new VBuffer<float>();
      model.Model.SubModel.GetFeatureWeights(ref weightsBuf);
      var weights = weightsBuf.GetValues().ToArray();
    }
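
A small side check on the synthetic data, which may help when reading the repro: because A cycles through "1".."4" and the label source cycles through "1".."2", Label appears to be fully determined by A (true exactly when A is "1" or "3"), so the trees have every reason to split on the categorical feature. The sketch below copies FeatureVector from the repro and verifies that claim without any ML.NET dependency; the class name and the LabelIsDeterminedByA helper are made up for this illustration.

```csharp
using System;
using System.Linq;

public static class SyntheticDataCheck
{
    // Same helper as in the repro: "1".."card" repeated often enough
    // that index `total` is still inside the array.
    public static string[] FeatureVector(int card, int total)
    {
        var inner = Enumerable.Range(1, card).Select(x => x.ToString());
        var repeatCount = (int)Math.Ceiling((double)total / card) + 1;
        return Enumerable.Repeat(inner, repeatCount).SelectMany(x => x).ToArray();
    }

    // Hypothetical helper: true when, for every index the repro uses,
    // Label == (A is "1" or "3").
    public static bool LabelIsDeterminedByA(int numInstances)
    {
        var axs = FeatureVector(4, numInstances);
        var labels = FeatureVector(2, numInstances);
        return Enumerable.Range(1, numInstances).All(x =>
            (labels[x] == "1") == (axs[x] == "1" || axs[x] == "3"));
    }
}
```

Calling SyntheticDataCheck.LabelIsDeterminedByA(10_000) exercises the same indices (1 through 10,000) as the repro.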