FastForestBinaryClassifier always returns the same prediction

bzn7 commented 6 years ago

System information

OS version/distro: Windows 10 17134.165
.NET Version (eg., dotnet --info): 4.7.2

Issue

What did you do? I tried to use FastForestBinaryClassifier in a learning application, with a bool column as the label.
What happened? PredictedLabel is always false, even though most of my training data is true: about 5000 of the 7000 training labels are true, and the amount of data keeps growing while the application runs. New rows are the real outcomes of previous predictions. The prediction always stays the same as it was the first time, for every kind of prediction input.

Source code / logs

    class MLData
    {
        public class IrisPrediction
        {
            [ColumnName("PredictedLabel")]
            public bool PredictedLabels;
        }

        public class IrisData
        {
            [Column("0", name: "Label")] public bool Label;
            [Column("1")] [VectorType(1000)] public float[] param1;
            [Column("2")] [VectorType(1000)] public float[] param2;
            [Column("3")] [VectorType(1000)] public float[] param3;
            [Column("4")] [VectorType(1000)] public float[] param4;
            [Column("5")] [VectorType(1000)] public float[] param5;
            [Column("6")] [VectorType(1000)] public float[] param6;
        }

        public static List<IrisData> History = new List<IrisData>() { };
    }

    class MLCore
    {
        private static string AppPath => Path.GetDirectoryName(Environment.GetCommandLineArgs()[0]);
        private static string ModelPath => Path.Combine(AppPath, "IrisModel.zip");
        private static PredictionModel<MLData.IrisData, MLData.IrisPrediction> readyModel;

        internal static async Task<PredictionModel<MLData.IrisData, MLData.IrisPrediction>> TrainAsync()
        {
            var data = MLData.History;
            var collection = CollectionDataSource.Create(data);

            var pipeline = new LearningPipeline()
            {
                collection,
                new ColumnConcatenator("Features", "param1","param2", "param3","param4", "param5", "param6"),

                new FastForestBinaryClassifier(),

                new PredictedLabelColumnOriginalValueConverter() { PredictedLabelColumn = "PredictedLabel" }
            };

            PredictionModel<MLData.IrisData, MLData.IrisPrediction> model;

            try
            {
                model = pipeline.Train<MLData.IrisData, MLData.IrisPrediction>();
                await model.WriteAsync(ModelPath);
                PGlobals.learnSuccesfull = true;
            }
            catch (Exception e)
            {
                model = null;
                PGlobals.learnSuccesfull = false;
            }

            return model;
        }

        public static async void Learn()
        {
            readyModel = await TrainAsync();
        }

        public static void Think()
        {
            if (readyModel != null)
            {
                try
                {
                    var prediction = readyModel.Predict(new MLData.IrisData()
                    {
                        param1 = PGlobals.param1,
                        param2 = PGlobals.param2,
                        param3 = PGlobals.param3,
                        param4 = PGlobals.param4,
                        param5 = PGlobals.param5,
                        param6 = PGlobals.param6
                    });

                    PGlobals.predictedResult = prediction.PredictedLabels;
                }

                catch
                {
                    //Nothing
                }
            }
        }
    }

Zruty0 commented 6 years ago

Do you actually have 6000 features? If yes, your [Column] should look like this:

            [Column("1-1000")] [VectorType(1000)] public float[] param1;
            [Column("1001-2000")] [VectorType(1000)] public float[] param2;
            // etc.

WladdGorshenin commented 6 years ago

I have the same issue (ML.NET v0.4): the classifier always returns the same prediction (false).

Regarding the post by @bzn7, I think it makes no difference how many features a training set has; the answers must be different.

Zruty0 commented 6 years ago

Well, if you have 6000 features, but you read them the way @bzn7 does (994 features appear 6 times each), the learner is going to be severely hampered. My guess was that the model that was learned was trivial, and therefore gave the same prediction all the time.
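To make the overlap concrete, here is a minimal sketch in plain C#. It is not ML.NET code: `ColumnOverlapSketch` and its methods are hypothetical names, and it rests on the assumption that a vector field of 1000 slots starting at column k reads the 1000 consecutive columns k through k + 999. Under that assumption it counts how many distinct source columns a given set of start positions actually covers.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of ML.NET.
public static class ColumnOverlapSketch
{
    // ASSUMPTION: a vector field with `length` slots that starts at
    // `firstColumn` reads columns firstColumn .. firstColumn + length - 1.
    public static IEnumerable<int> ColumnsRead(int firstColumn, int length) =>
        Enumerable.Range(firstColumn, length);

    // Counts the distinct source columns covered by all fields together.
    // Overlapping fields contribute the same column more than once, so the
    // distinct count is smaller than fields * length.
    public static int DistinctColumns(IEnumerable<int> firstColumns, int length) =>
        firstColumns
            .SelectMany(start => ColumnsRead(start, length))
            .Distinct()
            .Count();
}
```

With the start positions from the original post, `DistinctColumns(new[] { 1, 2, 3, 4, 5, 6 }, 1000)` yields 1005, so the 6000 feature slots are filled from only 1005 columns. With non-overlapping starts such as `new[] { 1, 1001, 2001, 3001, 4001, 5001 }` it yields 6000.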

I think you are incorrect about this one:

"I think it makes no difference how many features a training set has; the answers must be different."

I would say that if the answers are 'the same all the time', it is unfortunate, but far from uncommon. Here are some factors that can potentially cause this:

Features have no predictive signal in them. In this case the model will learn the priors and output them all the time.
Heavy overfitting on the train set. In this case the testing example will not belong to the area the model has 'studied', and the performance will be arbitrary.
Heavy class imbalance. Especially in multiclass problems, if the classes are heavily imbalanced, the model will predict the majority class in 'far too many' cases.
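A quick way to check for the first and third factors is to compare the model's output against the label prior. The sketch below is plain C# with hypothetical names (`LabelPriorSketch`, `PositiveRate`, `MajorityLabel`, `IsConstant`); it does not call ML.NET and only illustrates the baseline a trivial model would fall back to.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical diagnostic helper for illustration only; not part of ML.NET.
public static class LabelPriorSketch
{
    // Fraction of training labels that are true (the class prior).
    public static double PositiveRate(IReadOnlyCollection<bool> labels)
    {
        if (labels.Count == 0)
            throw new ArgumentException("No labels supplied.", nameof(labels));
        return labels.Count(label => label) / (double)labels.Count;
    }

    // What a constant model that has only learned the prior would predict.
    public static bool MajorityLabel(IReadOnlyCollection<bool> labels) =>
        PositiveRate(labels) >= 0.5;

    // True when every prediction in the batch is identical, which is the
    // symptom reported in this issue.
    public static bool IsConstant(IReadOnlyCollection<bool> predictions) =>
        predictions.Distinct().Count() <= 1;
}
```

For the numbers in the original post (5000 true labels out of 7000), `PositiveRate` is about 0.71 and `MajorityLabel` is true. A model that is constant but predicts false therefore is not simply reproducing the prior, which points toward a problem with how the features are read.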

Ivanidzo4ka commented 6 years ago

DRI RESPONSE: I consider this question answered and intend to close the issue within the next few days, unless someone objects.

bzn7 commented 6 years ago

Do you actually have 6000 features? If yes, your [Column] should look like this:

            [Column("1-1000")] [VectorType(1000)] public float[] param1;
            [Column("1001-2000")] [VectorType(1000)] public float[] param2;
            // etc.

I tried to simplify my features and redefined columns as shown. It is working, thank you @Zruty0.

dotnet / machinelearning