dotnet / machinelearning-samples

Samples for ML.NET, an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

How to do simple linear regression between two variables #211

Closed klaus78 closed 5 years ago

klaus78 commented 5 years ago

I wanted to use the linear regression to model the relationship between two variables (say age and salary where salary depends linearly on age).

I tried to use the class Regression.Trainers.OnlineGradientDescent but I get completely wrong results when I try to predict a salary from a new age value.

Is OnlineGradientDescent the correct class to use?

Many thanks

CESARDELATORRE commented 5 years ago

Can you show your pipeline code? Also, results will depend on the amount of data when training. How many observations do you have for the initial training?

You can also try other algorithms/trainers like:

Or try multiple trainers/algorithms and compare results like we do in the BikeSharing sample:

https://github.com/dotnet/machinelearning-samples/blob/master/samples/csharp/getting-started/Regression_BikeSharingDemand/BikeSharingDemand/BikeSharingDemandConsoleApp/Program.cs

klaus78 commented 5 years ago

This is my code

// Dataset to be put into train.csv (13 lines altogether: a header plus 12 rows):
// age,salary
// 58,65000
// 39,43000
// 25,23000
// 62,70000
// 34,40000
// 19,22000
// 68,75000
// 43,45000
// 21,20000
// 66,67000
// 48,48000
// 24,24000

string trainDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "train.csv");
MLContext mlContext = new MLContext(seed: 0);
TextLoader textLoader = mlContext.Data.CreateTextReader<SalaryData>(hasHeader: true, separatorChar: ',');
IDataView trainingDataView = textLoader.Read(trainDataPath);

var pipeline = mlContext.Transforms.CopyColumns("salary", "Label")
                        .Append(mlContext.Transforms.Concatenate("Features", "age"))
                        .Append(mlContext.Regression.Trainers.OnlineGradientDescent());

var model = pipeline.Fit(trainingDataView);
var predictEngine = model.CreatePredictionEngine<SalaryData, SalaryPrediction>(mlContext);
// we want to predict the salary from inputAge
float inputAge = 65;
var inputSalaryData = new SalaryData() { age = inputAge };
var predicted = predictEngine.Predict(inputSalaryData);
Console.WriteLine("predicted salary = " + predicted.salary);
// result is -1.06e+29, expected is something around 65000
Console.ReadKey();

CESARDELATORRE commented 5 years ago

How many rows do you have for your dataset? Only 12 or 13 rows/observations? I'd say that is a very low number. You usually need at least hundreds of observations so a model can get good quality and accuracy.

Also, you don't need to concatenate the column "age" since you have a single numeric column for the features. You can point to that column when specifying the trainer, like:

var trainer = mlContext.Regression.Trainers.StochasticDualCoordinateAscent(labelColumn: "Label", featureColumn: "age");

I'd need to try that code, but I'd initially say that you need many more observations for a regression model to have significant accuracy. 12 rows/observations is a very, very low number.

klaus78 commented 5 years ago

I tried with the following algorithms and these are the results

// salary prediction for age 65
FastTree                         0
Poisson                          72656.4
StochasticDualCoordinateAscent   52637.33
FastTreeTweedie                  1
FastForest                       NaN
GeneralizedAdditiveMethod        0
OnlineGradientDescent            -1.065749E+29

Only Poisson and StochasticDualCoordinateAscent algorithms provide a somewhat realistic prediction.

Since it is a simple linear regression problem, I would expect OnlineGradientDescent to perform well (it uses linear regression) even if the dataset is very small. So I wonder whether this is an issue or is normal.

shmoradims commented 5 years ago

@klaus78, the default parameters of ML.NET trainers are optimized based on many public datasets. So a small dataset like yours will cause some algorithms to degenerate. For instance, all the decision tree trainers (FastForest, FastTree, FastTreeTweedie) have a parameter minDatapointsInLeaves, which is 10 by default. For a small dataset, that parameter causes the tree ensemble to be empty, hence the results that you see. If you try something like numLeaves = 2, numTrees = 5, minDatapointsInLeaves = 2, you will probably get some reasonable results.

OnlineGradientDescent has a different issue. It needs the features to be normalized. If you add mlContext.Transforms.Normalize() (default mode is MinMax) before your trainer, you should get reasonable results, even with a small dataset.
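
To illustrate why unscaled features can make a gradient-descent learner blow up, here is a minimal sketch in plain C#. It is not ML.NET's OnlineGradientDescent: it is ordinary stochastic gradient descent on squared loss, and the class name NormalizationDemo, the learning rate of 0.1, and the 200 epochs are assumptions chosen only for illustration. The scaling used is textbook min-max, (v - min) / (max - min), which may differ in detail from what mlContext.Transforms.Normalize() does internally.

```csharp
using System;
using System.Linq;

public static class NormalizationDemo
{
    // The 12 observations posted earlier in this thread.
    static readonly double[] Age =
        { 58, 39, 25, 62, 34, 19, 68, 43, 21, 66, 48, 24 };
    static readonly double[] Salary =
        { 65000, 43000, 23000, 70000, 40000, 22000,
          75000, 45000, 20000, 67000, 48000, 24000 };

    // Plain SGD on squared loss for the model y = w * x + b.
    // learningRate and epochs are illustrative assumptions, not ML.NET defaults.
    public static double TrainAndPredict(double[] x, double[] y, double query,
                                         double learningRate = 0.1, int epochs = 200)
    {
        double w = 0.0, b = 0.0;
        for (int e = 0; e < epochs; e++)
        {
            for (int i = 0; i < x.Length; i++)
            {
                double error = (w * x[i] + b) - y[i];
                w -= learningRate * error * x[i]; // step scales with x^2: huge for raw ages
                b -= learningRate * error;
            }
        }
        return w * query + b;
    }

    // Textbook min-max scaling into [0, 1].
    public static double MinMax(double v, double min, double max)
        => (v - min) / (max - min);

    // Trains on the raw ages.
    public static double PredictRaw(double age)
        => TrainAndPredict(Age, Salary, age);

    // Trains on min-max scaled ages and scales the query the same way.
    public static double PredictScaled(double age)
    {
        double min = Age.Min(), max = Age.Max();
        double[] scaled = Age.Select(a => MinMax(a, min, max)).ToArray();
        return TrainAndPredict(scaled, Salary, MinMax(age, min, max));
    }

    public static void Main()
    {
        Console.WriteLine("raw features:    " + PredictRaw(65));
        Console.WriteLine("scaled features: " + PredictScaled(65));
    }
}
```

With raw ages, each update is multiplied by roughly the learning rate times the squared age, so the weights overshoot further on every step and the prediction ends up non-finite. With ages scaled into [0, 1], the same loop stays stable and should give a salary of a plausible magnitude.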

Hope that helps.

klaus78 commented 5 years ago

Thanks for the clarification. I agree that the dataset is very small; however, the problem is also very simple (a linear regression between 2 variables).

It is a linear algebra problem. With R you can solve it with the lm command.

So I assume that there is no equivalent function in ml.net. It is not an issue, it is just that I was trying to do the simplest example possible of regression before using more complex data sets.

shmoradims commented 5 years ago

mlContext.Regression.Trainers.OrdinaryLeastSquares() is the closest to R lm command. Please try it out.
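
For a single feature, the least-squares line that R's lm fits can also be written in closed form: slope = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2) and intercept = mean(y) - slope * mean(x). The following is a minimal sketch of that formula in plain C#, useful as a sanity check on the 12 rows from this thread. It does not use ML.NET, and the SimpleOls class and its Fit and Predict methods are hypothetical names introduced here.

```csharp
using System;
using System.Linq;

public static class SimpleOls
{
    // Closed-form least-squares fit of y = intercept + slope * x.
    public static (double Slope, double Intercept) Fit(double[] x, double[] y)
    {
        if (x.Length != y.Length || x.Length < 2)
            throw new ArgumentException("Need at least two (x, y) pairs of equal length.");

        double meanX = x.Average();
        double meanY = y.Average();

        double sxy = 0.0; // sum of (x - meanX) * (y - meanY)
        double sxx = 0.0; // sum of (x - meanX)^2
        for (int i = 0; i < x.Length; i++)
        {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }

        if (sxx == 0.0)
            throw new ArgumentException("All x values are identical; the slope is undefined.");

        double slope = sxy / sxx;
        return (slope, meanY - slope * meanX);
    }

    // Evaluates the fitted line at x.
    public static double Predict((double Slope, double Intercept) model, double x)
        => model.Intercept + model.Slope * x;

    public static void Main()
    {
        // The 12 observations posted earlier in this thread.
        double[] age = { 58, 39, 25, 62, 34, 19, 68, 43, 21, 66, 48, 24 };
        double[] salary = { 65000, 43000, 23000, 70000, 40000, 22000,
                            75000, 45000, 20000, 67000, 48000, 24000 };

        var model = Fit(age, salary);
        Console.WriteLine("slope = " + model.Slope + ", intercept = " + model.Intercept);
        Console.WriteLine("predicted salary for age 65 = " + Predict(model, 65));
    }
}
```

On this dataset the fitted line predicts a salary of roughly 70,000 for age 65, the same order of magnitude as the values the original poster expected.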

klaus78 commented 5 years ago

mlContext.Regression.Trainers.OrdinaryLeastSquares() is not found in ml.net 0.9.0 with dotnet core 2.2.101 under Windows

To get it, you have to install the Microsoft.ML.HalLearners package. This package is not installed with Microsoft.ML and must be installed separately.

Finally I could use mlContext.Regression.Trainers.OrdinaryLeastSquares() and I now get the expected result. Thanks

CESARDELATORRE commented 5 years ago

Closing this issue as it is stale now.