dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

OLS regression outputs wrong TStats and PValue #5696

Open zyzhu opened 3 years ago

zyzhu commented 3 years ago

System information

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.19042
 OS Platform: Windows
 RID:         win10-x64
 Base Path:   C:\Program Files\dotnet\sdk\5.0.200\

Issue

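The OLS trainer reports a t-statistic of 0 and a p-value of 1 for every coefficient, which does not match the result from R. Here is the equivalent R code: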

df <- data.frame(x = 1:100, y = 1:100*2 + runif(100))
model <- lm(y ~ x, df)
summary(model)

Output of R:

Residuals:
     Min       1Q   Median       3Q      Max 
-0.48638 -0.20409 -0.04365  0.22835  0.52931 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 0.5067878  0.0562763    9.005 1.74e-14 ***
x           1.9994857  0.0009675 2066.691  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2793 on 98 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 4.271e+06 on 1 and 98 DF,  p-value: < 2.2e-16
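
As a cross-check on the R summary above, each t value is the coefficient estimate divided by its standard error (for example, 0.5067878 / 0.0562763 ≈ 9.005 for the intercept). The following is a minimal sketch of that calculation for a single-predictor regression, using only the Python standard library. The function name `ols_simple` and the fixed seed are illustrative assumptions, not part of ML.NET or R, and the random data will not reproduce the exact numbers shown above.

```python
import math
import random


def ols_simple(xs, ys):
    """Fit y = a + b*x by least squares.

    Returns a dict mapping each term to (estimate, standard error, t statistic).
    Hypothetical helper for illustration only.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual variance, using n - 2 degrees of freedom (two fitted parameters).
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sigma2 = rss / (n - 2)

    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1.0 / n + mean_x ** 2 / sxx))

    return {
        "intercept": (intercept, se_intercept, intercept / se_intercept),
        "x": (slope, se_slope, slope / se_slope),
    }


# Same shape of data as the R example: y = 2x + uniform noise in [0, 1).
random.seed(0)  # assumed seed, only to make the sketch repeatable
xs = [float(v) for v in range(1, 101)]
ys = [2.0 * x + random.random() for x in xs]

result = ols_simple(xs, ys)
for name, (est, se, t) in result.items():
    print(f"{name}: estimate={est:.4f}, se={se:.4f}, t={t:.3f}")
```

With data of this form the slope's t statistic is in the thousands, so a reported t statistic of exactly 0 cannot be correct for either coefficient.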

Source code / logs

The following F# script reproduces the problem. It can also be run in a Jupyter notebook via the .NET Interactive kernel.

#r "nuget: Microsoft.ML"
#r "nuget: Microsoft.ML.Mkl.Components"
open System
open Microsoft.ML
open Microsoft.ML.Data

[<CLIMutable>]
type Factor = {
    [<ColumnName("Label")>]
    y : float32
    intercept: float32
    x : float32
}

// Generate data: y = x * 2 + rnd
let rnd = Random()
let rows =
    [1.0 .. 100.0]
    |> Seq.map(fun v ->
        {
            y = float32 (v * 2.0 + rnd.NextDouble())
            intercept = float32 1.
            x = float32 v
        }
    )

let context = new MLContext()
let dataView = context.Data.LoadFromEnumerable(rows)
let pipeline =
    EstimatorChain()
        .Append(context.Transforms.Concatenate("Features", "intercept", "x"))
        .Append(context.Regression.Trainers.Ols())

let model = dataView |> pipeline.Fit
let modelParams = model.LastTransformer.Model
Seq.zip3 modelParams.Weights modelParams.TValues modelParams.PValues
|> Array.ofSeq
|> Array.iteri(fun i (w, t, p) ->
    printfn $"Beta {i}, w: {w:f3}, tStats: {t:f3}, pValue: {p:f3}")

Output

Beta 0, w: 0.005, tStats: 0.000, pValue: 1.000
Beta 1, w: 2.000, tStats: 0.000, pValue: 1.000

As more general feedback, the ceremony in ML.NET is very heavy compared to the simplicity of the R sample above. I do not expect users from the R/Python community to embrace this complexity. The library seems to be designed with only software engineers in mind. Maybe there is a balance to be found between R/Python and .NET.

michaelgsharp commented 3 years ago

Does this only happen with OLS?

zyzhu commented 3 years ago

I assume the concept of PValue for hypothesis testing is only valid in linear regression. Git blame points to the implementation here https://github.com/dotnet/machinelearning/blame/b7901957eeb7d328d9f36a6bf2386040e048949c/src/Microsoft.ML.HalLearners/OlsLinearRegression.cs#L323
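
For reference, the textbook relationship between the t-statistic and the p-value in linear regression can be written as below. This is the standard definition under the usual linear-model assumptions, not a description of what the linked ML.NET code actually computes. Here $n$ is the number of observations, $k$ the number of estimated coefficients including the intercept, and $F_{t,\,n-k}$ the CDF of Student's t distribution with $n-k$ degrees of freedom.

```latex
t_j = \frac{\hat{\beta}_j}{\operatorname{SE}(\hat{\beta}_j)},
\qquad
p_j = 2\left(1 - F_{t,\,n-k}\bigl(|t_j|\bigr)\right)
```

In the R example above, $n - k = 100 - 2 = 98$, which matches the reported 98 degrees of freedom.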