OLS regression outputs wrong TStats and PValue

zyzhu commented 3 years ago

System information

OS version/distro: Windows 10
.NET Version (e.g., dotnet --info):

.NET SDK (reflecting any global.json):
 Version:   5.0.200
 Commit:    70b3e65d53

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.19042
 OS Platform: Windows
 RID:         win10-x64
 Base Path:   C:\Program Files\dotnet\sdk\5.0.200\

Issue

What did you do? I tried to use ML.NET to run a stats 101 case to get familiar with the library. The data points are generated so that y = x * 2 + random(). I used the OLS trainer to estimate the slope and printed the t-statistics and p-values of the fitted weights.
What happened? Every p-value came out as 1 and every t-statistic came out as 0.
What did you expect? The p-values should be close to zero and the t-statistic of the slope should be very large.

Here is the equivalent R code:

df <- data.frame(x = 1:100, y = 1:100*2 + runif(100))
model <- lm(y ~ x, df)
summary(model)

Output of R:

Residuals:
     Min       1Q   Median       3Q      Max 
-0.48638 -0.20409 -0.04365  0.22835  0.52931 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 0.5067878  0.0562763    9.005 1.74e-14 ***
x           1.9994857  0.0009675 2066.691  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2793 on 98 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 4.271e+06 on 1 and 98 DF,  p-value: < 2.2e-16
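
As a cross-check that depends on neither ML.NET nor R, the t value in the table above is just the estimate divided by its standard error. The following is a minimal F# sketch of that closed-form calculation for a single regressor. The helper name olsTStats and the seed 42 are illustrative assumptions, not part of ML.NET or of the original script, and the sketch skips the p-value because that needs a Student's t CDF.

```fsharp
open System

// Hypothetical helper (not an ML.NET API): closed-form simple linear regression.
// Returns (name, estimate, standard error, t value) for the intercept and the slope.
let olsTStats (xs: float[]) (ys: float[]) =
    let n = float xs.Length
    let xBar = Array.average xs
    let yBar = Array.average ys
    // Sum of squared deviations of x, and cross-deviations of x and y
    let sxx = xs |> Array.sumBy (fun x -> (x - xBar) ** 2.0)
    let sxy = Array.fold2 (fun acc x y -> acc + (x - xBar) * (y - yBar)) 0.0 xs ys
    let slope = sxy / sxx
    let intercept = yBar - slope * xBar
    // Residual variance estimate with n - 2 degrees of freedom
    let rss = Array.fold2 (fun acc x y -> acc + (y - intercept - slope * x) ** 2.0) 0.0 xs ys
    let sigma2 = rss / (n - 2.0)
    let seSlope = sqrt (sigma2 / sxx)
    let seIntercept = sqrt (sigma2 * (1.0 / n + xBar * xBar / sxx))
    [| ("(Intercept)", intercept, seIntercept, intercept / seIntercept)
       ("x", slope, seSlope, slope / seSlope) |]

// Same shape of data as in the report: y = 2x + uniform noise in [0, 1).
// The seed is an arbitrary assumption so that repeated runs give the same numbers.
let rnd = Random(42)
let xs = [| 1.0 .. 100.0 |]
let ys = xs |> Array.map (fun x -> 2.0 * x + rnd.NextDouble())

for (name, est, se, t) in olsTStats xs ys do
    printfn "%-12s estimate: %.4f  std.error: %.6f  t: %.3f" name est se t
```

The exact numbers depend on the random draw, but the slope's t value should land in the same range as the 2066.691 that R reports, nowhere near 0.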

Source code / logs

The following is the F# script file. You can also run it in a Jupyter notebook via the .NET Interactive kernel.

#r "nuget: Microsoft.ML"
#r "nuget: Microsoft.ML.Mkl.Components"
open System
open Microsoft.ML
open Microsoft.ML.Data

[<CLIMutable>]
type Factor = {
    [<ColumnName("Label")>]
    y : float32
    intercept: float32
    x : float32
}

// Generate data: y = x * 2 + rnd
let rnd = Random()
let rows =
    [1.0 .. 100.0]
    |> Seq.map(fun v ->
        {
            y = float32 (v * 2.0 + rnd.NextDouble())
            intercept = float32 1.
            x = float32 v
        }
    )

// Build the pipeline: concatenate the input columns into "Features", then train OLS
let context = new MLContext()
let dataView = context.Data.LoadFromEnumerable(rows)
let pipeline =
    EstimatorChain()
        .Append(context.Transforms.Concatenate("Features", "intercept", "x"))
        .Append(context.Regression.Trainers.Ols())

// Fit the model, then print each weight with its t-statistic and p-value
let model = dataView |> pipeline.Fit
let modelParams = model.LastTransformer.Model
Seq.zip3 modelParams.Weights modelParams.TValues modelParams.PValues
|> Array.ofSeq
|> Array.iteri(fun i (w, t, p) ->
    printfn $"Beta {i}, w: {w:f3}, tStats: {t:f3}, pValue: {p:f3}")

Output

Beta 0, w: 0.005, tStats: 0.000, pValue: 1.000
Beta 1, w: 2.000, tStats: 0.000, pValue: 1.000

One more piece of general feedback: the ceremony in ML.NET is very heavy compared to the simplicity of the R sample above. I do not expect users from the R/Python community to embrace this complexity. The library seems to be designed with only software engineers in mind. Maybe there is a balance to be found between R/Python and .NET.

michaelgsharp commented 3 years ago

Does this only happen with OLS?

zyzhu commented 3 years ago

I assume the concept of a p-value for hypothesis testing is only valid in linear regression. Git blame points to the implementation here: https://github.com/dotnet/machinelearning/blame/b7901957eeb7d328d9f36a6bf2386040e048949c/src/Microsoft.ML.HalLearners/OlsLinearRegression.cs#L323
