JuliaStats / GLM.jl

Generalized linear models in Julia

Numerical stability warning #522

olivierlabayle commented 1 year ago

Hi,

Attached dataset: log_loss_pb_dataset.csv

This may be a bit of a long shot. I've noticed that on a specific dataset (attached), if I fit the same GLM with both GLM.jl and R's glm, R complains: "glm.fit: fitted probabilities numerically 0 or 1 occurred", but GLM.jl does not. Is anything in particular done here to address that issue, or is it simply not reported?

Julia code:

using CSV, GLM, DataFrames

# Load the attached dataset
data = CSV.read("log_loss_pb_dataset.csv", DataFrame)

# No-intercept logistic regression with a fixed offset
f = @formula(y ~ 0 + covariate)
glm(f, data, Bernoulli(); offset=data.offset)

R code:

# Read the same dataset (y is parsed as a factor)
data <- read.csv(
    file = "/Users/olivierlabayle/Dev/TMLE.jl/test/log_loss_pb_dataset.csv",
    stringsAsFactors = TRUE
)
# Recode the factor levels to 0/1
data$y <- as.numeric(data$y) - 1
res <- glm(y ~ 0 + covariate, family = binomial(), data = data, offset = data$offset)
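From a quick look at R's glm.fit, the warning seems to come from a simple post-fit check on the fitted means rather than from a different fitting algorithm. A minimal sketch of the equivalent check written against GLM.jl's API (the 10 * eps threshold mirrors the heuristic in R's glm.fit):

using CSV, GLM, DataFrames

data = CSV.read("log_loss_pb_dataset.csv", DataFrame)
model = glm(@formula(y ~ 0 + covariate), data, Bernoulli(); offset=data.offset)

# R warns when any fitted mean is within ~10*eps of 0 or 1
μ̂ = fitted(model)
ε = 10 * eps(Float64)
if any(μ̂ .< ε) || any(μ̂ .> 1 - ε)
    @warn "fitted probabilities numerically 0 or 1 occurred"
end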
nalimilan commented 1 year ago

Some predicted values are indeed equal to 1 and some are very close to zero. But that doesn't always indicate a problem. R prints a warning even though the model can often be interpreted just fine, whereas in GLM.jl we tend to avoid printing warnings. Instead, if there's a real problem you should be able to spot it from very large standard errors on the problematic variables, I think. You can find more details by searching for the R warning message in a search engine.
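For example, something like this should surface the problem (a sketch, reusing the attached dataset; coeftable and stderror are the standard accessors for a fitted GLM):

using CSV, GLM, DataFrames

data = CSV.read("log_loss_pb_dataset.csv", DataFrame)
model = glm(@formula(y ~ 0 + covariate), data, Bernoulli(); offset=data.offset)

# Quasi-separation typically shows up as large coefficients paired
# with even larger standard errors
coeftable(model)   # coefficients, std. errors, z statistics, p values
stderror(model)    # the standard errors alone, as a vector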

olivierlabayle commented 1 year ago

Thank you for your reply. I think I can now make the situation a bit clearer. Unless I'm missing something, the following code shows that the log loss reached by the GLM on this specific dataset is far worse than that of a constant classifier.

I'm using MLJ here for convenience, as I wasn't sure how to extract the predicted probabilities from a GLM.jl model.


using CSV
using DataFrames
using MLJBase
using MLJModels
using MLJGLMInterface
using Statistics

data = CSV.read("glm_pb_dataset.csv", DataFrame)
X = data[!, [:covariate, :offset]]
y = categorical(data.y)

# Baseline: a dummy classifier that always predicts the empirical
# class frequencies
mach_const = machine(
    ConstantClassifier(),
    X,
    y
)
fit!(mach_const)
ll_const = mean(log_loss(MLJBase.predict(mach_const), y))

# Fit the GLM through MLJGLMInterface
mach_glm = machine(
    LinearBinaryClassifier(fit_intercept=false, offsetcol=:offset),
    X,
    y
)
fit!(mach_glm)
ll_glm = mean(log_loss(MLJBase.predict(mach_glm), y))

# What would the log loss be if the coefficient were set to 0?
mach_glm.fitresult[1].coefs[1] = 0.0
ll_glm_0 = mean(log_loss(MLJBase.predict(mach_glm), y))

println(ll_const)
println(ll_glm)
println(ll_glm_0)

Output:

0.6904066761889491    # ll_const: constant classifier
17.256097466399517    # ll_glm: fitted GLM
0.22942599075747547   # ll_glm_0: coefficient forced to 0
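For reference, here is a sketch of how I believe the same average log loss could be computed directly from a GLM.jl fit, without MLJ (assuming y parses from the CSV as 0/1 or Bool; fitted should return the estimated Bernoulli probabilities):

using CSV, GLM, DataFrames, Statistics

data = CSV.read("glm_pb_dataset.csv", DataFrame)
model = glm(@formula(y ~ 0 + covariate), data, Bernoulli(); offset=data.offset)

# fitted(model) is μ̂, the estimated P(y = 1) per observation
μ̂ = fitted(model)
y = data.y  # assumes y is read as 0/1 (or Bool)

# Average per-observation log loss; fitted values numerically equal
# to 0 or 1 on misclassified rows are exactly what makes this blow up
ll_glm = -mean(@. y * log(μ̂) + (1 - y) * log(1 - μ̂))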

I'm also attaching the dataset:

glm_pb_dataset.csv