Open olivierlabayle opened 1 year ago
Some predicted values are indeed equal to 1 and some very close to zero. But that doesn't always indicate a problem. R prints a warning even though the model can sometimes be interpreted just fine, while in GLM.jl we tend to avoid printing warnings. Instead, if there's a problem you should be able to spot them due to very large standard errors on problematic variables I think. You can find more details by searching for the R error message in a search engine.
Thank you for your reply, I think I've made the situation a bit more clear now. Unless I'm missing something, I think the following code illustrates how the logloss reached by a GLM (in this specific example) is way worse than that of a constant classifier.
This is using MLJ for convenience as I wasn't sure how to extract the pdf from a glm model.
using CSV
using DataFrames
using MLJBase
using MLJModels
using MLJGLMInterface
data = CSV.read("glm_pb_dataset.csv", DataFrame)
X = data[!, [:covariate, :offset]]
y = categorical(data.y)
# Dummy mean classifier logloss
mach_const = machine(
ConstantClassifier(),
X,
y
)
fit!(mach_const)
ll_const = mean(log_loss(MLJBase.predict(mach_const), y))
# Fit using MLJGLMInterface
mach_glm = machine(
LinearBinaryClassifier(fit_intercept=false, offsetcol=:offset),
X,
y
)
fit!(mach_glm)
ll_glm = mean(log_loss(MLJBase.predict(mach_glm), y))
# What would be the logloss if the coefficient is set to 0?
mach_glm.fitresult[1].coefs[1] = 0.0
ll_glm_0 = mean(log_loss(MLJBase.predict(mach_glm), y))
println(ll_const)
println(ll_glm)
println(ll_glm_0)
Output:
0.6904066761889491
17.256097466399517
0.22942599075747547
I also join the dataset:
log_loss_pb_dataset.csv Hi,
This may be a bit of a long shot. I've noticed that on a specific dataset (attached) if I run a GLM with both GLM.jl and the R glm library, R will complain about: "glm.fit: fitted probabilities numerically 0 or 1 occurred", but GLM.jl will not. Is there anything particular which is done here to solve that issue, or is it just ignored?
Julia code:
R code: